idea for music recognition, conversion and composition using artificial neural networks

I had this idea while walking the kids to school. Starting from a simple network that can classify music styles as rock/metal/classical/folk/etc, I think that it would be possible to adapt the same algorithm to convert a music file from one style to another, and even write music from scratch in whatever style you want. And if I’m right, I think it would be very simple to write.

Recognition

This is the simplest task. To recognise the style of a music file, all you need is a feed-forward network with a few thousand inputs, at least one hidden layer, and one output for each style you want to recognise.

A standard data rate for recorded music is 160kbps. That means that every second, there are 10,240 separate wave heights to examine (160*1024/16, assuming 16-bit samples). Of course, you could recognise music at lower bitrates, but let’s use the same setting for the whole process (we’ll want the 160kbps for the later parts).

So, the input layer would need 10,240*n inputs, where n is the number of seconds you want the network to sample in order to determine the style. In some cases (metal/classical), you may get away with sampling just a single second, but for better results, you might want a larger value. I’ll be setting n to 300, so it samples the entire song in most cases. This makes it easier to be accurate about the result, but will also be useful in a later stage.

The output layer needs to have one node per tag you want to measure. For example, you might have an output that measures how “rock” a song is, and another that measures how “baroque” it is. You could use output nodes that return a simple Yes/No result, but there is a good reason to return a more linear certainty instead (which we’ll get to).

The hidden layer needs at least one neuron, obviously, but I don’t think there is any way to say exactly how many it needs, so it would be better to use a network model which grows automatically as it learns (I don’t know the technical term – I just build the things!).
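
To make the sizes concrete, here is a minimal sketch of the shape such a network might have. It assumes 16-bit samples, a fixed hidden layer (rather than the growing model just described), and a made-up list of tags; the 5-second window is only so the sketch stays small enough to actually run.

```javascript
// A minimal sketch of the recognition network's shape - not a trainer.
const SAMPLES_PER_SECOND = 160 * 1024 / 16; // 10,240, as calculated above
const SECONDS = 5;                          // the post uses n = 300; 5 keeps this runnable
const TAGS = ["rock", "metal", "classical", "folk", "baroque"]; // example tags

const INPUTS  = SAMPLES_PER_SECOND * SECONDS;
const HIDDEN  = 16;                         // arbitrary starting size
const OUTPUTS = TAGS.length;

function randomMatrix(rows, cols) {
  return Array.from({ length: rows }, () =>
    Array.from({ length: cols }, () => Math.random() * 0.02 - 0.01));
}

const weightsInputHidden  = randomMatrix(HIDDEN, INPUTS);
const weightsHiddenOutput = randomMatrix(OUTPUTS, HIDDEN);
const sigmoid = x => 1 / (1 + Math.exp(-x));

// samples: an array of INPUTS values in the range -1..1 (the "wave heights").
function classify(samples) {
  const hidden = weightsInputHidden.map(row =>
    sigmoid(row.reduce((sum, w, i) => sum + w * samples[i], 0)));
  const outputs = weightsHiddenOutput.map(row =>
    sigmoid(row.reduce((sum, w, i) => sum + w * hidden[i], 0)));
  // One graded certainty per tag, e.g. how "rock" or "baroque" the clip is.
  return Object.fromEntries(TAGS.map((tag, i) => [tag, outputs[i]]));
}
```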

After building the network, you need to train it. This is the easiest part – you just need a large database of music, and tags for every one of those tunes.

One handy idea: if you’re training a 5 second network (for example), then a 3 minute song has at least 36 completely separate training sets for you to sample – all you need to do is start linking to the inputs at second 0, 1, 2, .5, etc, and the network will see what it thinks (initially) is a completely different data set.
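
A quick sketch of that windowing trick, using the sample rate and window length from the figures above (the step size decides whether the windows overlap):

```javascript
// Carve one long sample array into training windows by shifting the start
// offset, so a single song yields many "different-looking" examples.
function trainingWindows(samples, sampleRate, windowSeconds, stepSeconds) {
  const windowSize = sampleRate * windowSeconds;
  const step = sampleRate * stepSeconds;
  const windows = [];
  for (let start = 0; start + windowSize <= samples.length; start += step) {
    windows.push(samples.slice(start, start + windowSize));
  }
  return windows;
}

// A 3-minute song at 10,240 samples/second with 5-second windows:
// stepping by 5 seconds gives the 36 non-overlapping windows mentioned above;
// stepping by 1 second (or half a second) gives far more, overlapping ones.
```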

After training this for a while, you should be able to run a few seconds of a song through the network and get a fairly accurate reading of how “funk” or “jazz” it is.

Conversion

After figuring out the above, I started thinking of alternative uses for the idea, and one surprising idea took hold.

Let’s say that you have a “folk” song played on guitar and violin. How would you go about making it “metal”? You could start by fuzzing the violin and distorting the guitar, and maybe adding some drums in.

I think it would be possible to write a program which lets you convert a song from one style to another literally at the click of a button.

Remember that I mentioned the output neurons should say how metal/classical/etc a song is, not just whether it is or isn’t.

If the network is written with enough precision, then adjusting one or more of the input values should give a different value in the outputs.

As an example, let’s say you have a folk tune that you want to convert to neo-punk. Adjusting the inputs such that the sounds are more distorted (clipping high values, for example), or faster (shifting later inputs to the left, maybe) might change the tune’s “neo-punk” output from 0.00024 to 0.00025.

If you repeat this over and over (automatically, obviously), discarding changes that reduce the output and keeping changes that increase it, until the “neo-punk” output reaches an acceptable threshold such as 0.9, then you have an automatic way to convert a tune from one style to another.
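
A rough sketch of that loop, assuming a classify() function like the earlier one (returning a tag-to-certainty object) and treating the tune as a flat array of sample values; the mutation shown is just one illustrative choice:

```javascript
// Hill-climbing conversion: make a small random change to the samples,
// keep it if the target tag's certainty goes up, discard it otherwise.
function convertTo(samples, targetTag, classify, threshold = 0.9, maxSteps = 100000) {
  let current = samples.slice();
  let score = classify(current)[targetTag];
  for (let step = 0; step < maxSteps && score < threshold; step++) {
    const candidate = mutate(current);
    const candidateScore = classify(candidate)[targetTag];
    if (candidateScore > score) {   // keep improvements, discard the rest
      current = candidate;
      score = candidateScore;
    }
  }
  return current;
}

// One possible mutation: clip and boost a random stretch of samples so it
// sounds more distorted. Shifting, speeding up, or adding noise are others.
function mutate(samples) {
  const copy = samples.slice();
  const start = Math.floor(Math.random() * copy.length);
  const end = Math.min(start + 1000, copy.length);
  for (let i = start; i < end; i++) {
    copy[i] = Math.max(-0.5, Math.min(0.5, copy[i])) * 2; // crude clip-and-boost
  }
  return copy;
}
```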

I think this has a lot of applications. For example, say you want to convert a piano tune to guitar: you train your network to recognise what piano and guitar tunes sound like, and then simply convert as above!

Composition

This may be the simplest of the lot.

After creating the above programs, try inputting a sound sample of pure static into the conversion program, and tell it to convert the static to piano. I think it would come up with some interesting tunes. Maybe not completely accurate tunes, but they would be interesting.
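
Reusing the hypothetical convertTo() and classify() sketches from above (and assuming the network was trained with a “piano” output), composition would amount to little more than:

```javascript
// "Composition", in this scheme, is just conversion starting from noise.
const seconds = 5;
const sampleRate = 10240;
const staticNoise = Array.from({ length: seconds * sampleRate },
                               () => Math.random() * 2 - 1);
const newTune = convertTo(staticNoise, "piano", classify);
```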

I think the network would automatically learn rules about harmony and rhythm, but I don’t think it would learn about structure. For example, you could train a network to recognise a 3/4 rhythm, but I don’t know if you could write something that recognises a fugue.

ToDo

List of things off the top of my head that I want to do:

  • write a book. already had a non-fiction book published, but I’d love to have an interesting and compelling original fiction idea to write about. I’m working on a second non-fiction book at the moment.
  • master a martial art. I have a green belt in Bujinkan Taijutsu (ninja stuff, to the layman), but that’s from ten years ago – found a Genbukan teacher only a few days ago so I’ll be starting that up soon (again, ninja stuff).
  • learn maths. A lot of the stuff I do involves guessing numbers or measuring. it’d be nice to be able to come up with formulas to generate optimal solutions.
  • learn electronics. what /is/ electricity? what’s the difference between voltage and amperage? who knows… I’d like to.
  • create a robot gardener. not just a remote-control lawn-mower. one that knows what to cut, what to destroy, that can prune bushes, till the earth, basically everything that a real gardener does.
  • rejuvenate, or download to a computer, whichever is possible first. science fiction, eh? you wait and see…
  • create an instrument. I’m just finishing off a clavichord at the moment. when that’s done, I think I’ll build another one, based on all the things I learned from the first. followed by a spinet, a harpsichord, a dulcimer, and who knows what else.
  • learn to play an instrument. I’m going for grades 2 and 3 in September for piano. I can play guitar pretty well, but would love to find a classical teacher.
  • write a computer game. I have an idea, based on Dungeon Keeper, for a massively multiplayer game. maybe I’ll do it through facebook…
  • write programs to:
    • take a photo of a sudoku puzzle and solve it. already wrote the solver.
    • take a photo of some sheet music and play it.
    • show some sheet music on screen, compare to what you’re playing on a MIDI keyboard, and mark your effort.
    • input all the songs you can play on guitar/keyboard. based on the lists of thousands of people, rate all these songs by difficulty, to let you know what you should be able to learn next.
    • input a job and your location. have other people near you auction themselves to do the job for you. or vice versa: input your location, and find all jobs within walking distance to you where you can do an odd job for some extra cash (nearly there: http://oddjobs4locals.com).
    • take a photo and recognise objects in it (partly done)
    • based on the above, but which can also be corrected and will learn from the corrections (also partly done)
  • stop being damned depressed all the time.

There’s probably a load of other stuff, but that’s all I’ve got at the moment!

visual sudoku solver

The plan for this one is that, if you’re doing a sudoku puzzle in the pub or on the train, and you get stuck, you just take a snapshot of the puzzle with your camera-phone, send the photo to a certain mobile number, and a few seconds later the solution is sent back as an SMS message. The solution costs you something small – 50 cents, maybe.

How it works is that the photo gets sent (via MMS) to a PHP server, which processes the photo in a small number of steps:

  1. Find the corners of the puzzle.
  2. Extract the numbers and identify them.
  3. Also identify whether the numbers were printed or hand-written.
  4. Solve the puzzle with the printed numbers.

In the above, the only difficult step is actually the first one. I’m still trying to figure that one out. I think it will involve a bit of maths, which should be interesting.

For identifying the numbers, you just need a small neural net. This is much easier than full-blown OCR, because in OCR, you have the additional problem of trying to identify what is a letter, what’s a word, and what’s empty space. In the Sudoku puzzle, though, there is a well-defined grid, and a certainty that each square contains just one number to be identified.
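
As a sketch of what step 2 boils down to once the grid has been found and straightened – the classifyDigit function here stands in for the small neural net described above:

```javascript
// Slice a cropped, deskewed puzzle image into its 81 cells and classify each.
// greyPixels: a width*height array of 0..255 grey values for the puzzle.
// classifyDigit(cellPixels): assumed to return 1-9, or 0 for an empty cell.
function readGrid(greyPixels, width, height, classifyDigit) {
  const cellW = Math.floor(width / 9);
  const cellH = Math.floor(height / 9);
  const grid = [];
  for (let row = 0; row < 9; row++) {
    const cells = [];
    for (let col = 0; col < 9; col++) {
      const cell = [];
      for (let y = 0; y < cellH; y++) {
        const offset = (row * cellH + y) * width + col * cellW;
        for (let x = 0; x < cellW; x++) cell.push(greyPixels[offset + x]);
      }
      cells.push(classifyDigit(cell)); // one digit (or blank) per cell
    }
    grid.push(cells);
  }
  return grid;
}
```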

Identifying whether a number was handwritten or printed should also be easy – the colour of the figure will give it away. The colour of the puzzle frame will match the printed numbers, but will be slightly off from the hand-written figures. Of course, it won’t be as simple as that in practice, but that’s what I’ll be testing first.
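
A first-pass sketch of that colour test – the tolerance is a guess that would need tuning against real photos:

```javascript
// Compare a digit's average ink colour against the colour of the grid lines.
// If they are close, assume the digit was printed; otherwise, hand-written.
// inkPixels and framePixels are arrays of [r, g, b] values sampled from the photo.
function looksPrinted(inkPixels, framePixels, tolerance = 30) {
  const average = pixels => pixels
    .reduce((acc, [r, g, b]) => [acc[0] + r, acc[1] + g, acc[2] + b], [0, 0, 0])
    .map(sum => sum / pixels.length);
  const ink = average(inkPixels);
  const frame = average(framePixels);
  const distance = Math.hypot(ink[0] - frame[0], ink[1] - frame[1], ink[2] - frame[2]);
  return distance < tolerance;
}
```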

Actually solving the puzzle is simple – I’ve already written a JavaScript sudoku solver, so it’s just a matter of porting it to PHP.

lesson learned – recurrent networks are sequential

Okay – I was watching paint dry (all right, watching my neural network learning) today, and something surprising happened about an hour into the sequence – suddenly, instead of getting the tests 90%+ correct, it was getting them almost 100% incorrect!

It took me a few moments to realise what had happened – for some reason, I had decided to try teaching the network the difference between two objects for a lot of sessions before adding a third ingredient. I thought that would teach the network to know the difference between them very finely.

It turns out, though, that the network was smarter than me. It learned that the answer to the current test is… the exact opposite of whatever the previous answer was. The damn thing learned to predict my tests!

So, what I’ve done to correct this is that at the beginning of any test, I now erase the neural memory in the network. In effect, I’m whacking the network on the side of the head so it can’t learn to predict my sequences and must learn the actual photos instead.
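
In code, the fix is roughly this – the neuron fields and the evaluate() call are stand-ins for however the real network stores its state:

```javascript
// "Whack the network on the side of the head": wipe any remembered
// activations before each test, so the net can't learn the test order.
function eraseMemory(neurons) {
  for (const neuron of neurons) {
    neuron.activation = 0;         // current output
    neuron.previousActivation = 0; // recurrent state carried between steps
  }
}

function runTest(network, test) {
  eraseMemory(network.neurons);    // every test now starts from a blank slate
  return network.evaluate(test.inputs);
}
```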

ANN progress

Last week, I posted a list of goals for my neural network for that weekend. I’ve managed most of them except for the storing of the network.

Didn’t do it last weekend because I was just plain exhausted. After writing the post, I basically went to sleep for the rest of the weekend.

demo (Firefox only)

The demo is not straightforward. Usually, I create a demo which is fully automatic, but in this case you need to work at it: enter an image URL (JPG-only for now), a name (or names) for a neuron which the photo should answer “yes” to, and similar for “no”, then click “add image”.

The network then trains against that image and will alert you when it’s done. The first one should complete pretty much instantly. Subsequent ones are slower to learn (in some cases, /very/ slow) as they need to fit into the network without breaking the existing learned responses.

I’ve put a collection of images here. The way I do it is to insert one image from Dandelion, enter “dandelion” for yes and “grass,foxtail” for no. Then when that’s finished, take an image from Foxtail and adapt the neuron names, then take one from Grass, and similar. Over time, given enough inputs, I think the network would train well enough to learn new images almost as fast as the first one.

Onto the problems… I spent all weekend on this. The first problem is the sheer number of inputs – with a 160×120 image, there are 19,200 pixels. With RGB channels, that’s 57,600 inputs!

So the first solution was to read in the RGB channels and convert them to greyscale – a simple calculation – which cut the inputs by two thirds.
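
The “simple calculation” is just averaging the three channels, something along these lines (working on the RGBA data a canvas hands back):

```javascript
// Collapse RGBA canvas data (4 bytes per pixel) into one grey value per
// pixel, cutting 57,600 channel values down to 19,200 inputs.
function toGreyscale(rgba) {
  const grey = new Array(rgba.length / 4);
  for (let i = 0; i < grey.length; i++) {
    const r = rgba[i * 4], g = rgba[i * 4 + 1], b = rgba[i * 4 + 2];
    grey[i] = (r + g + b) / 3 / 255; // simple average, scaled to 0..1
  }
  return grey;
}
```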

Then there was the speed – JavaScript is pretty fast these days, but it’s nowhere near as fast as C, Java or even Flash. If a loop runs for too long, an annoying pop-up appears saying that your browser appears to have stopped responding. So, I needed to cut the training algorithms apart so they worked in a more “threaded” fashion. This involved replacing some for(...) loops with setTimeout calls.
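
The pattern looks roughly like this – a long for(...) loop becomes a series of small batches that hand control back to the browser between them (the chunk size is arbitrary):

```javascript
// Run totalCycles training cycles without freezing the browser: do a small
// batch, then yield with setTimeout before scheduling the next batch.
function trainInChunks(trainOneCycle, totalCycles, chunkSize, onDone) {
  let done = 0;
  function nextChunk() {
    const limit = Math.min(done + chunkSize, totalCycles);
    for (; done < limit; done++) trainOneCycle();
    if (done < totalCycles) setTimeout(nextChunk, 0); // give the browser a breather
    else onDone();
  }
  nextChunk();
}
```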

Another JS problem had to do with the <canvas>. There is no “getPixel()” yet for Canvas, so I get to rely on getImageData which returns raw data. According to the WhatWG specs, you cannot read the pixels of a Canvas object which has ever held an image from an external server. So, I needed to create a simple Proxy script to handle that. (just thought of a potential DOS for that – will fix in a few minutes)
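
The pixel-reading part looks something like the following; the proxy script name is hypothetical, and it only exists to serve the image from the same origin so that getImageData is allowed to run:

```javascript
// Draw an image onto a <canvas> and read its pixels with getImageData.
// If the image came straight from another server, the canvas is "tainted"
// and getImageData refuses to hand over the data - hence the proxy.
function readPixels(url, width, height, callback) {
  const img = new Image();
  img.onload = function () {
    const canvas = document.createElement("canvas");
    canvas.width = width;
    canvas.height = height;
    const ctx = canvas.getContext("2d");
    ctx.drawImage(img, 0, 0, width, height);
    callback(ctx.getImageData(0, 0, width, height).data); // RGBA bytes
  };
  img.src = "proxy.php?url=" + encodeURIComponent(url); // hypothetical proxy script
}
```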

Another problem had to do with the network topology. Up until now, I was using networks where there were only inputs and outputs – no hidden neurons. The problem is that when you train a network to recognise something in a photo based on the actual image values, you are relying on the exact pixels, and possibly losing out on quicker tricks such as combining inputs from neurons which say “is there a circle” or “are there parallel lines”. Also, networks where all neurons can read from all inputs are slow. A related problem here is that it is fundamentally impossible to know exactly what shortcut neurons would give you the right answer (otherwise there would be no need for a neural network!).

So, I needed to separate the inputs from the outputs with a bunch of hidden neurons. The way I did this was to have a few rules – inputs connect to nothing, hidden neurons connect to everything, outputs connect to everything except inputs.
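
Those three rules, written out – the neuron objects are simplified stand-ins, and I’ve assumed a neuron does not read from itself:

```javascript
// Wire up the network: inputs connect to nothing, hidden neurons connect to
// everything (except themselves), outputs connect to everything except inputs.
function connect(inputs, hiddenUnits, outputs) {
  const everything = [...inputs, ...hiddenUnits, ...outputs];
  for (const i of inputs)  i.sources = [];
  for (const h of hiddenUnits) h.sources = everything.filter(n => n !== h);
  for (const o of outputs) o.sources = [...hiddenUnits, ...outputs].filter(n => n !== o);
}
```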

The hope with this is that the outputs would read from the hidden neurons, which would pick up on some subtle undefinable traits they spot in the inputs. So the outputs would be almost like a very simple network “overlaid” on the more complex hidden neurons.

Problem is – how do you train these hidden units in a meaningful way? I’m still working on this… Right now, it’s based on “expectation” – the output neuron, when it produces a result, is either right or wrong. If the neuron is then to be trained, it tells the hidden units it relies on what output it expected from them in order to come to the correct conclusion. The hidden units wait until a number of these corrections have been pointed out to them, then adjust accordingly. (two thoughts just occurred to me – 1; try adjusting the hidden neurons after every ‘n’ corrections (instead of at arbitrary points), and 2; only correct a hidden unit if the output neuron’s result was incorrect.)

Another thing was that there is no way of knowing how many hidden units are actually required for the network to produce the right results, so the way I get around this is to train the network and, every (hidden units * 10) cycles, if a solution has still not been found, add another hidden unit. This means that at the beginning of training, hidden units will be added pretty regularly, but as the network matures, new neurons will be added more and more rarely.
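
As a sketch, assuming an addHiddenUnit() helper that wires a new neuron in according to the rules above, and a solved() check for whatever counts as success:

```javascript
// Grow the hidden layer while training: if no solution has been found after
// (hidden units * 10) cycles, add another hidden unit and keep going.
function trainWithGrowth(network, runOneCycle, solved) {
  let cyclesSinceGrowth = 0;
  while (!solved(network)) {
    runOneCycle(network);
    cyclesSinceGrowth++;
    if (cyclesSinceGrowth >= network.hiddenUnits.length * 10) {
      network.addHiddenUnit(); // growth slows down as the network matures
      cyclesSinceGrowth = 0;
    }
  }
}
```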

I’ve caught the ANN bug. After spending the weekend on this, I still have more ideas to work on which I’ll probably do during the week’s evenings. Some ideas:

  • Store successful networks after training.
  • Have a number of tests installed from “boot” instead of needing them to be entered manually.
  • Disable the test entry form when in training mode.
  • Work some more on the hidden unit training methodology – it’s still way off.
  • Allow already-entered training sets to be edited (for example, if you want to add “rose” to the “no” list of “grass”).

I think that a good network will have a solid number of hidden neurons which do not change very often. The outputs will rely on feedback from each other as well as those hidden units. For example, “foxtail” could be gleaned from a situation where the “dandelion” is a definite no, and “grass” is a vague yes, along with some input from the hidden units approximating to “whitish blobs at the top” (over-simplification, yes, but I think that’s approximately it).

update It turns out the network learns very, very well when given just two output neurons to learn – I’ve managed to get the network to learn almost every image in “dandelions” and “grass” with only 4 hidden units and 2 output neurons.

today’s ANN goals

I’ve been doing well over the last two weeks – I started with an ANN which can balance a pole, and followed that up by creating a net which could recognise letters.

The plan for today is a bit more ambitious. I’m writing it down here in case it takes longer than a day to write it. In general, I want to write a PHP application which will allow you to upload images, which the ANN will then try to recognise. If it gets it right, all well and good. If it gets it wrong, you can correct it.

Some milestones for the project:

  • readers can upload images and have them tested and/or added to the training sequence
  • extraction of image pixels using <canvas>
  • automatic creation of new neurons as they are required
  • best net is stored on server, so new readers always start with a working net

I think this may grow into a damned cool thing – I’m already thinking of other cool features like distributed nets, or background nets which can be placed in other pages of a site so the thing can continue training even though the user is not actively viewing the thing (that might be a bit cheeky though).

Anyway – now that I’ve written what I intend to do, I suppose I’d better actually do it.

learning how to learn

Yesterday’s attempts at ANN training seemed at first to be successful, but I had overlooked one simple but curious flaw – training was going on all the time, and each test was run 10 times… This meant that each neuron kept trying different values until it got the right one, then it would be the next neuron’s turn (a bit of a simplified answer, but I don’t know how to describe what was actually happening). This ended up making the tests look a lot more successful than they actually were.

This morning, I did a lot of work figuring out how to get around the problem. It turns out the problem is not with the neural network – that appears to be working perfectly. The problem is with the method of training.

Just like with people, you cannot just throw a net into a series of 26 tests it has never seen before and expect it to learn them any time soon. For any particular neuron, 25 of the tests will have “No” as the answer, and it is too easy for the neuron to just answer “No” to everything and get a 96%+ correct score.

Instead, you need to start with just one test, keep trying until that’s right, then add another test, keep going until they’re both right, then add another test, etc.

Even that was not enough, though – it turns out that the capital letters “B” and “E” are very similar to each other, as are “H” and “M” (at least, in my sequence, they are).

I managed to improve the learning of the neurons by following rules such as these:

  • If a test’s answer is “No” but the neuron says “Yes” (i.e. has a returned value above 0), then adjust the weights of the neuron (correct/punish it).
  • If a test’s answer is “Yes”, and the neuron is anywhere less than 75% certain of Yes, then adjust/reward the neuron.

In all other cases (neuron is certain of Yes and is right, or neuron is vaguely sure of No and is right) leave the neuron’s weights alone.
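
Those rules as a single correction step – I’m assuming outputs range from -1 (definite No) to 1 (definite Yes), taking 0.75 as the 75% mark, with adjustWeights() standing in for the actual correction:

```javascript
// Decide whether a neuron's weights should be adjusted for a given test.
function maybeCorrect(neuron, expectedYes, output, adjustWeights) {
  if (!expectedYes && output > 0) {
    adjustWeights(neuron, -1);       // said Yes when the answer was No: punish
  } else if (expectedYes && output < 0.75) {
    adjustWeights(neuron, +1);       // not confident enough of Yes: reward
  }
  // otherwise (confidently right, or vaguely but correctly No): leave it alone
}
```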

This has helped to avoid the problem where neurons get extremely confident of Yes or No and are hard to correct when a similar test to a previously learned one comes along (O and Q for example).

It’s still not perfect, but perfection takes time…

letter recognition network

Last week, I wrote a neural network that could balance a stick. That was a simple problem which really only takes a single neuron to figure out.

This week, I promised to write a net which could learn to recognise letters.

demo

For this, I enhanced the network a bit. I added a more sensible weight-correction algorithm, and separated the code (ANN code).

I was considering whether hidden units were required at all for this task. I didn’t think so – in my opinion, recognising the letter ‘I’, for example, should depend on information such as “does it look like a J, and not like an M?” – in other words, recognising a letter depends on how confident you are about whether the other values are right or wrong.

The network I chose to implement is, I think, called a “simple recurrent network” with stochastic elements. This means that every neuron reads from every other neuron and not itself, and corrections are not exact – there is a small element of randomness or “noise” in there.

The popular choice for this kind of test is a feed-forward network trained through back-propagation. That network requires hidden units, and each output (is it N, is it Q) is totally ignorant of the other outputs, which I think is a detriment.

My network variant has just finished a training run after 44 training cycles, which shows that the simple recurrent network can learn to recognise simple letters without relying on hidden units.

Another interesting thing about the method I used is how the training works. Instead of throwing a huge list of tests at the network, I have 26 tests, but only a set number of them are run in each cycle, depending on how many were answered correctly up to that point. For example, a training cycle with 13 tests will only be allowed if the network previously passed 12 tests.
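
The gating rule in sketch form (runTest is assumed to run one test against the network and return whether it got it right):

```javascript
// Only allow as many tests per cycle as the network has earned: a cycle with
// 13 tests runs only once a cycle with 12 tests has been passed in full.
function runCurriculum(tests, runTest, maxCycles = 10000) {
  let allowed = 1;
  for (let cycle = 0; cycle < maxCycles && allowed <= tests.length; cycle++) {
    const results = tests.slice(0, allowed).map(test => runTest(test));
    if (results.every(passed => passed)) allowed++; // earned one more test
  }
  return allowed - 1; // how many tests the network has mastered
}
```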

There are still a few small details I’d want to be sure about before pronouncing this test an absolute success, but I’m very happy with it.

Next week, I hope to have this demo re-written in Java, and a new demo recognising flowers in full-colour pictures (stretching myself, maybe…).

As always, this has the end goal of being inserted in a tiny robot which will then do my gardening for me. Not a mad idea, I think you’re beginning to see – just a lot of work.

update As I thought, there were some points which were not quite perfect. There was a part of the algorithm which would artificially boost the success of the net. With those deficiencies corrected, it takes over 500 cycles to get to 6 correct letters. I think this can be improved… (moments later – it now only takes 150+ cycles to reach 6 letters)

neural net balancing thing

Last century, when I worked for Orbism, a co-worker, Olivier Ansaldi (now working for Google), showed me a Java applet he was working on which learned how to balance a stick using a neural net.

I decided to try it myself, and yesterday I wrote a neural net that does it.

demo (Firefox only)

There is a total of 3 neurons in the net – 1 bias, 1 input (stick angle) and 1 output.
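
With only those three neurons, the whole controller reduces to a single weighted sum; reading the output as a push direction is my assumption about how the platform uses it:

```javascript
// The entire balancing net: output = sigmoid(bias*w0 + angle*w1).
// Training just nudges w0 and w1 until the platform keeps the stick upright.
function balanceOutput(angle, w0, w1) {
  const bias = 1;
  const sum = bias * w0 + angle * w1;
  return 1 / (1 + Math.exp(-sum)); // e.g. >0.5 push one way, <0.5 the other
}
```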

It usually takes about 20 iterations to train the net. Sometimes, it gets trained in such a way that the platform waggles back and forth like a drunk, and sometimes it gets trained so perfectly that it’s damn boring to watch (basically, it’s a platform with a stationary stick on it).

Anyway… for my next trick, I’ll try building a net which can recognise letters and numbers.

Partly, the point of this was that it was an itch I wanted to scratch. But also, I want to write it in C++; I’m better at JavaScript, though, so I wrote it in JS to test the idea first before I attempt a port.

I’m getting interested in my robot gardener idea again, so am building up a net that I can use for it.

Some points about how this differs from “proper” ANNs.

  • Training is done against a single neuron at a time (not important in this case, as there are only three neurons anyway).
  • This net will attach all “normal” neurons to all other neurons. I don’t like the “hidden layer” model of ANNs – I think they’re limited.
  • No back-prop algorithms are used – I don’t trust “perfect” nets and prefer a bit of organic learning in my nets.
  • The net code itself is object-oriented and self-sufficient. It would be possible to take the code and use it in another JS project.

ah… spring; when young men’s thoughts turn to…

robots!

So anyway, I moved house (long story short), meaning that I get to think more clearly, as the house is less cluttered and the route to work involves crossing fewer roads.

This morning, I was thinking about my current project – I’m writing a recurrent connectionist network so my new robot can learn to recognise things like grass and rubbish (to cut the former, and remove the latter).

The walk was getting tiring, so I was thinking about Segways as well, and wondering how easy it might be to make one.

This eventually evolved into an idea for a new transport system – you get a load of little robots (my gardening ones, for example), and get them to form a platform. Then a load more of them form another platform on top. Then, you stand on the top.

The “carpet” would move in the direction you lean. Of course, the speed wouldn’t be too impressive, but it would be better than walking.

When the lower layer encounters a rock on the road or something, it moves around it. The upper layer robots interlock with each other to allow the lower level bots to do this without having too much pressure from above.

When you reach where you are going, the robots then disperse and continue their gardening around the new area.

You could even form a baggage train using this idea – a few carpet networks would follow each other in marching-ant form.

This would be easier to do than to create a robot which does your gardening for you…