Evolutionary Music
Latest revision as of 23:50, 2 June 2021
Week 2 Update
This week was mostly about settling on a feasible and well-defined project. I'm interested in both evolutionary models and birdsong, so I've been leaning towards something that can combine both. The tough thing when it comes to designing an evolutionary model of music production is coming up with a fitness function. A genetic algorithm needs access to some way to map a genotype to a fitness value, and it's not clear how this can be done when a genotype is a piece of music. What makes good music? Between two snippets, how can you decide which is better?
Well, one thing you can do is ask how similar the piece of music is to pieces of music you know are good! There are more quantitative metrics of similarity than there are of quality. But rather than measure similarity directly, you could also get to it by asking how easy it is to tell the generated piece from one of the good pieces. This is the logic behind GANs, or generative adversarial networks. A GAN consists of two systems: a generator and a discriminator. The job of the generator is to take a set of data and try to produce more examples of that dataset. For instance, you might feed the generator a bunch of Picasso paintings and ask it to generate more. The job of the discriminator is to take one of the generator's outputs alongside an example from the original dataset and try to determine which is which. The two systems encourage each other to get better, and eventually the goal is to have the performance of the discriminator fall to just random chance. At that point, there's no way to tell (from the computer's perspective, at least) the generated outputs from the original dataset. The canonical GAN is used on images, but Stanford's very own Chris Donahue helped develop the WaveGAN, which works over raw waveforms.
I'm interested in seeing if I can get WaveGAN to work for birdsong. I'm also interested in seeing if evolving the weights of WaveGAN (using the discriminator error rate as the fitness function) can achieve comparable results to the canonical learning algorithm, but given the finickiness of GAN training, that might need to be a project for another day.
In the next week I'm hoping to lock down a dataset of birdsong (there seem to be multiple options!) and start digging into the WaveGAN paper. I haven't ever worked with GANs before, so I'm looking forward to it!
Week 3 Update
Just starting off the week with a bunch of links related to birdsong that I'm following.
-A paper on an evolutionary model of birdsong
-A paper modeling syringeal vibrations in songbirds
-An old MUSIC220A song on birdsong
-A pitch tracing application (for cleaning up birdsong field recordings)
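-Bill Schottstaedt's page on music generation (including birdsong) (https://ccrma.stanford.edu/software/snd/snd/clm.html)
-A Csound synthesizer for birdsong (https://csoundjournal.com/ezine/winter2000/realtime/)
-Csound python examples (https://github.com/csound/csoundAPI_examples/blob/master/python/README.md)
-Matlab implementation of birdsong simulation (https://github.com/ilknuricke/birdsong_generator_CPG)
-Snd homepage (https://ccrma.stanford.edu/software/snd/index.html)
-Synthetic bird sounds dataset (http://www.mee.tcd.ie/~sigmedia/Resources/SynthBirdsData)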
As should be clear from the mountain of links immediately above, I spent the majority of Week 3 doing research. After reading through the WaveGAN paper, I realized that it wouldn't be the best fit for the project. WaveGAN outputs actual waveforms, so in order to remain computationally feasible it's capped at 1 second of audio, which isn't enough to simulate some of the more interesting bird calls. Not to mention, the authors already tested their algorithm on birdsong!
So I started looking for some kind of a synthesizer that would (ideally) have a small number of parameters. This took an unfortunately large amount of time. I kept bouncing back and forth between a paper I'd found on evolving birdsong that lacked any sort of documentation except for a pointer to an outdated Csound synthesizer, an implementation in Matlab using differential equations that I couldn't really parse, and the work of CCRMA's own Bill Schottstaedt from a number of years ago. It took some doing to get Scheme successfully installed onto my laptop, but ultimately I got Bill's configuration to work! Now comes the hard part: digging into the code to understand which parameters do what, and how each species of bird could be rendered as some kind of computational genotype.
Also of note, my dad pointed me towards a paper on latent variable evolution (https://www.cse.msu.edu/~rossarun/pubs/BontragerRossDeepMasterPrint_BTAS2018.pdf) which seems super relevant to my project and was written by my soon-to-be-mentor at NYU. It's a small world!
Week 4 Update
This week was spent familiarizing myself with Scheme (in general) and Bill Schottstaedt's code (in particular). Unfortunately there isn't much to show or write about this, but I do want to shout out to Chris Chafe for walking me through some code line by line!
In terms of how to render the birdsong as a genotype, I'm again running into a bit of a snag. Bill's bird synthesizers, while amazingly high-fidelity, are extremely heterogeneous. Very little code is re-used from one synthesizer to the next, making it very difficult to decompose the code enough to use genetic programming. So at the moment I'm leaning a bit towards using a simple genetic algorithm to output parameters to replace the defaults in the synthesizer. But this runs into another problem! The set of parameters that make a good fit is extremely dependent on the specific synthesizer into which they're inputted: a set of parameters that sounds great as a loon could sound awful as a robin. This makes evaluating the fitness of any given genotype very challenging. So to combat this I'm considering using a clustering approach to first bucket the 150-or-so bird recordings into a few manageable buckets. That way, hopefully the genetic algorithm can be reasonably certain that its parameters will sound good for any synthesizer in the cluster. But to be honest, I have no idea if it will work!
Week 5 Update
Some good progress this week! I continued a bit down the path of birdsong clustering and wrote my first piece of real Scheme code (woot woot woot!) which collected each of the birdsong recordings into its own .wav file. After this, I used the tools from pyAudioAnalysis to extract the chromagrams and used that information to do clustering on the songs. This technically worked, in the sense that it produced valid output, but it certainly didn't produce any coherent clusters. After starting to think a bit about how to improve the clustering (perhaps using other features like MFCC, or some kind of temporal convolution?) I realized that this might be going about the problem the wrong way. Instead of trying to cluster 150 bird recordings into 20 groups and then trying to extract some kind of common synthesizer architecture from each cluster, why don't I instead simply fix the synthesizer architecture and leave only the parameters up to the GA? This struck me as a much more efficient way to go about things, so I changed tactics. Now, instead of explicitly using the GA to produce bird-like calls that did not align exactly with existing calls, the challenge for the algorithm is to reproduce an existing bird call using only a fixed and rather limited audio setup (namely, a single polywave oscillator, with the amplitude envelope, frequency envelope, and partials left up to the GA). I wrote some more Scheme code to get this working (well, mostly I just pared down one of Bill's existing synthesizers) as well as a simple Python script to fill in the aforementioned parameters. At this point, I could generate a random genotype (read: list of 91 floating point numbers) and convert it into a "bird call." Given the random nature of the input, the output was basically just a warbled screech. But it's a start!
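To make the genotype-to-parameters step concrete, here is a minimal Python sketch of how a flat list of 91 floats could be carved into synthesizer parameters. Only the total of 91 and the three parameter groups come from the description above; the 40/40/11 split, the (time, level) breakpoint pairing, and the function names are assumptions made for illustration, not the actual script.

```python
import random

GENOME_LENGTH = 91  # from the log; the split below is an assumption
N_AMP, N_FREQ, N_PARTIALS = 40, 40, 11  # hypothetical: 40 + 40 + 11 = 91


def random_genotype(rng):
    """Return GENOME_LENGTH floats drawn uniformly from [0, 1)."""
    return [rng.random() for _ in range(GENOME_LENGTH)]


def to_breakpoints(values):
    """Pair consecutive values as (time, level) and sort them by time."""
    return sorted(zip(values[0::2], values[1::2]))


def decode(genotype):
    """Split a flat genotype into the three parameter groups."""
    if len(genotype) != GENOME_LENGTH:
        raise ValueError("expected %d genes" % GENOME_LENGTH)
    amp = genotype[:N_AMP]
    freq = genotype[N_AMP:N_AMP + N_FREQ]
    partials = genotype[N_AMP + N_FREQ:]
    total = sum(partials) or 1.0  # avoid dividing by zero
    return {
        "amp_env": to_breakpoints(amp),
        "freq_env": to_breakpoints(freq),
        "partials": [p / total for p in partials],  # weights sum to 1
    }


params = decode(random_genotype(random.Random(0)))
```

The resulting dictionary is what a script would then substitute into the synthesizer definition; a random genotype decodes to valid but arbitrary envelopes, which is consistent with the warbled screech.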
I also had some code for evolutionary optimization lying around from a previous project, so all I needed to get that working would be a fitness function -- a way to evaluate how "good" a given genotype is. I whipped one together by using the same set of audio features I used for clustering and defining fitness as the inverse of the sum of the distance between the output and a target birdcall for each of the features. This... so far has not produced good outputs. I've only let the algorithm run for about an hour, so it's possible with more time something better might emerge, but I also think the fitness function is a little sketch, so I'm looking for a better way to compare two audio files. I've attached some links from my searching below:
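As a rough illustration of that first fitness function, the sketch below computes the inverse of the summed per-feature distances. The feature names ("chroma", "mfcc"), the use of Euclidean distance, and the small epsilon guarding against division by zero are all assumptions; the real features came from pyAudioAnalysis.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fitness(candidate, target, eps=1e-9):
    """Inverse of the summed distance over every feature in `target`.

    `candidate` and `target` map a feature name to a vector of floats.
    """
    total = sum(euclidean(candidate[name], target[name]) for name in target)
    return 1.0 / (total + eps)


# Hypothetical, hand-made feature vectors purely for demonstration.
target = {"chroma": [0.0, 0.0], "mfcc": [1.0, 1.0]}
near = {"chroma": [0.1, 0.0], "mfcc": [1.0, 1.1]}
far = {"chroma": [3.0, 4.0], "mfcc": [5.0, 1.0]}
```

One weakness this makes visible: features with a larger numeric range dominate the sum unless each one is normalized first, which may be part of why the function felt unreliable.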
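-Phonological Corpus Tools (maybe better suited for speech) (https://corpustools.readthedocs.io/en/latest/introduction.html)
-Paper on comparing speech, cited by PCT (http://www.globalaccentchinese.com/back/kindeditor/attached/file/20181219/20181219080155_15675.pdf)
-Comparing songs with Siamese Neural Networks (https://towardsdatascience.com/calculating-audio-song-similarity-using-siamese-neural-networks-62730e8f3e3d)
-Dynamic time warping (https://towardsdatascience.com/an-illustrative-introduction-to-dynamic-time-warping-36aa98513b98)
-Magenta's Spectral Loss (https://github.com/magenta/ddsp/blob/master/ddsp/losses.py)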
Week 6 and 7 Update
Apologies for missing last week! But I hope to condense most of the progress from both weeks into a readable summary here, especially since most of it was spent on the same task. Throughout week 6, I was mostly working on implementing a new loss function using the Spectral Loss module in the Differentiable Digital Signal Processing (DDSP) library from Google Magenta (thanks Ketan for the pointer!). The spectral loss function seemed to provide a better signal than the previous loss function I was using, but it was limited by the fact that the target audio and the generated audio needed to be the exact same number of samples. This meant that I had to manually fix the duration of the generated audio (as opposed to letting the GA modify that as a parameter), which was definitely suboptimal.
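For readers unfamiliar with the idea, here is a small pure-Python sketch of a multi-scale spectral loss. It is not DDSP's implementation: the naive DFT, the non-overlapping frames, the frame sizes, and the mean absolute difference are simplifying assumptions chosen so the example is self-contained. It does show why the two signals must have the same number of samples, since the frames are compared pairwise.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative-frequency bins of a naive DFT."""
    n = len(frame)
    return [
        abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]


def frames(signal, size):
    """Non-overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def spectral_loss(generated, target, frame_sizes=(64, 32, 16)):
    """Sum over frame sizes of the mean absolute magnitude difference."""
    if len(generated) != len(target):
        raise ValueError("signals must have the same number of samples")
    total = 0.0
    for size in frame_sizes:
        diffs = []
        for g, t in zip(frames(generated, size), frames(target, size)):
            mg, mt = magnitude_spectrum(g), magnitude_spectrum(t)
            diffs.extend(abs(a - b) for a, b in zip(mg, mt))
        if diffs:
            total += sum(diffs) / len(diffs)
    return total


low_tone = [math.sin(2 * math.pi * 4 * t / 64) for t in range(128)]
high_tone = [math.sin(2 * math.pi * 12 * t / 64) for t in range(128)]
```

Comparing at several frame sizes trades off time resolution against frequency resolution, which is the usual motivation for the multi-scale form.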
That was until Ketan (thanks again) pointed me towards DDSP's resampling function, which allows me to rescale the generated output to match the target data! This solved the block and made it so I could return duration to being a modifiable parameter. I also spent a lot of time porting my code over to a new evolutionary computation library. This seemed to help on a number of fronts (efficiency, performance, etc). I took the opportunity to also refactor my code base, which had gotten a little messy. This refactor continued throughout the end of week 7, and is now complete! Finally, I also figured out how to use the new library to have different parameters of differing datatypes. This means that I can have envelope breakpoints remaining as a vector of decimal numbers, but also add in categorical variables like the number of oscillators to use. I have all the framework set up for this on the evolutionary algorithm side, and now I just need to make the necessary edits for a second generator in the Scheme code.
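The length-matching step can be sketched as follows. This is a plain linear-interpolation stand-in written for illustration, not DDSP's resampling function, and the name `resample_linear` is hypothetical.

```python
def resample_linear(signal, n_out):
    """Stretch or squeeze `signal` to exactly `n_out` samples.

    Uses linear interpolation and keeps the first and last samples fixed.
    """
    if n_out <= 0 or not signal:
        raise ValueError("need a non-empty signal and a positive length")
    if n_out == 1 or len(signal) == 1:
        return [signal[0]] * n_out
    scale = (len(signal) - 1) / (n_out - 1)
    out = []
    for i in range(n_out):
        pos = i * scale                     # fractional index into `signal`
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)   # clamp at the final sample
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out


stretched = resample_linear([0.0, 1.0, 2.0], 5)
```

Note that stretching audio this way also shifts its pitch, so a generated call of the wrong duration is compared against the target at a different pitch than the one it was synthesized at.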
The current outputs seem, broadly speaking, to be doing well. In particular, the algorithm seems good at getting the right pitches. What it seems to struggle with, however, is getting the correct duration and, in cases where there are multiple noises, rendering each of them sequentially. I've also noticed that occasionally the model outputs songs with very low volume (much lower than the target). I suspect that this might be because lowering the amplitude is a good way for the model to "hedge its bets" with respect to the loss function -- it's akin to when a visual generative model produces outputs that are mostly gray, since that averages the extremes of all the colors. But that's just a hunch, and I'm thinking of making some more visualizations to give me a better way to interrogate the features of the generated audio.
Week 8 Update
This past week I worked on a new feature-based loss function, some potential debugging, and some visualizations for the model's output. The new loss function was a return to the auditory features approach I was working on before being introduced to the Spectral Loss function. At the time, the issue was that I had to take the average of these features and thus lost a lot of information. With the resampling approach, I was able to just take as the loss function the average vector distance between the target sample and the generated sample for a list of features. Of all the possible features, I wound up settling on only those relating to the Mel Frequency and the Chroma values. This loss function appeared to produce results roughly similar to those of the Spectral Loss, with perhaps slightly better fidelity with respect to pitch but worse with respect to timbre. But, in general, the model is still struggling to really recapture the sound of the bird recordings.
So I decided to try and measure more concretely where the model is failing. Using just one particular bird recording as a test case (the Blue Grosbeak), I measured the distance between the model's best phenotype and the target directly. For instance, I measured the vector distance between the target frequency envelope and the model's generated envelope. I noticed that these metrics did not really decrease very reliably, which unfortunately indicates that the loss functions may not be the best measures of, for instance, frequency and amplitude similarity. I then tried to use these feature differences as the loss function directly, but this also didn't seem to yield better results, which was very surprising. This could indicate that the mutation rate is too high, but a quick experiment I ran seemed to indicate that lower mutation led to worse performance. Currently, I'm a little stumped and plan to continue investigating this. In theory, this artificial loss function where I compare directly to the target envelopes ought to give perfect performance.
Week 9 and 10 Update
Sorry this update comes a bit delayed -- I got sucked into trying to get the birds to sound better! Week 9 (and what has passed of Week 10) were quite busy, basically a flurry of trying all sorts of different approaches (none of which have quite seemed to work). I started off by introducing annealing to the mutation rate, which basically means that the amount of randomness added to each generation goes down over time. In theory, this should help convergence as the algorithm will have an easier time settling into the correct solution once it gets close, and be able to move quickly through the space of solutions at the beginning. In practice, I didn't really see any difference in performance.
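A minimal sketch of an annealed mutation step is below. The exponential decay schedule, the start and end values, the Gaussian noise, and the clamping of genes to [0, 1] are assumptions for illustration; the log only says that the amount of randomness goes down over time.

```python
import random


def annealed_sigma(generation, n_generations, start=0.3, end=0.01):
    """Mutation strength decaying exponentially from `start` to `end`."""
    if n_generations <= 1:
        return end
    frac = generation / (n_generations - 1)  # 0.0 at the first generation, 1.0 at the last
    return start * (end / start) ** frac


def mutate(genotype, sigma, rng):
    """Add Gaussian noise to every gene, clamping results to [0, 1]."""
    return [min(1.0, max(0.0, g + rng.gauss(0.0, sigma))) for g in genotype]


rng = random.Random(0)
parent = [0.5] * 10
early_child = mutate(parent, annealed_sigma(0, 100), rng)
late_child = mutate(parent, annealed_sigma(99, 100), rng)
```

Early children can land far from the parent, while late children stay close to it, which is the intended explore-then-refine behavior.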
After that I realized that a potential bug was lurking in the code: I was resampling the generated outputs in order to compare them to the target recording, but wasn't resampling when I was outputting the generated recordings for me to listen to. To my surprise (and dismay), fixing this error seemed to decrease the quality of the outputs! So I've been outputting both the original and resampled recordings now just to check which is better in each case. As part of fixing this bug, I wondered if perhaps my resampling function was to blame, so I sought out a number of other options (and played around for a significant while with the parameters on DDSP's resample function). I even undertook an ill-fated attempt to implement my own resampling function. Unfortunately, as with my previous interventions, none of these seemed to improve things, so I went back to my initial resampling function. During this time, I was also playing with different hyperparameter settings and even some different loss functions. Again, no improvement.
Finally, at my wits' end, I thought I would give one last wacky attempt to solve the problem: a new loss function based on image difference instead of audio difference by first rendering the recordings as spectrograms. As I write this, I'm still in the process of seeing if it works, though early results don't leave me optimistic. Sad to say, but on the whole I think I have to chalk up this project as a failure. That's not to say I didn't learn anything -- but perhaps the biggest lesson is that Bill Schottstaedt really did some amazing work when he hand-carved the envelopes for those 150-or-so birds.
UPDATE: I think I was a little too pessimistic when I wrote that previous paragraph! I hadn't given a listen to all of the outputs of the model, and while most certainly were unlistenable garbage, there were a couple of diamonds in the rough! I managed to collect 4 recordings that I'm quite happy with, and also explored recombinations between them, which led to some interesting results.
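One way the image-difference idea could look in code is sketched here. It assumes the spectrograms are already available as equally sized 2D grids of magnitudes; the grayscale normalization and the mean absolute pixel difference are illustrative choices, not necessarily what was actually run.

```python
def to_grayscale(spectrogram):
    """Scale a 2D grid of magnitudes to integer pixel values in 0..255."""
    flat = [v for row in spectrogram for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # a constant image maps to all zeros
    return [[round(255 * (v - lo) / span) for v in row] for row in spectrogram]


def image_difference(spec_a, spec_b):
    """Mean absolute pixel difference between two spectrogram images."""
    img_a, img_b = to_grayscale(spec_a), to_grayscale(spec_b)
    if len(img_a) != len(img_b) or any(
        len(ra) != len(rb) for ra, rb in zip(img_a, img_b)
    ):
        raise ValueError("spectrograms must have the same shape")
    n_pixels = sum(len(row) for row in img_a)
    total = sum(
        abs(a - b) for ra, rb in zip(img_a, img_b) for a, b in zip(ra, rb)
    )
    return total / n_pixels


# Tiny hand-made "spectrograms" (rows = frequency bins, columns = time frames).
spec = [[0.0, 1.0], [2.0, 4.0]]
flipped = [[4.0, 2.0], [1.0, 0.0]]
louder = [[0.0, 2.0], [4.0, 8.0]]
```

Because each image is normalized on its own, this particular version ignores overall loudness: `spec` and `louder` compare as identical.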
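For anyone curious what recombining two evolved calls might involve, here is a minimal uniform-crossover sketch. The log does not say which recombination operator was used, so the operator, the `p_swap` parameter, and the toy parent genotypes are assumptions.

```python
import random


def uniform_crossover(parent_a, parent_b, rng, p_swap=0.5):
    """Build a child by taking each gene from parent_b with probability p_swap."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same number of genes")
    return [b if rng.random() < p_swap else a for a, b in zip(parent_a, parent_b)]


rng = random.Random(42)
call_a = [0.1] * 8  # stand-ins for two evolved genotypes
call_b = [0.9] * 8
child = uniform_crossover(call_a, call_b, rng)
```

Crossing whole parameter groups (for example, one parent's frequency envelope with the other's amplitude envelope) would be an alternative that keeps each envelope internally coherent.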