Create genre-specific melodies using TensorFlow

This post walks through training neural networks to generate genre-specific melodies using Magenta, a music-generation library built on TensorFlow.

Set up your environment

First, install Magenta using the instructions in the repository. With Docker, the following command pulls an image pre-loaded with Magenta and starts a container:

$ docker run -it -p 6006:6006 -v /tmp/magenta:/magenta-data tensorflow/magenta

Warming up: generate melodies using the attention_rnn pre-trained model

In this example we use the attention_rnn configuration, which produces melodies with longer arching themes than the basic configuration.

Locate the pre-trained bundle attention_rnn.mag (mine was under the models directory) and set the variables used in the command below: BUNDLE_FILE='/magenta/magenta/models/melody_rnn/attention_rnn.mag' and CONFIG='attention_rnn':

$ melody_rnn_generate --config=${CONFIG} --bundle_file=${BUNDLE_FILE} --output_dir=/tmp/melody_rnn/generated --num_outputs=10 --num_steps=128 --primer_melody="[60]"

Explore the hyperparameters

Magenta creates an LSTM model on-the-fly using several user-defined hyperparameters. Let’s explore some of them to get a feeling for how they affect the sound.

Generate a melody using the pre-trained model and the default values from the Magenta instructions (e.g., primer_melody=[60], temperature=1.0). '60' is the MIDI note number corresponding to C5 in the MIDI note table.
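As a quick reference, here is a small Python sketch (my own helper, not part of Magenta) that maps MIDI note numbers to pitch names using the same octave convention as the note table above:

# Sketch: map a MIDI note number to a pitch name.
# Octave numbering varies between note tables; here middle C (60) is labelled C5
# to match the convention used in this post.
NOTE_NAMES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def midi_to_name(note, middle_c_octave=5):
    octave = note // 12 + (middle_c_octave - 5)
    return '{}{}'.format(NOTE_NAMES[note % 12], octave)

print(midi_to_name(60))  # C5
print(midi_to_name(70))  # A#5
print(midi_to_name(72))  # C6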

Basic melody with key shift starting around 00:09
Playful, but not melodic

Alter the primer melody

Change the primer melody to [60,-2,60,-2] (in Magenta's melody encoding, -2 is a 'no event' step that sustains the previous note or rest) and compare some outputs:

Not interesting or melodic
Melodic, but not interesting
Melodic, and a bit experimental
Repetitive, but showing syncopation

Getting warm - modify the temperature

Temperature is a measure of the randomness of the predictions. Temperature = 0.6 is expected to be less random/creative. See my post on Stack Exchange for a full explanation.
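As a reminder of what temperature does under the hood, here is a minimal sketch (my own illustration, not Magenta's implementation) of a temperature-scaled softmax over some hypothetical next-note scores:

import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the logits by the temperature before taking the softmax."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum()

# Low temperature sharpens the distribution (safer, more repetitive notes);
# high temperature flattens it (more surprising notes).
logits = [2.0, 1.0, 0.5]  # hypothetical next-note scores
for t in (0.6, 1.0, 1.2):
    print(t, softmax_with_temperature(logits, t).round(3))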

Musically bland.

Temperature = 1.2: Should be more random/adventurous.

The tritone, also known as diabolus in musica ("the devil in music"), is the rarest diatonic interval in Western music. Hearing it here (00:03) shows the randomness introduced by increasing the temperature.

Some samples sound quite pleasant and spontaneous:

So let's stay with the default temperature (1.0). What if we change the starting primer note from C5 (60) to something less "happy", like a transition from A# (70) to C (72)? Now the primer melody is [70,-2,72,-2]:

The notes sound higher, obviously, but also perhaps less constrained to the elementary tonality of the C major scale.
Sounds like a gypsy melody.

Choosing primer notes from the natural diatonic major scale (e.g., C5, D5) biases the network toward the learned note combinations that include them. In Western music, the natural diatonic scale is considered a happy one.
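To make this concrete, here is a tiny sketch (my own) that checks whether a MIDI note's pitch class belongs to the C-major diatonic scale; the A# primer (70) falls outside it, while 60 and 72 do not:

# Pitch classes of the C-major (natural diatonic) scale: C D E F G A B
C_MAJOR_PITCH_CLASSES = {0, 2, 4, 5, 7, 9, 11}

def in_c_major(midi_note):
    return midi_note % 12 in C_MAJOR_PITCH_CLASSES

print(in_c_major(60), in_c_major(70), in_c_major(72))  # True False True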

Exercise: Explore the sound space and compare the “mood” of melodies generated in various keys.

Key         Affective characteristic*
C Major     Pure
D Major     Triumph
E♭ Major    Love
F Minor     Depression
A♭ Major    Death
B Major     Colorful, passionate

* Rita Steblin, A History of Key Characteristics in the 18th and Early 19th Centuries, UMI Research Press, 1983.

Softmax parameters - beam size and branch factor

Increasing the beam size and branch factor widens the search space, which produces sequences with lower negative log-likelihood, but it does not necessarily improve the quality of the output. With beam_size=2 and branch_factor=2 (and the other parameters at their defaults), every output consisted of the primer note (C5) repeated monotonously.
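The monotony makes sense if you look at what beam search rewards. Here is a toy sketch (not Magenta's implementation) of beam search over a fixed next-note distribution in which one note dominates; widening the beam mostly finds more ways to repeat that note:

import numpy as np

def beam_search(step_log_probs, beam_size=2, branch_factor=2):
    """Toy beam search over per-step categorical log-probabilities."""
    beams = [([], 0.0)]  # (note sequence, log-likelihood so far)
    for log_probs in step_log_probs:
        candidates = []
        for seq, score in beams:
            # Expand each beam with its `branch_factor` most likely notes
            for note in np.argsort(log_probs)[::-1][:branch_factor]:
                candidates.append((seq + [int(note)], score + log_probs[note]))
        # Keep only the `beam_size` highest-scoring candidates
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams

# Hypothetical distribution where one note (index 0, say the primer) dominates
step = np.log([0.6, 0.2, 0.1, 0.1])
print(beam_search([step] * 4))  # the best sequence is [0, 0, 0, 0]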

Increasing the temperature parameter to 1.4 allowed the model to break out of this local minimum (but only barely) by adding minimal rhythmic variation:

Train a model using genre-specific sets of MIDI

Following Magenta's "Building your Dataset" instructions, first we convert the MIDI files into NoteSequence format. NoteSequences are protocol buffers, Google's format for serializing structured data. It is like XML but smaller (the data is serialized into a compact binary wire format), faster, and extensible across languages.
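To get a feel for what a NoteSequence holds, here is a short sketch (assuming the midi_io helpers bundled with Magenta; the file path is hypothetical) that converts one MIDI file and inspects the resulting proto:

# Convert a single MIDI file to a NoteSequence and peek inside.
from magenta.music import midi_io

ns = midi_io.midi_file_to_sequence_proto('clean_midi/Some Artist/some_song.mid')
print(len(ns.notes))   # number of note events
print(ns.notes[0])     # first note: pitch, velocity, start_time, end_time
print(ns.tempos[0].qpm if ns.tempos else 'no tempo events')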

Download the dataset

The Lakh MIDI dataset contains 50,000 MIDI files and is free to use. Download the ‘clean midi’ set.

Look up artist genres using the Spotify API

To see what kinds of genres we have, use the Spotify API to look up each artist's genre tags.

The following script builds a list of artists from the folder names in clean_midi and stores each artist's list of genres in the dictionary genres.

# Login to Spotify and get your OAuth token:
# https://developer.spotify.com/web-api/search-item/
import os
import spotipy

AUTH = "ENTER-MY-AUTH-KEY"

# Get artists from folder names
artists = [item for item in os.listdir(
    'clean_midi') if not item.startswith('.')]

sp = spotipy.Spotify(auth=AUTH)
genres = {}
for i, artist in enumerate(artists):
    try:
        results = sp.search(q=artist, type='artist', limit=1)
        items = results['artists']['items']
        genre_list = items[0]['genres'] if items else []
        genres[artist] = genre_list
        if i < 5:
            print("INFO: Preview {}/5".format(i + 1),
                  artist, genre_list[:5])
    except Exception as e:
        print("INFO: ", artist, "not included: ", e)
[Figure: distribution of artists by genre in the Lakh MIDI dataset.] See the preprocessing and visualization notebook for sample outputs and visualization scripts.

Visualize the genre-crossover

What are the most common genres in the dataset?

# Get the most common genres
from collections import Counter

flattened_list = [item for sublist in list(genres.values()) for item in sublist]
c = Counter(flattened_list)
c.most_common()[:7]

Output:

[('mellow gold', 385),
 ('soft rock', 373),
 ('rock', 355),
 ('classic rock', 289),
 ('album rock', 275),
 ('folk rock', 262),
 ('new wave pop', 247)]

Place the data in a dataframe where each row corresponds to an artist and each column is a binary indicator for one genre.

# Convert labels to binary genre vectors
import numpy as np
import pandas as pd

categories = sorted(set(flattened_list))
df = pd.DataFrame(columns=categories)

for author, genre_list in genres.items():
    row = pd.Series(np.zeros(len(categories)), index=categories, name=author)
    for genre in genre_list:
        if genre in categories:
            row[genre] = 1
    df = pd.concat([df, pd.DataFrame(row).T])
df = df.reindex(sorted(df.columns), axis=1)

Then, to encode each genre with a color, we reduce the 600 genres to a handful of meta-genres (e.g., rock, pop) based on whether a feature name contains one of those words.

# Assign a label to each author corresponding to a meta-genre (e.g., rock, classical)
def getStyle(genre_substring):
    """Get data where features contain `genre_substring`."""
    style_index = np.asarray([genre_substring in x for x in df.columns])
    style_array = df.iloc[:, style_index].any(axis=1)
    return style_array

# Create array of color/labels (0 = 'other')
color_array = np.zeros(df.shape[0])
genre_labels = ['other', 'rock', 'metal', 'pop', 'mellow', 'country', 'rap', 'classical']
for i, g in enumerate(genre_labels):
    if g != 'other':
        color_array[np.where(getStyle(g))] = i

Apply t-SNE and PCA

t-distributed stochastic neighbor embedding (t-SNE) is a tool for visualizing data that estimates the probability of a relationship between data points. Principal component analysis (PCA), on the other hand, relies solely on the covariance matrix to find the directions of greatest variance for representing the data. Here, the closeness of points is a function of the crossover between genres: an artist who plays both rap and classical music would land between the two clusters visible in the t-SNE graph below.

Initialize a cmap variable for color sampling. CMAP_NAME is assumed to be the name of a matplotlib colormap (e.g., 'viridis') defined earlier in the notebook:

import matplotlib.cm
cmap = matplotlib.cm.get_cmap(CMAP_NAME, lut=int(max(color_array)) + 1)

Initialize axes lists for multicolor scatter plotting.

axes_tsne = []
axes_pca = []

Use sklearn’s t-SNE method for 2-dimensional visualization:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

model = TSNE(random_state=0)
np.set_printoptions(suppress=True)
X_tsne = model.fit_transform(df.values)

for l in set(color_array):
    ax = plt.scatter(X_tsne[color_array == l][:, 0], X_tsne[color_array == l][:, 1],
                     c=cmap(l / max(color_array)), s=5)
    axes_tsne.append(ax)
plt.legend(handles=axes_tsne, labels=genre_labels, frameon=True, markerscale=2)

and sklearn’s PCA method:

from sklearn.decomposition import PCA

X_pca = PCA().fit_transform(df.values)
for l in set(color_array):
    ax = plt.scatter(X_pca[color_array == l][:, 0], X_pca[color_array == l][:, 1],
                     c=cmap(l / max(color_array)), s=5)
    axes_pca.append(ax)

plt.legend(handles=axes_pca, labels=genre_labels, frameon=True, markerscale=2)

[Figure: 2D t-SNE and PCA visualizations of the genre crossover.]

The 3-dimensional visualization is similar; check out the code. [Figure: 3D visualization of the genre crossover.]

Exercise: Visualize genre crossovers with colors that fade uniformly with the multi-label encoding.

Create melodies from two genres

Let’s create melodies from two genres: metal and classical. First, create subsets of each and put them in our subsets folder:

import os
import shutil

def get_artists(genre):
    """Get artists with label `genre`."""
    artists = [artist for artist, gs in genres.items() if genre in gs]
    return artists

# Get artists with genres 'metal' and 'classical'
genre_data = {}
metal = get_artists('metal')
classical = get_artists('classical')

genre_data['metal'] = metal
genre_data['classical'] = classical

# Copy artists to a genre-specific folder (MIDI_DIR points to the clean_midi folder)
for genre, artists in genre_data.items():
    try:
        for artist in artists:
            _genre = genre.replace(' ', '_').replace('&', 'n')
            shutil.copytree(os.path.join(MIDI_DIR, artist),
                            os.path.join(os.getcwd(), 'subsets', _genre, artist))
    except Exception as e:
        print(e)

Preprocessing

Magenta requires data to be in the form of NoteSequences rather than MIDI. From the subsets directory, convert the MIDIs to NoteSequences using a bash script:

for genre in */
do
  if [[ $genre == *examples* ]]
  then continue
  fi
  convert_dir_to_note_sequences \
  --input_dir=$genre \
  --output_file=/tmp/${genre%/}_notesequences.tfrecord \
  --recursive && echo "INFO: ${genre%/} converted to NoteSequences"
done

The subsets folder contains the following:

classical/                        metal/
classical_notesequences.tfrecord  metal_notesequences.tfrecord

Extract melodies from the .tfrecord NoteSequences into SequenceExample files in the sequence_examples/[genre] folder:

for genre in */
do
  if [[ $genre == *examples* ]]
  then continue
  fi
  melody_rnn_create_dataset \
  --config=attention_rnn \
  --input=/tmp/${genre%/}_notesequences.tfrecord \
  --output_dir=sequence_examples/${genre} \
  --eval_ratio=0.10 && echo "INFO: ${genre%/} database created."
done

Training your models

If you don't wish to train your own models, use these pre-trained weights (trained for 200 steps) and extract them to /tmp/melody_rnn/. Bigger is often better in deep learning, but to see how the architecture affects performance, simply train the model with a batch size of 64, a 2-layer RNN of 64 units per layer, and 200 training steps:

for genre in */
do
  if [[ $genre == *examples* ]]
  then continue
  fi
  melody_rnn_train \
  --config=attention_rnn \
  --run_dir=/tmp/melody_rnn/logdir/run1/${genre} \
  --sequence_example_file=$(pwd)/sequence_examples/${genre%/}/training_melodies.tfrecord \
  --hparams="{'batch_size':64,'rnn_layer_sizes':[64,64]}" \
  --num_training_steps=200 && echo "INFO: ${genre%/} model trained."
done

Exercise: Compare the training performance of models with various architectures and hyperparameters.

Make music

Finally, produce some melodies using the trained models. Make sure the hparams match those used in the previous step.

for genre in */ ;
do
  if [[ $genre == *examples* ]];
  then continue
  fi
  melody_rnn_generate \
  --config=attention_rnn \
  --run_dir=/tmp/melody_rnn/logdir/run1/${genre} \
  --output_dir=/tmp/melody_rnn/generated/${genre} \
  --num_outputs=10 \
  --num_steps=128 \
  --hparams="{'batch_size':64,'rnn_layer_sizes':[64,64]}" \
  --primer_melody="[60]" && echo "INFO: ${genre%/} melodies generated."
done

The MIDI files are in /tmp/melody_rnn/generated/[genre].

Convert MIDI files to mp3

Many modern operating systems (and most web browsers) do not natively play MIDI. To convert the files to a more flexible format, install timidity with brew install timidity (Mac) or sudo apt-get install timidity (Linux).

To play the sounds in a browser, convert the files to a format supported by HTML5, such as mp3. Run a short shell loop in the folder /tmp/melody_rnn/generated to convert all MIDI files in the directory to mp3:

for file in *.mid
do
  timidity "${file}" -Ow -o - | \
  ffmpeg -i - -acodec libmp3lame -ab 64k "${file%.*}.mp3"
done

Samples

[Embedded audio samples, three per genre: Metal, Classical (Four Seasons-inspired?), Latin, Funk, Punk, Jazz, R&B.]

Conclusion

In some cases the influence of the genre is audible. Further training would likely improve the results, and stripping the percussive tracks would also help.
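For reference, a rough sketch (my own, again assuming Magenta's midi_io helpers and a hypothetical file path) of what stripping the percussive tracks before training could look like:

import copy
from magenta.music import midi_io

ns = midi_io.midi_file_to_sequence_proto('clean_midi/Some Artist/some_song.mid')

# Build a copy of the NoteSequence containing only the non-drum notes
melodic_ns = copy.deepcopy(ns)
del melodic_ns.notes[:]
melodic_ns.notes.extend(n for n in ns.notes if not n.is_drum)

midi_io.sequence_proto_to_midi_file(melodic_ns, 'some_song_no_drums.mid')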

After listening to hundreds of such clips (and these are representative of the more pleasant ones), I have learned to appreciate the ordinary human ability to compose melodies. I'm interested in what you think.

Download the source code for visualizing, preprocessing, and scripting.
