AI with Aimee: Training Voice Models for Singing Using Google Colab and RVC

Ever really wanted to start a band but you couldn’t convince friends to join you, you didn’t have any musical instruments, and you’re only a so-so singer? Now, with the power of AI, you can!

One day I wanted to start a band called the Teacats. (I made this image using Adobe Photoshop, Illustrator, and MidJourney; I’m the blue-haired character “Hoppy.”) I have two friends named DAVE (bongo drum) and B3AM (guitar) who were willing to lend me audio samples of their voices so that I could live my lifelong dream of becoming the sea witch Ursula (from Disney’s ‘The Little Mermaid’) and stealing voices for my own nefarious purposes. Jokes aside, this is the story of how I (fairly quickly) made an ethically sound, AI-powered music band, and so can you. Here’s a sample! It’s not the greatest, and it’s not polished or finished, but you can see it in progress!

Want to do the same for your projects? You just need a little bit of fairy dust… creativity… RVC… and willingness to learn things!

In this tutorial I will train my B3AM voice model. In my next article, which will be about the inference process, I’ll pair it with a dry (non-processed) inference file (we’ll get to that!) that I purchased the rights to from LANDR, and show the end result.

So what do these terms mean? RVC? Voice model? Inference? Let’s break them down.

RVC stands for Retrieval-Based Voice Conversion, and it’s a process that uses minimal data to transform one voice into another using deep neural networks. So, what’s a deep neural network? I’ll give you my definition and then ChatGPT’s correction of what it really is.

My definition: Deep Neural Networks are an area of machine learning which aims to mimic the way the brain processes information in order to evaluate and utilize data.

ChatGPT’s correction: Deep Neural Networks (DNNs) are a fundamental concept in machine learning that attempts to replicate how the human brain processes information. These networks consist of interconnected nodes, akin to artificial neurons, organized in layers. By processing and learning from vast amounts of data, DNNs can recognize patterns, make predictions, and solve complex tasks. They excel in tasks like image and speech recognition, natural language processing, and decision-making, making them essential in the development of advanced AI systems.

Yeah. I meant to say that.
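
If you’d like to see what “interconnected nodes organized in layers” looks like in practice, here’s a minimal sketch of a tiny deep neural network in PyTorch (the same library that produces the .pth files we’ll meet later). The layer sizes here are arbitrary, purely for illustration; real voice models are far larger.

```python
import torch
import torch.nn as nn

# A tiny deep neural network: layers of interconnected "neurons".
# The sizes are arbitrary examples, not anything RVC actually uses.
model = nn.Sequential(
    nn.Linear(64, 128),   # input layer -> first hidden layer
    nn.ReLU(),            # non-linearity lets the network learn patterns
    nn.Linear(128, 128),  # second hidden layer
    nn.ReLU(),
    nn.Linear(128, 10),   # output layer
)

x = torch.randn(1, 64)  # one fake input sample
print(model(x).shape)   # torch.Size([1, 10])
```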

Your voice model is who you WANT to hear singing as the end result.

Your inference file is the file YOU provide of what you want your voice model to SING like.

So your voice model does not need to be trained on singing files, but your inference file does need to be singing. Your voice model will then work to capture the timbre of your inference file while replicating it.

Timbre (pronounced: TAM-ber) is defined by Oxford Languages as the character or quality of a musical sound or voice as distinct from its pitch and intensity.

Let’s get started!

Step one: Gather audio samples for your voice model.

Gather audio samples of the voice you want to HEAR singing as the end result (MP3 or WAV file formats are both fine). For this tutorial, the voice samples will be from my friend B3AM:

With the RVC fork I currently like to use, I’ve found that the samples don’t really need to include singing, even for vocal models. I’ve tried it both ways and haven’t noticed a big enough difference to add singing samples as of yet. Isn’t that kind of crazy?
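
If you want to sanity-check your samples before training, here’s a minimal sketch using pydub (it needs ffmpeg installed to read MP3s). The folder name is just an assumption for illustration.

```python
from pathlib import Path
from pydub import AudioSegment  # pip install pydub; requires ffmpeg for MP3

samples_dir = Path("voice_samples")  # assumed folder of MP3/WAV samples

# Print the duration, sample rate, and channel count of each sample.
for path in sorted(samples_dir.iterdir()):
    if path.suffix.lower() not in (".mp3", ".wav"):
        continue
    audio = AudioSegment.from_file(str(path))
    print(f"{path.name}: {audio.duration_seconds:.1f}s, "
          f"{audio.frame_rate} Hz, {audio.channels} channel(s)")
```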

Step two: Clean up the audio samples.

You can use open-source software like Audacity to do so; just make sure you don’t have a lot of ums, uhs, weird noises, other people talking in the background, or significant pauses of silence in your vocal samples. There are also paid services that can do this for you. Vimeo is one example: although considered a YouTube alternative for media professionals sharing content with their core clients, Vimeo has entered the AI game in its own right. With Vimeo, you can upload your source video, delete ums, uhs, and pauses (you can indicate how long a pause you wish to keep), and then download the end result.
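
If you’d rather script the cleanup than do it by hand in Audacity, here’s a minimal sketch using pydub’s silence splitter. The file names and thresholds are assumptions you’d tune by ear; always listen to the result before using it as training data.

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_file("raw_sample.wav")  # assumed input file

# Split wherever there's ~1s of near-silence, keeping a little padding
# around each speech chunk so the result doesn't sound choppy.
chunks = split_on_silence(
    audio,
    min_silence_len=1000,            # ms of quiet that counts as a pause
    silence_thresh=audio.dBFS - 16,  # quieter than average = "silence"
    keep_silence=200,                # ms of padding to keep around speech
)

# Re-join the speech chunks into one cleaned file.
cleaned = AudioSegment.empty()
for chunk in chunks:
    cleaned += chunk

cleaned.export("cleaned_sample.wav", format="wav")
```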

I asked MidJourney to design a fairy cleaning a house (for a cute “cleaning” image to go along with this journal entry) and instead it sent me a lazy elven fairy in an editorial style pose with a white rag. -_- XD

Step three: Enhance training data sound.

This step is optional, but I love processing my audio samples through Adobe Enhance Speech. You can find it here. It appears Adobe has no current plans to charge for lite usage of Enhance Speech. They do offer a paid plan through Adobe Express, which has no time limitations, but currently the limits on the free plan are generous and provide the same quality output.

If you choose to complete this step, it’s important to listen to your audio before adding it to your training data. There is an effect slider on the page to modify how much processing is done on the file. Sometimes words can be chopped off or their sound altered slightly. I love the smooth podcast effect Enhance Speech gives, but not at the cost of degrading the training data. So I don’t use it all the time.

Step four: Sign up for a Google Drive account.

It’s free and connected to your Gmail address if you have one.

Step five: Journey to Google Colab and sign into your Google account.

This is the link to the Google Colab that I currently use to train my singing voice models. You can get started on the Colab for free for lite usage, but unless you have a paid plan (plans start at around $9.99 per month), speeds for training voice models are quite slow. I wanted to keep everything in this journal article free, so I stuck with the T4 GPU (graphics processing unit) option, and a 45-second voice clip took 39 minutes to train a 400-epoch model. If you have a lot of voice models to train, it might be worth a paid monthly plan where you can use the A100 or V100 GPU options.
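
Once you’re connected to a runtime, you can confirm which GPU Colab gave you with a quick cell like this (PyTorch comes pre-installed on Colab):

```python
import torch

if torch.cuda.is_available():
    # On the free tier this typically prints something like "Tesla T4".
    print("GPU:", torch.cuda.get_device_name(0))
else:
    print("No GPU attached; check Runtime -> Change runtime type.")
```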

Step six: Familiarize yourself with Google Colab.

Though this step sounds simple, it’s probably one of the most complex steps for users new to Google Drive and/or Google Colab. So I’ll be giving a small tour of Google Drive / Google Colab to help familiarize you with the starting file hierarchy and navigation you will see when booting up this Colab for the first time.

Treat yourself to something rewarding (a bubble bath, a bowl of ice cream, etc.) for completing this section if you’re not used to working with non-GUI applications. Once you successfully complete this section you’ll be prepared to open up all kinds of Colab workspaces!

Step Six (A): How to Connect a Colab notebook to a GPU.

Step Six (B): How to Connect to Google Drive.

  1. Under the “Install to Google Drive” section, click on the Play icon to run the task of connecting your Google Drive account to Google Colab. A window will pop-up asking you to confirm and approve of the permissions at your discretion.
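
Under the hood, that cell runs Colab’s standard Drive-mounting call, which looks roughly like this if you ever need to do it yourself in another notebook:

```python
from google.colab import drive

# Mount your Google Drive at /content/drive; Colab will show
# the same permission prompt described above.
drive.mount('/content/drive')
```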

Step Six (C): How to view the /content/dataset folder.

To view the content folder shown in the Preprocess Data section, click the folder with the upward arrow on it as shown in the image above.

To upload a file to the dataset folder, right click on the dataset folder (1) and then click on the upload option (2).

There are a few options if you become lost. One option is to verify your current directory: toggle the code hide/unhide arrows and look for the line near the top that starts with %cd (the change-directory command), which shows the directory the notebook is working in.

A second option is to upload your files to a folder of your choice and then use the copy path option to copy the path to the directory so that you can paste it into the Google Colab.
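
If you’d rather check your location with code instead of hunting through the notebook, a quick cell like this shows where you are and what’s in the dataset folder:

```python
import os

print("Current directory:", os.getcwd())

dataset_dir = "/content/dataset"  # the folder this Colab expects by default
if os.path.isdir(dataset_dir):
    print("Dataset files:", os.listdir(dataset_dir))
else:
    print(dataset_dir, "doesn't exist yet; create it or upload your files first.")
```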

Step Seven: Upload your training data file to the /content/dataset folder.
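
Besides right-clicking in the file browser, you can upload from a code cell. Here’s a minimal sketch using Colab’s files helper; the dataset path matches the one the notebook expects.

```python
import os
import shutil
from google.colab import files

os.makedirs("/content/dataset", exist_ok=True)

# files.upload() opens a file picker, saves the uploads to the current
# directory, and returns a dict of {filename: bytes}. Move each file
# into the dataset folder afterwards.
uploaded = files.upload()
for name in uploaded:
    shutil.move(name, os.path.join("/content/dataset", name))
```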

Step Eight: Complete the Preprocess Data category.

  1. Name your model after your voice training file. If you have multiple voice training files for one model, zip them together and set the model_name to the zipped file’s name (see the zipping sketch after this list).
  2. You only need to modify this line if you uploaded your file to a different folder. If you did upload your voice training file to a different folder, paste the file path here. For additional help, please reference the fourth image under Step Six (C).
  3. Click on the run arrow to begin preprocessing the data.
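
As mentioned in step 1 above, if you have several training files you can zip them before uploading. Here’s a minimal sketch using Python’s built-in zipfile module; the file names are examples only.

```python
import zipfile
from pathlib import Path

# Zip several training files into one archive; the model_name in the
# Colab would then be "B3AM" to match "B3AM.zip". Names are examples.
files_to_zip = ["b3am_take1.wav", "b3am_take2.wav", "b3am_take3.wav"]

with zipfile.ZipFile("B3AM.zip", "w") as zf:
    for f in files_to_zip:
        zf.write(f, arcname=Path(f).name)
```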

Step Nine: Click on the Extract Features run arrow to run the task.

Step Ten: Click on the Train Index run arrow to run the task.

Step Eleven: Configure the Train Model section and then run the Train Model task.

  1. Sign up for an ngrok account and then retrieve your ngrok authtoken.
  2. Paste your ngrok authtoken into the Colab notebook.
  3. Re-enter your model training data name again. This will also be the name of your trained voice model .pth (a PyTorch extension containing your unique model training data) file.
  4. Tell the Colab how often you wish to save a voice model checkpoint during training. Each epoch equals one pass of your training data through the learning algorithm; so if you are saving every 50 epochs, the first saved model is produced after all of your training data has passed through the learning algorithm 50 times (see the sketch after this list).
  5. List the total number of epochs you would like to train your voice model for. Generally, more training gives better results, but past a point the extra epochs aren’t necessary. 300-500 epochs is currently considered fine among the RVC learning community.
  6. Cache is not mandatory for datasets under 10 minutes long but you can try it either way. Use_OV2 is a new feature as of this week in the Colab I use so I am currently unfamiliar with it. However, I did not run into unexpected results when I left the box checked.
  7. When you have completed steps 1-6, run the Train Model task button and watch the training magic happen!
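
To see how the save frequency and total epochs interact, here’s a tiny sketch of which checkpoint epochs you’d get with the example settings from step 4:

```python
save_every_epoch = 50
total_epoch = 400

# A checkpoint is written at each multiple of the save frequency.
checkpoints = list(range(save_every_epoch, total_epoch + 1, save_every_epoch))
print(checkpoints)  # [50, 100, 150, 200, 250, 300, 350, 400]
```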

For reference’s sake, I’ve included screenshots of the process of signing up for an ngrok account and retrieving your authtoken.

Step Eleven (A): Sign up for an ngrok account.

Step Twelve: Wait for confirmation that your model has completed training.

Step Thirteen: Confirm that your model is now saved in your Google Drive. It should be under project-main -> assets -> weights.
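
You can also confirm this from a code cell. The Drive path below is an assumption based on the folder layout described above; adjust it if your copy of the notebook installed the project elsewhere.

```python
import os

# Assumed location after mounting Drive at /content/drive.
weights_dir = "/content/drive/MyDrive/project-main/assets/weights"

# List the trained .pth checkpoint files.
print(sorted(f for f in os.listdir(weights_dir) if f.endswith(".pth")))
```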

I have a lot of models so I couldn’t get my B3AM model to show on the same screen as the file hierarchy, but from this view you can also see how the save frequency works. For example, if I were to train my B3AM model on 1000 epochs, but my save-point of 500 epochs produced the best result, I could use that saved model instead when I want to produce a voice conversion file.

Step Fourteen: Click on the additional connections option and then end the session by disconnecting and deleting the runtime.

If you completed all of the steps above, congratulations! You’ve successfully trained a voice model using RVC technology!

I went back to ask MidJourney for an illustration of a fairy wishing someone congratulations and well, this is apparently how faeries wish someone congratulations.

In the next journal entry, we’ll go over how to make a new voice file sound like the model you just completed training!
