Easy Text-to-Speech and Voice Cloning

Open source software (OSS) and publicly accessible linguistic corpora have made text-to-speech and voice cloning very accessible to anyone willing to spend the time learning how to use, tweak, train, and/or customize a voice model. By the end of this article you will be able to have fun with text-to-speech (TTS), do some voice cloning, and gain an understanding of what is needed to create something like this for profit.

Before you think about using the specific software and model in this article for business purposes, know that you cannot. The model used here is publicly accessible, but its license does not permit commercial use.

Let’s begin.

Requirements

  • Apple Silicon Mac
  • Python 3.9

Setup

Source Code

We begin our journey by getting a copy of the coqui-tts open source software. This software comes with a pre-trained, publicly accessible voice cloning model.

git clone https://github.com/coqui-ai/tts

After getting a copy, we need to go inside the directory.

cd tts

Shell

For the most part, this works in the bash shell. Ensure that the output of the command below is -bash (or bash):

echo $0

If your default shell is not bash, switch to bash before you proceed.
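
If your default shell is something else (on recent macOS it is zsh), you can start a throwaway bash session to follow along; this does not change your default shell:

bash
echo $0   # should now print bash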

Python Virtual Environment

The Python requirement is due to the software being written in Python. In this particular scenario, we need to ensure our dependencies are specific to our software. To do this, we’ll use the built-in module venv. The more feature-packed virtualenv will also work, but for this article’s purpose, venv is enough.

python3.9 -m venv .tts 

The .tts above is an arbitrary name that makes it easier to track which Python environment we are operating in. It is also typically what you want to name your environment; we’ll refer to it as the environment name.

Activating Environment

This step runs a script that sets environment variables and paths so that the Python interpreter and packages you use are effectively scoped to this specific project only. To activate an environment, you need to be in the directory where you created it.

source .tts/bin/activate

After executing the command above, you will notice that your terminal prompt is now prefixed with the environment name, in this case .tts, and your prompt will look like:

(.tts) ~/tts  $ 
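
To double-check that the environment is active, you can confirm which Python binary is being used (the path below is illustrative; yours will differ):

which python
# expected to point inside the environment, e.g. ~/tts/.tts/bin/python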

Deactivating Environment

Whenever you stop working on this mini-project or start working on a different Python project, you need to deactivate this environment. This ensures that other projects’ requirements do not mess up this environment.

deactivate

Executing the command above restores the global Python environment paths/variables and restores your prompt. Unlike activating an environment, you can deactivate it from anywhere in the system.

TTS Installation

The steps below assume that you have your virtual environment activated.

TTS Requirements

Most, if not all, applications have requirements and dependencies, and it’s now time to install them. Python projects typically have a requirements.txt file generated so that development, staging, and production systems can be recreated to match exactly what has worked so far. Some projects have several requirements files, suffixed based on the environment they are meant to run in. This project is an example of how that is done in Python.
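
For instance, listing the requirements files in the repository shows the convention (the exact filenames may vary between versions of the project):

ls requirements*.txt
# e.g. requirements.txt, requirements.dev.txt, requirements.notebooks.txt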

Since we are not developing or contributing features to this project, we only need to be concerned with the requirements file for running the application locally.

pip install -r requirements.txt

If you’re the type to read through the whole article or how-to before executing anything, you might think you can skip this step and jump straight to Install to Environment. Do not skip it; some modules will fail to build if you do.

Install to Environment

Finally time to install this application!

make install

A lot goes into this step, but I won’t write about it here. If you’re curious why make is a staple of most build steps, take a look at the Makefile and go down the rabbit hole.

After make install completes, the tts command is available in the environment.
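
You can verify that the command resolves to the one inside the virtual environment (again, the path is illustrative):

which tts
# expected to point inside the environment, e.g. ~/tts/.tts/bin/tts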

Command Options/Flag

Did I put “Easy” in the title? It is; just follow along. Now that our shell, environment, and application are set up correctly, we should be able to run tts:

tts

Executing the above should show the options/flags available for this particular program. The following are the ones we are interested in for now; we will assemble them into a full command below.

  • --model_name (the voice synthesis model to be used)
  • --language_idx (the language code, e.g.: en, ja)
  • --text (the text to be converted to speech)
  • --use_cuda (default false, set to true only if you have a GPU with CUDA)
  • --speaker_wav (audio sources or voice recordings of the person whose voice you want to clone, 6 seconds or slightly longer)
  • --out_path (the path and file name of the output, commonly a .wav)
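
Putting those together, the general shape of the command we are building toward looks like this (the square brackets are placeholders, not literal syntax):

tts --model_name [model] --language_idx [lang] --speaker_wav [source.wav] --text "[text]" --out_path [output.wav]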

--model_name

As promised in the article’s title, we are doing voice cloning. The model we are interested in is xtts_v2, the model that gained significant social media buzz around the end of Q1 2024. It boasts an impressively small source requirement: you only need 6 seconds of someone’s voice to clone them. In my exercises with this application, I found that using multiple sources and paying attention to audio settings/quality matters. I’ll talk about that later.

To get a list of model names, execute:

tts --list_models

and the model we are interested in is: tts_models/multilingual/multi-dataset/xtts_v2

--language_idx

To get a list of language idxs for the model we are interested in, execute:

tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --list_language_idxs

Once you confirm that the target language you need is supported, e.g. ja for Japanese, our command will now look like:

tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --language_idx ja

--text

This is the text that we would want to be synthesized into voice.

tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --language_idx ja --text "チュートリアルに従っていただきありがとうございます!" 

Sadly, the system we are playing around with doesn’t support SSML (Speech Synthesis Markup Language), which was designed to control various aspects of speech synthesis such as pronunciation, volume, pitch, speed, and intonation.

--use_cuda

One of the major hurdles for voice synthesis is the availability of CUDA- or ROCm-capable graphics cards, given their price together with their power consumption and running costs.

On an Apple Silicon Mac, we are still able to do voice synthesis with --use_cuda false, or by simply not specifying this flag, as it is false by default.

On a non-Apple Silicon Mac, or any other PC without a GPU, your system may not be able to run tts.
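
If you are unsure whether your machine has a usable CUDA device, a quick check through PyTorch (which this project depends on) looks like this:

python -c "import torch; print(torch.cuda.is_available())"
# prints True only if a CUDA-capable GPU and its drivers are present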

--speaker_wav

This is an integral part of the cloning process. While this application’s source code and documentation only call for referencing one file to clone a voice, my experience playing with it has me collecting at least 3 good speaker WAV files to get a good voice resemblance.

The source recordings should contain very minimal word pauses. I had to manually edit the source files to trim very long pauses between words; if I did not, the synthesized voice would sound very robotic. You can use Audacity, or find a program to programmatically trim pauses between words (not pauses between sentences), as in the sketch below.
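
If you prefer the command line over Audacity, sox can truncate long silences. This is a sketch using sox’s silence effect; input.wav and output.wav are example filenames, and the thresholds/durations are starting points you will likely need to tune per recording:

# shorten any stretch of near-silence (below 1% amplitude) longer than ~0.5s
sox input.wav output.wav silence -l 1 0.1 1% -1 0.5 1%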

I also found that a stereo source would sometimes lead to synthesized voices with echo or reverberation.
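
Downmixing a stereo source to mono with ffmpeg is straightforward (the filenames are examples):

ffmpeg -i stereo_source.wav -ac 1 mono_source.wav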

To reference multiple source WAV files, use a wildcard *.wav, e.g.: training/speaker_name/*.wav

tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --language_idx ja --speaker_wav training/speaker_name/*.wav --text "チュートリアルに従っていただきありがとうございます!" 

--out_path

This is simply the output path, relative or absolute, including the filename.

tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --language_idx ja --speaker_wav training/speaker_name/*.wav --text "チュートリアルに従っていただきありがとうございます!" --out_path thank-you.wav

The output is a WAVE file. You can convert it later on using ffmpeg as needed.
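
For example, converting the WAVE output to MP3 is a one-liner (the 192k bitrate is an arbitrary choice):

ffmpeg -i thank-you.wav -b:a 192k thank-you.mp3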

First Run

Now that we have the full command, executing it for the first time with each model will prompt you with something similar to the following:

 > You must confirm the following:
 | > "I have purchased a commercial license from Coqui: licensing@coqui.ai"
 | > "Otherwise, I agree to the terms of the non-commercial CPML: https://coqui.ai/cpml" - [y/n]
 | | > 

Agree, or you will not be able to use the model. As I informed you at the start of this article, the use of this model is for non-commercial purposes only.

The prompt, however, also points us to where we can inquire about a commercial use license.

xtts_v2 is 1.87 GB in size. Upon agreeing and pressing enter, the needed model will be downloaded automatically.

> Downloading model to /Users/[user]/Library/Application Support/tts/tts_models--multilingual--multi-dataset--xtts_v2
 100%|██████████████████████████████████████████████████████| 1.87G/1.87G
 100%|██████████████████████████████████████████████████████| 4.37k/4.37k
 100%|██████████████████████████████████████████████████████| 361k/361k
 100%|██████████████████████████████████████████████████████| 32.0/32.0
 100%|██████████████████████████████████████████████████████| 7.60M/7.75M 

Output

All that’s left for you to do now is to evaluate if the synthesized voice sounds like the person being voice cloned.

The file will be at the location you specified in the --out_path option.
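
On macOS you can listen to the result straight from the terminal:

afplay thank-you.wav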

Easy, Right?

Easy, right?! Have fun voice cloning. Not yet convinced? Alright! Execute:

pip3.9 install tts

This would have installed the package quickly and easily, but where’s the adventure and fun in that?

With direct access to the source, you can learn about the components of Natural Language Processing (NLP), keep studying them for future use, or learn how to create/train a custom model.

Use-Cases

  • Dynamic Birthday Greeting
  • Home Assistant Voice
  • Loved One Greeting
  • Fan Greeting

Disclaimer

The author, Fullspeed Technologies Inc., and coqui.ai bear no responsibility for any misuse of the tools or technology used in this article.