We would like to share our latest automation and artificial intelligence initiative: we decided to tackle the challenge of translating training materials using tools such as OpenAI's Whisper and Tortoise TTS. The article below presents not only the theoretical concepts but also our practical experience. It's a step toward the future of content management in a multilingual environment, and an example of how technology can support and streamline business processes, helping companies adapt to the demands of today's global market.
Our challenge is that we have around 40 GB of training videos in Polish which need to be translated to English. Doing this by hand would be either an extremely tedious, time-consuming task or an expensive one to outsource. It would also be a temporary fix: repeating the process for new material would bring back the same dilemma. So we decided to be smarter about it. There are enough free and open source projects to assemble an automated solution that makes a computer do the tedious work, which is exactly what computers were invented for.
To create an English video from a Polish one we need to detect speech, translate it and then synthesize new speech. OpenAI's Whisper is a great tool for detecting speech in a video, and it can also translate it from any language into English. Luckily, Polish is one of the languages with the most hours of audio in Whisper's training set, which results in some of the best speech recognition performance[1]. Whisper returns (among other data) the detected text and its timestamp. The text is then passed to Tortoise TTS, which returns audio that is overlaid on top of the video at the timestamp using pydub and FFmpeg. Do that in a loop for all files and voilà, your translation is going to be ready one day. One day, because the tool is slow. There are two major bottlenecks: its memory requirements and Tortoise itself. The latter is simple and unavoidable: Tortoise provides excellent voice quality at the cost of inference speed, hence the name. To solve the former you just need to invest in more computing power, as running the large Whisper model and Tortoise simultaneously requires around 13.5 GB of VRAM. We only have 12 GB, so some of the computation is done on the CPU, which is significantly slower.
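To give a concrete picture of that loop, here is a minimal sketch of the pipeline, assuming the openai-whisper, tortoise-tts and pydub packages; the file names are placeholders, and for brevity no reference voice is passed to Tortoise, so it picks a random one.

```python
import whisper
import torchaudio
from pydub import AudioSegment
from tortoise.api import TextToSpeech

# Speech recognition + translation: task="translate" makes Whisper
# emit English text regardless of the source language.
model = whisper.load_model("large")
result = model.transcribe("video.mp4", task="translate")

tts = TextToSpeech()  # no voice samples given, so a random voice is used
track = AudioSegment.silent(duration=len(AudioSegment.from_file("video.mp4")))

for seg in result["segments"]:  # each segment has "start", "end", "text"
    speech = tts.tts_with_preset(seg["text"], preset="standard")
    torchaudio.save("segment.wav", speech.squeeze(0).cpu(), 24000)  # Tortoise outputs 24 kHz
    clip = AudioSegment.from_wav("segment.wav")
    track = track.overlay(clip, position=int(seg["start"] * 1000))  # pydub positions in ms

track.export("translated.wav", format="wav")
# Finally, mux the new track back into the video, e.g.:
#   ffmpeg -i video.mp4 -i translated.wav -map 0:v -map 1:a -c:v copy out.mp4
```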
Both Whisper and Tortoise TTS occasionally make mistakes, but those problems can be mitigated. Any text-to-speech system (with no internal text preprocessing) is going to struggle with acronyms. You can tell from context whether "THE" is the definite article or an acronym, but text-to-speech models have no concept of context. Thankfully, like every transformer-based model, Whisper does, and it capitalizes the letters if it's an acronym, leaving them lowercase otherwise. Because of that we can create a regular expression that finds every run of uppercase letters followed by any number of digits and simply adds spaces between all of the characters, as in the sketch below. This makes Tortoise read it letter by letter, just the way you would read an acronym. Rarely, it still pronounces things wrong, but this is unavoidable at the current level of (free and open source) text-to-speech models. Then there are mistakes made by Whisper itself. Recognizing words from speech is no easy task, as there are countless accents and pronunciation isn't consistent, even for a single speaker. Surprisingly, we noted no problems with word recognition. There are mistakes in translation, but their nature is linguistic.
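A minimal sketch of that preprocessing step follows; the exact pattern in our tool may differ, and the two-letter minimum is an assumption so that ordinary single capitals like "I" are left alone.

```python
import re

# A run of two or more uppercase letters, optionally followed by digits,
# e.g. "ERP" or "OIS302" (two-letter minimum so single capitals survive).
ACRONYM = re.compile(r"\b[A-Z]{2,}\d*\b")

def space_out_acronyms(text: str) -> str:
    # "OIS302" -> "O I S 3 0 2", which makes the TTS read it letter by letter
    return ACRONYM.sub(lambda m: " ".join(m.group(0)), text)

print(space_out_acronyms("The OIS302 application is very useful."))
# -> "The O I S 3 0 2 application is very useful."
```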
Let's look at a prominent example from the realm of ERP: the Polish word “lot”. In general it means “flight”, but ERP professionals also use it to mean “batch number”. While it's possible for a model to know that, you can't expect a general-purpose model to know any particular jargon. It should be possible to fine-tune the model so that it learns the specific terminology, but we deemed it too much commitment for a small and, frankly, insignificant improvement. The target audience for the training videos is experienced personnel, who will be able to work out the meaning from context, given that the problem doesn't occur too frequently. There are two significant problems with Whisper itself. Sometimes it repeats a line multiple times; let's look at an example.
“We can see that the OIS302 application is a very useful application.” is the speech between 1:45 and 1:51, but Whisper outputs four different segments whose text is either “The OIS302 application is a very useful application.” or the original sentence, with time ranges that add up to the original period. This wouldn't be much of a problem if we only wanted to create subtitles, but with text to speech we get four overlapping audio clips with almost the same text, which is incomprehensible. Thankfully Whisper doesn't make typos, so merging it all into one segment is pretty straightforward, as in the sketch below.
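Here is a sketch of that merge over Whisper's segment dictionaries; exact-match comparison is an assumption (our actual check may be more lenient, e.g. treating a segment whose text is contained in its neighbour as a repeat).

```python
def merge_repeats(segments):
    """Collapse consecutive segments carrying the same text into one
    segment spanning their combined time range (sketch; `segments` are
    Whisper dicts with "start", "end" and "text")."""
    merged = []
    for seg in segments:
        if merged and seg["text"].strip() == merged[-1]["text"].strip():
            merged[-1]["end"] = seg["end"]  # extend the previous segment
        else:
            merged.append(dict(seg))  # copy, to leave the input untouched
    return merged
```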
Another problem with OpenAI's model is that it sometimes splits a sentence word by word into segments whose length equals the minimal resolution, which is 1 s. While technically correct, this is very hard to listen to, especially for an extended period of time.
To prevent the audience from going insane we decided to merge sentences on punctuation: if a segment doesn't end with a punctuation mark, the next segment is concatenated to it.
This condition is evaluated until punctuation is found or the maximum number of tokens in a segment, equal to 50, is reached. Apart from the fact that Tortoise has a maximum number of tokens it can synthesize, speech with no breaks at all is almost as tiring as speech with too many of them. We also experimented with a few other methods of improving the audience's experience, but they didn't provide any improvement.
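Below is a minimal sketch of that merging rule; a plain word count stands in for the real tokenizer, and the 50-token cap matches the limit mentioned above.

```python
SENTENCE_END = (".", "!", "?")
MAX_TOKENS = 50  # cap so merged segments stay within what Tortoise can synthesize

def merge_on_punctuation(segments):
    """Concatenate consecutive Whisper segments until one ends with
    sentence punctuation or the token cap would be exceeded (sketch;
    word count is a stand-in for the real tokenizer)."""
    merged = []
    for seg in segments:
        if (merged
                and not merged[-1]["text"].rstrip().endswith(SENTENCE_END)
                and len((merged[-1]["text"] + " " + seg["text"]).split()) <= MAX_TOKENS):
            merged[-1]["text"] = merged[-1]["text"].rstrip() + " " + seg["text"].lstrip()
            merged[-1]["end"] = seg["end"]
        else:
            merged.append(dict(seg))
    return merged
```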
The tool is built by combining multiple open source software projects, so it is only right that it is open source as well. You can find the tool on our GitHub, together with instructions on how to use it.