TikTok Parent Unveils PolyVoice, Speech-to-Speech Translation with Language Models

ByteDance, TikTok’s parent company, is stepping up its speech-to-speech translation (S2S) efforts with PolyVoice, a newly proposed language model-based framework. In a research paper published on June 13, 2023, the China-based tech company introduces a decoder-only model that enables direct translation, diverging from the encoder-decoder modeling that remains prevalent in speech modeling.

As noted in the Slator Interpreting Services and Technology Report, published in late 2022, research and development activity in S2S translation is booming. Meta has contributed to data collection through the release of a large-scale multilingual corpus. Rival tech giant Google has been active in technological development, as demonstrated by the release of its fully unsupervised Translatotron 3 model.

A key feature of PolyVoice is its ability to generate and use “discretized speech units”, which transform the continuous stream of spoken language into digestible, intelligible fragments. Moreover, this process takes place in a fully unsupervised manner. It filters out the important information inherent in the speech and represents it in small chunks called semantic units. This feature is particularly useful for languages with no writing system, since text-based approaches are usually inadequate for these languages.

The Two Pillars

PolyVoice integrates two language models: a translation language model and a speech synthesis language model. The first model is responsible for conveying the meaning of the source speech in the target language. The second model, in turn, generates the target speech, making sure that the output mimics the voice and other characteristics of the source speaker.

The ability to clone the voice and speaking style of the original speech comes from an approach based on VALL-E X, Microsoft’s voice replication model, known for its ability to reproduce the nuances of human speech. The system merges the semantic units of the original and translated content with the source audio elements. This combined sequence is processed by an audio language model, which predicts how the translated content should sound. Finally, the model transforms these audio predictions into a playback-ready format, effectively synthesizing the translated speech.

PolyVoice departs from the conventional two-step encoder-decoder model. It relies on a novel decoder-only approach, which makes it possible to translate the source speech into the target language without intermediate representations. This is an attempt to streamline the translation process, which may lead to lower latency and more natural output.
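To make the idea of discretized speech units and the merged input sequence more concrete, here is a minimal Python sketch. It is not PolyVoice's implementation: the codebook, the toy feature values, the merging of consecutive repeated units, and the separator tokens `<src>`, `<tgt>`, and `<audio>` are all assumptions made for illustration. In a real system the codebook would typically be learned by clustering features from a speech model rather than written by hand.

```python
import math


def nearest_unit(frame, codebook):
    """Return the index of the codebook vector closest to `frame`."""
    return min(range(len(codebook)), key=lambda k: math.dist(frame, codebook[k]))


def discretize(frames, codebook):
    """Map each continuous frame to a unit ID, merging consecutive repeats.

    Merging repeats is an assumption of this sketch, not a detail
    taken from the article.
    """
    units = []
    for frame in frames:
        unit = nearest_unit(frame, codebook)
        if not units or units[-1] != unit:
            units.append(unit)
    return units


def build_prompt(source_units, target_units, source_audio_tokens):
    """Concatenate the three streams into one decoder-only input sequence.

    The separator tokens are hypothetical placeholders.
    """
    return ["<src>", *source_units, "<tgt>", *target_units,
            "<audio>", *source_audio_tokens]


# Toy 2-D features and a 3-entry codebook (all values invented).
CODEBOOK = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
FRAMES = [(0.1, 0.0), (0.0, 0.2), (0.9, 1.1), (1.9, 0.1), (2.1, -0.1)]
UNITS = discretize(FRAMES, CODEBOOK)
```

Here `discretize` stands in for the unsupervised unit extraction, and `build_prompt` mirrors the step in which source units, translated units, and source audio elements are combined before the audio language model predicts the output speech.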

Unwritten, No Problem

From the perspective of global communication, the most notable takeaway from PolyVoice is its ability to support unwritten languages. It can open up new communication possibilities for communities whose languages have been predominantly oral. Furthermore, PolyVoice’s audio language model makes it possible to retain the original speaker’s voice and style, making the translations feel more natural and personal.

From a modeling standpoint, the decoder-only approach could have a lasting impact on speech translation by avoiding well-known problems associated with conventional modeling, such as error propagation, latency, and loss of paralinguistic information.
