AudioLM: a Language Modeling Approach to Audio Generation – Google AI Blog


Generating realistic audio requires modeling information represented at different scales. For example, just as music builds complex musical phrases from individual notes, speech combines temporally local structures, such as phonemes or syllables, into words and sentences. Creating well-structured and coherent audio sequences at all these scales is a challenge that has been addressed by coupling audio with transcriptions that can guide the generative process, be it text transcripts for speech synthesis or MIDI representations for piano. However, this approach breaks down when trying to model untranscribed aspects of audio, such as the speaker characteristics needed to help people with speech impairments recover their voice, or the stylistic components of a piano performance.

In “AudioLM: a Language Modeling Approach to Audio Generation”, we propose a new framework for audio generation that learns to generate realistic speech and piano music by listening to audio only. Audio generated by AudioLM demonstrates long-term consistency (e.g., syntax in speech, melody in music) and high fidelity, outperforming previous systems and pushing the frontiers of audio generation with applications in speech synthesis or computer-assisted music. Following our AI Principles, we've also developed a model to identify synthetic audio generated by AudioLM.

From Text to Audio Language Models

In recent years, language models trained on very large text corpora have demonstrated exceptional generative abilities, from open-ended dialogue to machine translation and even common-sense reasoning. They have further shown their capacity to model signals other than text, such as natural images. The key intuition behind AudioLM is to leverage such advances in language modeling to generate audio without being trained on annotated data.

However, some challenges need to be addressed when moving from text language models to audio language models. First, one must cope with the fact that the data rate for audio is significantly higher, which leads to much longer sequences: while a written sentence can be represented by a few dozen characters, its audio waveform typically contains hundreds of thousands of values. Second, there is a one-to-many relationship between text and audio. This means that the same sentence can be rendered by different speakers with different speaking styles, emotional content and recording conditions.
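To make the sequence-length gap concrete, here is a back-of-the-envelope calculation. The 16 kHz sample rate, the 10-second duration and the example sentence are illustrative assumptions, not figures from this post.

```python
# Illustrative assumptions (not from the post): 16 kHz mono audio, 10-second utterance.
SAMPLE_RATE_HZ = 16_000
DURATION_S = 10

# Hypothetical sentence standing in for "a few dozen characters".
sentence = "The quick brown fox jumps over the lazy dog."

num_characters = len(sentence)
num_samples = SAMPLE_RATE_HZ * DURATION_S  # one value per waveform sample

ratio = num_samples / num_characters
print(f"{num_characters} characters vs. {num_samples:,} waveform samples")
print(f"the waveform is roughly {ratio:,.0f}x longer than the text")
```

Under these assumptions the waveform is several thousand times longer than the text, which is why a text-style language model cannot be applied to raw samples directly.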

To overcome both challenges, AudioLM leverages two kinds of audio tokens. First, semantic tokens are extracted from w2v-BERT, a self-supervised audio model. These tokens capture both local dependencies (e.g., phonetics in speech, local melody in piano music) and global long-term structure (e.g., language syntax and semantic content in speech, harmony and rhythm in piano music), while heavily downsampling the audio signal to allow for modeling long sequences.

However, audio reconstructed from these tokens demonstrates poor fidelity. To overcome this limitation, in addition to semantic tokens, we rely on acoustic tokens produced by a SoundStream neural codec, which capture the details of the audio waveform (such as speaker characteristics or recording conditions) and allow for high-quality synthesis. Training a system to generate both semantic and acoustic tokens leads simultaneously to high audio quality and long-term consistency.

Training an Audio-Only Language Model

AudioLM is a pure audio model that is trained without any text or symbolic representation of music. AudioLM models an audio sequence hierarchically, from semantic tokens up to fine acoustic tokens, by chaining several Transformer models, one for each stage. Each stage is trained for next-token prediction based on past tokens, as one would train a text language model. The first stage performs this task on semantic tokens to model the high-level structure of the audio sequence.
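As a minimal sketch of what next-token prediction means for a token sequence, the snippet below turns a sequence into (past tokens, next token) training pairs. The helper name `next_token_pairs` and the token IDs are hypothetical; the actual training setup is not described at this level of detail in the post.

```python
from typing import List, Tuple


def next_token_pairs(tokens: List[int]) -> List[Tuple[List[int], int]]:
    """Return (past tokens, next token) pairs for every position after the first."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


# Hypothetical semantic token IDs for a short clip.
semantic_tokens = [17, 4, 4, 92, 31]

for context, target in next_token_pairs(semantic_tokens):
    print(f"past={context} -> next={target}")
```

A model trained on such pairs learns to predict each token from everything that precedes it, which is the same objective used for text language models.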

In the second stage, we concatenate the entire semantic token sequence with the past coarse acoustic tokens, and feed both as conditioning to the coarse acoustic model, which then predicts the future tokens. This step models acoustic properties such as speaker characteristics in speech or timbre in music.

In the third stage, we process the coarse acoustic tokens with the fine acoustic model, which adds even more detail to the final audio. Finally, we feed the acoustic tokens to the SoundStream decoder to reconstruct a waveform.
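The three stages can be sketched as a chain of autoregressive generators. This is a toy illustration under stated assumptions: the random stand-in models, the vocabulary size of 1024, the prompt tokens and the sequence lengths are all hypothetical, and a real system would use trained Transformers and keep semantic and acoustic token vocabularies distinct.

```python
import random
from typing import Callable, List

# A "model" here is any function mapping the tokens seen so far to the next token ID.
NextTokenFn = Callable[[List[int]], int]


def generate(model: NextTokenFn, conditioning: List[int],
             prompt: List[int], num_new: int) -> List[int]:
    """Autoregressively extend `prompt`; each step sees conditioning + past tokens."""
    tokens = list(prompt)
    for _ in range(num_new):
        tokens.append(model(conditioning + tokens))
    return tokens


def make_random_model(vocab_size: int, seed: int) -> NextTokenFn:
    """Stand-in for a trained Transformer: ignores its input and samples uniformly."""
    rng = random.Random(seed)
    return lambda _context: rng.randrange(vocab_size)


semantic_model = make_random_model(vocab_size=1024, seed=0)
coarse_model = make_random_model(vocab_size=1024, seed=1)
fine_model = make_random_model(vocab_size=1024, seed=2)

# Stage 1: continue the semantic tokens of the prompt.
semantic = generate(semantic_model, conditioning=[], prompt=[5, 9, 2], num_new=6)
# Stage 2: coarse acoustic tokens, conditioned on the entire semantic sequence
# plus the past coarse acoustic tokens.
coarse = generate(coarse_model, conditioning=semantic, prompt=[40, 41], num_new=6)
# Stage 3: fine acoustic tokens, conditioned on the coarse acoustic tokens.
fine = generate(fine_model, conditioning=coarse, prompt=[], num_new=8)

# A real system would now pass the acoustic tokens to the SoundStream decoder.
print(len(semantic), len(coarse), len(fine))
```

The point of the sketch is the data flow: each stage conditions on the output of the previous one, so high-level structure is fixed before acoustic detail is filled in.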

After training, one can condition AudioLM on a few seconds of audio, which enables it to generate a consistent continuation. To showcase the general applicability of the AudioLM framework, we consider two tasks from different audio domains:

  • Speech continuation, where the model is expected to retain the speaker characteristics, prosody and recording conditions of the prompt while producing new content that is syntactically correct and semantically consistent.
  • Piano continuation, where the model is expected to generate piano music that is coherent with the prompt in terms of melody, harmony and rhythm.

In the video below, you can listen to examples where the model is asked to continue either speech or music and generate new content that was not seen during training. As you listen, note that everything you hear after the gray vertical line was generated by AudioLM and that the model has never seen any text or musical transcription, but rather just learned from raw audio. We release more samples on this webpage.

To validate our results, we asked human raters to listen to short audio clips and decide whether each one was an original recording of human speech or a synthetic continuation generated by AudioLM. Based on the ratings collected, we observed a 51.2% success rate, which is not statistically significantly different from the 50% success rate achieved when assigning labels at random. This means that speech generated by AudioLM is hard to distinguish from real speech for the average listener.
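As a rough illustration of how such a comparison against chance can be checked, the sketch below computes a two-sided p-value with the normal approximation to the binomial. The number of ratings (1000) is a hypothetical placeholder, since the post reports only the 51.2% rate, and the test actually used by the authors is not stated here.

```python
import math


def two_sided_p_value(successes: int, trials: int, p0: float = 0.5) -> float:
    """Normal-approximation p-value for H0: the true success rate equals p0."""
    observed = successes / trials
    standard_error = math.sqrt(p0 * (1 - p0) / trials)
    z = (observed - p0) / standard_error
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical sample size: 512 correct answers out of 1000 ratings (51.2%).
trials = 1000
successes = 512

p = two_sided_p_value(successes, trials)
print(f"p-value = {p:.3f}")
print("significantly different from chance" if p < 0.05 else "consistent with chance")
```

With a sample of this assumed size, a 51.2% rate is well within what random guessing would produce; a much larger sample would be needed for the same rate to count as significant.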

Our work on AudioLM is for research purposes and we have no plans to release it more broadly at this time. In alignment with our AI Principles, we sought to understand and mitigate the possibility that people could misinterpret the short speech samples synthesized by AudioLM as real speech. For this purpose, we trained a classifier that can detect synthetic speech generated by AudioLM with very high accuracy (98.6%). This shows that despite being (almost) indistinguishable to some listeners, continuations generated by AudioLM are very easy to detect with a simple audio classifier. This is a crucial first step to help protect against the potential misuse of AudioLM, with future efforts potentially exploring technologies such as audio “watermarking”.

Conclusion

We introduce AudioLM, a language modeling approach to audio generation that provides both long-term coherence and high audio quality. Experiments on speech generation show not only that AudioLM can generate syntactically and semantically coherent speech without any text, but also that continuations produced by the model are almost indistinguishable from real speech by humans. Moreover, AudioLM goes well beyond speech and can model arbitrary audio signals such as piano music. This encourages future extensions to other types of audio (e.g., multilingual speech, polyphonic music, and audio events) as well as integrating AudioLM into an encoder-decoder framework for conditioned tasks such as text-to-speech or speech-to-speech translation.

Acknowledgments

The work described here was authored by Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi and Neil Zeghidour. We’re grateful for all discussions and feedback on this work that we received from our colleagues at Google.
