
An Introduction to Neural Network Speech Synthesis Models

I have recently been working on speech synthesis projects and have studied several end-to-end neural network speech synthesis models. This post gives a brief introduction to them, covering the following topics:

- Overview of speech synthesis
- Linear spectrograms and mel spectrograms
- Tacotron
- Deep Voice 3
- Tacotron 2
- WaveNet
- Parallel WaveNet
- ClariNet
- Summary

 

Overview of Speech Synthesis

Speech synthesis, or Text-to-Speech (TTS), converts a piece of text into a speech signal. Within AI systems it bridges natural language processing and speech technology, and plays a key role in voice-centric applications such as smart speakers, children's chatbots, and intelligent voice customer service.

Research on speech synthesis began after computers became widespread in the 1980s. Classical approaches are mainly based on concatenating recorded units and then adjusting prosodic parameters such as intonation, pausing, and stress; they draw heavily on phonetics and acoustics, which poses a fairly high data and knowledge barrier for algorithm engineers coming from other backgrounds. In March 2017, however, Google proposed the end-to-end Tacotron model [1], which significantly lowered this barrier: once the speech audio has been transcribed into text, a seq2seq model can directly learn the mapping between text and the speech spectrogram. A vocoder algorithm such as Griffin-Lim, WORLD, or WaveNet then converts the spectrogram into a waveform. This article introduces several mainstream deep neural network speech synthesis models.
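As a concrete illustration of that last step, the Griffin-Lim algorithm mentioned above reconstructs a waveform from a magnitude spectrogram by iteratively re-estimating the phase that the spectrogram discards. A minimal sketch using NumPy and SciPy follows; the STFT parameters (`nperseg`, `noverlap`) and iteration count are illustrative choices, not values from any of the papers cited here:

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(magnitude, n_iter=50, nperseg=512, noverlap=384):
    """Reconstruct a time-domain signal from a magnitude spectrogram
    by iterative phase estimation (Griffin & Lim style)."""
    rng = np.random.default_rng(0)
    # start from a random phase guess
    angles = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        # go to the time domain with the current phase estimate...
        _, y = istft(magnitude * angles, nperseg=nperseg, noverlap=noverlap)
        # ...then re-analyze, keep only the phase, and repeat
        _, _, spec = stft(y, nperseg=nperseg, noverlap=noverlap)
        angles = np.exp(1j * np.angle(spec))
    _, y = istft(magnitude * angles, nperseg=nperseg, noverlap=noverlap)
    return y

# usage: analyze a 440 Hz tone, discard the phase, and reconstruct it
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
_, _, S = stft(x, nperseg=512, noverlap=384)
y = griffin_lim(np.abs(S))
```

In a TTS pipeline the magnitude spectrogram would come from the seq2seq model rather than from analyzing real audio; Griffin-Lim is the simplest of the three vocoders named above, at the cost of audible artifacts compared to WORLD or WaveNet.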

 

The references used throughout this series are collected here:

References:

[1] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. Tacotron: Towards end-to-end speech synthesis. In Interspeech, 2017.

[2] https://haythamfayek.com/2016/04/21/speech-processing-for-machine-learning.html

[3] Sercan Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Jonathan Raiman, Shubho Sengupta, and Mohammad Shoeybi. Deep voice: Real-time neural text-to-speech. arXiv preprint arXiv:1702.07825, 2017.

[4] Jose Sotelo, Soroush Mehri, Kundan Kumar, João Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. In ICLR 2017 workshop submission, 2017.

[5] Jason Lee, Kyunghyun Cho, and Thomas Hofmann. Fully character-level neural machine translation without explicit segmentation. arXiv preprint arXiv:1610.03017, 2016.

[6] Sercan Ömer Arik, Mike Chrzanowski, Adam Coates, Gregory Frederick Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Y. Ng, Jonathan Raiman, Shubho Sengupta, Mohammad Shoeybi: Deep Voice: Real-time Neural Text-to-Speech. ICML 2017: 195-204

[7] Sercan Ömer Arik, Gregory F. Diamos, Andrew Gibiansky, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou: Deep Voice 2: Multi-Speaker Neural Text-to-Speech. CoRR abs/1705.08947 (2017)

[8] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan Ö. Arık, Ajay Kannan, Sharan Narang: Deep Voice 3: 2000-Speaker Neural Text-to-Speech. CoRR abs/1710.07654 (2017)

[9] Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122v2, 2017.

[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. arXiv:1706.03762, 2017.

[11] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP, 2018.

[12] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

[13] van den Oord, Aaron, Kalchbrenner, Nal, Vinyals, Oriol, Espeholt, Lasse, Graves, Alex, and Kavukcuoglu, Koray. Conditional image generation with PixelCNN decoders. CoRR, abs/1606.05328, 2016

[14] https://github.com/buriburisuri/speech-to-text-wavenet

[15] Tom Le Paine, Pooya Khorrami, Shiyu Chang, Yang Zhang, Prajit Ramachandran, Mark A. Hasegawa-Johnson, and Thomas S. Huang. Fast wavenet generation algorithm. CoRR, abs/1611.09482, 2016.

[16] https://devblogs.nvidia.com/nv-wavenet-gpu-speech-synthesis/

[17] A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In ICML, 2018.

[18] Diederik P Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.