Microsoft Azure: Text-to-Speech (TTS) REST API Tutorial

My recent work called for text-to-speech (TTS), so I took a quick look at the available technology and am sharing what I found here.

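Azure's documentation for its speech services is fairly detailed and covers many features, such as the TTS API, the TTS SDK, and text-to-speech with custom voice models. What it lacks is a high-level overview, so after reading it you may well still not know where to begin. This article walks through, step by step, how to use the Azure TTS API starting from zero (I will add SDK usage later when I have time). The goal is this: we supply a piece of text, call the API, and get back WAV audio that, when played, speaks the text we supplied.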

  • Step 1: Register a Microsoft Azure account

  • Step 2: Obtain the endpoint and keys

  • Step 3: Download the sample code

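The API is called as follows: a local program sends an HTTP request (containing the text to convert) to Microsoft's servers; after authentication, the server returns the converted audio to the local program. The calling program can be downloaded from GitHub: https://github.com/Azure-Samples/Cognitive-Speech-TTS. The Samples-Http folder contains source code in various languages, including Android, C#, Java, Node.js, PHP, Python, and Ruby. The rest of this article uses the Python version (Python 3) as the example.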

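The Python folder contains two files: TTSSample.py and README. The .py file cannot be run as-is and needs a few small changes (described below; the modified source is attached at the end of this article). Once modified, running the file carries out every step described in the previous paragraph and stores the returned result in the variable "data".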

Make the following changes to TTSSample.py (see also the README):

(1) In apiKey = "Your api key goes here", replace the text inside the quotes with Key 1 or Key 2 from the Step 2 screenshot.
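As an alternative to pasting the key into the source file, the key can be read from an environment variable. The following is a minimal sketch; the variable name AZURE_SPEECH_KEY and the helper load_api_key are hypothetical names chosen for this illustration and are not part of the Microsoft sample.

```python
import os


def load_api_key(env_var="AZURE_SPEECH_KEY"):
    """Return the subscription key stored in an environment variable.

    AZURE_SPEECH_KEY is a hypothetical name used only in this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            "Set %s to Key 1 or Key 2 from the Azure portal" % env_var
        )
    return key
```

With this helper, the assignment in TTSSample.py could become apiKey = load_api_key(), which keeps the key out of any file you might share.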

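(2) Check that the endpoint in the code matches the endpoint shown in the Step 2 screenshot. For example, the program contains AccessTokenHost = "westus.api.cognitive.microsoft.com", where "westus" identifies the endpoint (it is the region name). Compare it with the endpoint assigned to your account when you applied, such as "westus" in the Step 2 screenshot. If the two match, go on to the next step; if they differ, edit the code (change only "westus"; the rest of the host name, starting at ".api", stays as it is).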

(3) Likewise, make the same change to the line conn = http.client.HTTPSConnection("westus.tts.speech.microsoft.com"): if your endpoint is not westus, replace westus in the program with the name of your endpoint.

There are three endpoints:

West US       https://westus.tts.speech.microsoft.com/cognitiveservices/v1
East Asia     https://eastasia.tts.speech.microsoft.com/cognitiveservices/v1
North Europe  https://northeurope.tts.speech.microsoft.com/cognitiveservices/v1
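Because the region name appears in both host names used by the sample, it can help to derive them from a single value so the two never drift apart. The sketch below assumes the host name patterns seen in the sample code and in the endpoint list above; hosts_for_region is a hypothetical helper, not something the sample provides.

```python
def hosts_for_region(region):
    """Return (token_host, tts_host) for a region name such as "westus".

    The two patterns are copied from the sample code; whether they hold
    for regions not listed in this article is an assumption.
    """
    region = region.strip().lower()
    if not region.isalnum():
        raise ValueError("region must be a single word such as 'westus'")
    token_host = region + ".api.cognitive.microsoft.com"
    tts_host = region + ".tts.speech.microsoft.com"
    return token_host, tts_host
```

For example, hosts_for_region("eastasia") yields the two host names to pass to http.client.HTTPSConnection for the East Asia endpoint.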

(4) The string assigned to voice.text in the program is the text that will be converted to audio; change it to suit your needs.
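To convert different texts without editing the script each time, the SSML construction can be wrapped in a function. This sketch reuses the element names, attributes, and voice name from the sample; build_ssml and its parameter names are hypothetical.

```python
from xml.etree import ElementTree

# Clark notation for the predefined "xml:" namespace prefix
XML_NS = "{http://www.w3.org/XML/1998/namespace}"
DEFAULT_VOICE = "Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)"


def build_ssml(text, voice_name=DEFAULT_VOICE, lang="en-US", gender="Male"):
    """Build the SSML request body (as bytes) for the given text."""
    speak = ElementTree.Element("speak", version="1.0")
    speak.set(XML_NS + "lang", lang)
    voice = ElementTree.SubElement(speak, "voice")
    voice.set(XML_NS + "lang", lang)
    voice.set(XML_NS + "gender", gender)
    voice.set("name", voice_name)
    voice.text = text  # ElementTree escapes &, < and > when serializing
    return ElementTree.tostring(speak)
```

The returned bytes can be passed as the request body in place of ElementTree.tostring(body) in the sample.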

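(5) TTSSample.py can now be run. When the program runs successfully, the status it prints should be "200 OK". If an error occurs, one of the following status codes is returned: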

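Code  Description               Possible cause
400   Bad Request               A required parameter is missing, empty, or null, or the value passed to a required or optional parameter is invalid. A common cause is a header that is too long.
401   Unauthorized              The request is not authorized. Check that the subscription key or token is valid.
413   Request Entity Too Large  The SSML input exceeds 1024 characters.
502   Bad Gateway               A network or server-side problem. May also indicate invalid headers.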
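The status codes listed above can be turned into readable hints inside the script. This is only a sketch: the messages paraphrase the list above, and explain_status is a hypothetical helper.

```python
# Hints paraphrased from the status code list in this article
ERROR_HINTS = {
    400: "Bad Request: a parameter is missing, empty, null, or invalid "
         "(a header may be too long)",
    401: "Unauthorized: check that the subscription key or token is valid",
    413: "Request Entity Too Large: the SSML input exceeds 1024 characters",
    502: "Bad Gateway: network or server-side problem, or invalid headers",
}


def explain_status(status):
    """Return a short description of an HTTP status from the TTS service."""
    if status == 200:
        return "OK: the response body contains the audio"
    return ERROR_HINTS.get(status, "Unexpected status %d" % status)
```

Calling print(explain_status(response.status)) after conn.getresponse() gives a quicker diagnosis than the bare status line.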

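Assuming everything above went well and you received a 200 response, where is the converted audio? It is in the program's "data" variable. The bytes in "data" are the audio produced by TTS, and they must be written to a .wav file to obtain the final audio. The code at the end of this article shows how.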
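One detail worth checking when saving the audio: the sample requests the output format riff-24khz-16bit-mono-pcm, and a riff-* response is expected to begin with a RIFF/WAV header already. In that case the bytes can be written to disk unchanged, and passing them through the wave module would store that header a second time as audio samples. The sketch below handles both cases; save_wav is a hypothetical helper, and the defaults assume 24 kHz, 16-bit, mono audio as in the sample.

```python
import wave


def save_wav(data, path, channels=1, sample_rate=24000, sample_width=2):
    """Write TTS response bytes to a .wav file at the given path."""
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        # Already a complete WAV file: write the bytes unchanged
        with open(path, "wb") as f:
            f.write(data)
    else:
        # Headerless PCM samples: let the wave module add the header
        with wave.open(path, "wb") as f:
            f.setnchannels(channels)
            f.setframerate(sample_rate)
            f.setsampwidth(sample_width)
            f.writeframes(data)
```

Usage would be save_wav(data, "output.wav") after the synthesis response has been read.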

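Attachment: the modified TTSSample.py (you will of course need to change apiKey to your own key; the "wave" module it imports is part of the Python standard library). output.wav is the audio converted from the text.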

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

###
#Copyright (c) Microsoft Corporation
#All rights reserved. 
#MIT License
#Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the ""Software""), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
#The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
#THE SOFTWARE IS PROVIDED *AS IS*, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
###
import http.client
import wave
from xml.etree import ElementTree
# Note: new unified SpeechService API key and issue token uri is per region
# New unified SpeechService key
# Free: https://azure.microsoft.com/en-us/try/cognitive-services/?api=speech-services
# Paid: https://go.microsoft.com/fwlink/?LinkId=872236
apiKey = "Your api key goes here"

params = ""
headers = {"Ocp-Apim-Subscription-Key": apiKey}

#AccessTokenUri = "https://westus.api.cognitive.microsoft.com/sts/v1.0/issueToken";
AccessTokenHost = "westus.api.cognitive.microsoft.com"
path = "/sts/v1.0/issueToken"

# Connect to server to get the Access Token
print ("Connect to server to get the Access Token")
conn = http.client.HTTPSConnection(AccessTokenHost)
conn.request("POST", path, params, headers)
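response = conn.getresponse()
print(response.status, response.reason)

data = response.read()
conn.close()

# Stop here if no token was issued (for example, 401 for an invalid key)
if response.status != 200:
    raise SystemExit("Could not obtain an access token")

# The response body is the access token itself, as plain text
accesstoken = data.decode("UTF-8")
print("Access Token: " + accesstoken)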

body = ElementTree.Element('speak', version='1.0')
body.set('{http://www.w3.org/XML/1998/namespace}lang', 'en-us')
voice = ElementTree.SubElement(body, 'voice')
voice.set('{http://www.w3.org/XML/1998/namespace}lang', 'en-US')
voice.set('{http://www.w3.org/XML/1998/namespace}gender', 'Male')
voice.set('name', 'Microsoft Server Speech Text to Speech Voice (en-US, Guy24KRUS)')
voice.text = 'This is a demo to call microsoft text to speech service in Python.'

headers = {
    "Content-type": "application/ssml+xml",
    "X-Microsoft-OutputFormat": "riff-24khz-16bit-mono-pcm",
    "Authorization": "Bearer " + accesstoken,
    "X-Search-AppId": "07D3234E49CE426DAA29772419F436CA",
    "X-Search-ClientID": "1ECFAE91408841A480F00935DC390960",
    "User-Agent": "TTSForPython",
}
			
#Connect to server to synthesize the wave
print ("\nConnect to server to synthesize the wave")
conn = http.client.HTTPSConnection("westus.tts.speech.microsoft.com")
conn.request("POST", "/cognitiveservices/v1", ElementTree.tostring(body), headers)
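response = conn.getresponse()
print(response.status, response.reason)

data = response.read()
conn.close()

# Only a 200 response carries audio; otherwise data holds an error body
if response.status != 200:
    raise SystemExit("Synthesis request failed")
print("The synthesized wave length: %d" % len(data))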

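# Write the audio to output.wav; the parameters match the requested
# output format, riff-24khz-16bit-mono-pcm.
# Note: a riff-* response should already begin with a RIFF/WAV header, so
# writing data straight to a binary file is a simpler alternative.
with wave.open("output.wav", "wb") as f:
    f.setnchannels(1)      # mono
    f.setframerate(24000)  # sample rate: 24 kHz
    f.setsampwidth(2)      # sample width: 2 bytes (16 bits)
    f.writeframes(data)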