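To talk to a large language model by voice, you need speech recognition (Voice2Text) and speech output (Text2Voice). The technology feels fairly mature by now, and many organizations in China are working on it, yet finding something that is easy to try out turned out to be surprisingly hard. Google is blocked, Microsoft requires registration, and there is very little documentation for the domestic options, so in the end I chose OpenAI's Whisper.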
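About Whisper
Whisper is a speech processing system that OpenAI released in September 2022. It is trained primarily on English but supports 99 languages, including Chinese.
It comes in five model sizes, from smallest to largest: tiny, base, small, medium and large, to suit different scenarios.
The large model is about 2.88 GB, while the base model is around 140 MB. In my tests the large model was noticeably slow and the base model was fast.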
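To make the size trade-off concrete, here is a minimal sketch of a helper that picks the largest model fitting a memory budget. The parameter counts and VRAM figures are the approximate values listed in the Whisper README and should be treated as rough guidance; `pick_model` and `WHISPER_MODELS` are hypothetical names, not part of the `whisper` package.

```python
# Approximate figures from the Whisper README (rough guidance only).
# name -> (parameters in millions, approximate VRAM needed in GB)
WHISPER_MODELS = {
    "tiny": (39, 1),
    "base": (74, 1),
    "small": (244, 2),
    "medium": (769, 5),
    "large": (1550, 10),
}


def pick_model(vram_gb: float) -> str:
    """Return the largest model whose approximate VRAM need fits the budget."""
    fitting = [name for name, (_, need) in WHISPER_MODELS.items() if need <= vram_gb]
    if not fitting:
        raise ValueError("no Whisper model fits in the given memory budget")
    # The dict is ordered smallest to largest, so the last fitting entry is the biggest.
    return fitting[-1]


print(pick_model(4))
```

The returned name can be passed straight to `whisper.load_model`.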
Installing Whisper
pip install openai-whisper
Installing ffmpeg
Whisper relies on the ffmpeg program to decode audio. On Windows it can be installed from PowerShell with Chocolatey:
choco install ffmpeg
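Before running Whisper it can help to confirm that ffmpeg is actually reachable from Python. The following is a small sketch using only the standard library; the helper names `ffmpeg_available` and `ffmpeg_version` are made up for this example.

```python
import shutil
import subprocess


def ffmpeg_available() -> bool:
    """Return True if an ffmpeg executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


def ffmpeg_version() -> str:
    """Return the first line of `ffmpeg -version`, or an empty string if ffmpeg is missing."""
    if not ffmpeg_available():
        return ""
    out = subprocess.run(["ffmpeg", "-version"], capture_output=True, text=True)
    return out.stdout.splitlines()[0] if out.stdout else ""


print("ffmpeg found:", ffmpeg_available())
```

If this prints `False` right after installation, opening a new terminal so that the updated PATH is picked up is usually the first thing to try.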
Installing some other modules
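The post does not list these modules explicitly. Judging from the imports used in the later examples, they are probably `speech_recognition`, `pyttsx3` and `langchain`, plus `pyaudio` for microphone access; that list is an assumption. A quick sketch to see which of them are still missing:

```python
import importlib.util

# Assumed from the imports used later in this post; the original list is not given.
REQUIRED = ["whisper", "speech_recognition", "pyttsx3", "pyaudio", "langchain"]


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("missing:", missing_modules(REQUIRED))
```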
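Test audio file
Chinese audio samples turned out to be hard to find online: they are either paid or do not match their description. I ended up taking an English sample file from a GitHub-hosted page.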
audio-samples.github.io
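Before feeding a downloaded file to Whisper, it can be useful to check its basic properties. This is a minimal sketch using the standard `wave` module; it only handles uncompressed PCM WAV files, and `wav_info` is a hypothetical helper name.

```python
import wave


def wav_info(path: str) -> dict:
    """Read basic properties of a PCM WAV file using the standard library."""
    with wave.open(path, "rb") as w:
        frames = w.getnframes()
        rate = w.getframerate()
        return {
            "channels": w.getnchannels(),
            "sample_rate": rate,
            "sample_width_bytes": w.getsampwidth(),
            "duration_s": frames / rate,
        }
```

For example, `wav_info("E:/yao2024/sample-0.wav")` would report the channel count, sample rate and duration of the sample used in the next section.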
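Speech to text with Whisper
import whisper

print("Start....")
# Load the model; the first call downloads the weights
whisper_model = whisper.load_model("large")
print("Begin...")
# Transcribe the English sample file
result = whisper_model.transcribe("E:/yao2024/sample-0.wav", language='en')
# Join the text of all recognized segments
print(", ".join([i["text"] for i in result["segments"] if i is not None]))
The first run has to download the model weights, which takes a while.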
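Each entry of `result["segments"]` also carries `start` and `end` times in seconds next to its `text`. As a sketch of how these could be shown together, the helpers below format a list of segments with timestamps; `format_timestamp` and `format_segments` are hypothetical names, and the sample data is hand-written in the same shape rather than real Whisper output.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as MM:SS.mmm."""
    total_ms = int(round(seconds * 1000))
    minutes, rest_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rest_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def format_segments(segments) -> str:
    """Render one line per segment as '[start -> end] text'."""
    lines = []
    for seg in segments:
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        lines.append(f"[{start} -> {end}] {seg['text'].strip()}")
    return "\n".join(lines)


# Hand-written sample in the same shape as result["segments"]
sample = [
    {"start": 0.0, "end": 2.5, "text": " Hello there."},
    {"start": 2.5, "end": 65.25, "text": " This is a test."},
]
print(format_segments(sample))
```

With a real transcription, `format_segments(result["segments"])` would be the call to use.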
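Langchain voice assistant
Langchain has a voice assistant chain. It uses the pyttsx3 and speech_recognition libraries to convert text to speech and speech to text, respectively.
speech_recognition is a speech recognition library that can call several recognition engines and APIs, including: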
- CMU Sphinx (works offline)
- Google Speech Recognition
- Google Cloud Speech API
- Wit.ai
- Microsoft Azure Speech
- Microsoft Bing Voice Recognition (Deprecated)
- Houndify API
- IBM Speech to Text
- Snowboy Hotword Detection (works offline)
- Tensorflow
- Vosk API (works offline)
- OpenAI whisper (works offline)
- Whisper API
We chose the offline OpenAI whisper option.
Experiments
pyttsx3 experiment
import pyttsx3
# Speak the text aloud
pyttsx3.speak("How are you?")
pyttsx3.speak("I am fine, thank you")
pyttsx3.speak("太行,王屋二山,方七百里,高万仞,本在冀州之南,河阳之北。")
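Whether the Chinese sentence is read properly depends on the voices installed on the system. As a sketch, assuming a voice whose name or id contains some recognizable keyword (the keyword and the stand-in voice names below are made up), a specific voice and speaking rate can be selected through the engine's `voices`, `voice` and `rate` properties:

```python
from types import SimpleNamespace


def pick_voice(voices, keyword: str):
    """Return the id of the first voice whose name or id contains keyword, else None."""
    key = keyword.lower()
    for v in voices:
        if key in (v.name or "").lower() or key in (v.id or "").lower():
            return v.id
    return None


def speak_with_voice(text: str, keyword: str, rate: int = 150) -> None:
    """Speak text with the first installed voice matching keyword, if there is one."""
    import pyttsx3  # imported lazily so pick_voice can be used without audio hardware

    engine = pyttsx3.init()
    voice_id = pick_voice(engine.getProperty("voices"), keyword)
    if voice_id is not None:
        engine.setProperty("voice", voice_id)
    engine.setProperty("rate", rate)  # speaking rate in words per minute
    engine.say(text)
    engine.runAndWait()


# Stand-in voice objects for illustration only; real names differ per system.
fake_voices = [
    SimpleNamespace(id="voice.en", name="English Voice"),
    SimpleNamespace(id="voice.zh", name="Chinese Voice"),
]
print(pick_voice(fake_voices, "chinese"))
```

Printing the `name` and `id` of every entry in `engine.getProperty("voices")` is the easiest way to find out which keyword makes sense on a given machine.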
Conversation program
import speech_recognition as sr
import pyttsx3
from langchain.chat_models import ErnieBotChat
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.memory import ConversationBufferWindowMemory
llm = ErnieBotChat(
    model_name='ERNIE-Bot',
    # Fill in your own credentials; do not publish real keys
    ernie_client_id='YOUR_ERNIE_CLIENT_ID',
    ernie_client_secret='YOUR_ERNIE_CLIENT_SECRET',
    temperature=0.75,
)
template = """Assistant is a large language model trained by OpenAI.
Assistant is designed to be able to assist with a wide range of tasks, from answering simple questions to providing in-depth explanations and discussions on a wide range of topics. As a language model, Assistant is able to generate human-like text based on the input it receives, allowing it to engage in natural-sounding conversations and provide responses that are coherent and relevant to the topic at hand.
Assistant is constantly learning and improving, and its capabilities are constantly evolving. It is able to process and understand large amounts of text, and can use this knowledge to provide accurate and informative responses to a wide range of questions. Additionally, Assistant is able to generate its own text based on the input it receives, allowing it to engage in discussions and provide explanations and descriptions on a wide range of topics.
Overall, Assistant is a powerful tool that can help with a wide range of tasks and provide valuable insights and information on a wide range of topics. Whether you need help with a specific question or just want to have a conversation about a particular topic, Assistant is here to assist.
Assistant is aware that human input is being transcribed from audio and as such there may be some errors in the transcription. It will attempt to account for some words being swapped with similar-sounding words or phrases. Assistant will also keep responses concise, because human attention spans are more limited over the audio channel since it takes time to listen to a response.
{history}
Human: {human_input}
Assistant:"""
prompt = PromptTemplate(
    input_variables=["history", "human_input"],
    template=template
)
chatgpt_chain = LLMChain(
    llm=llm,
    prompt=prompt,
    verbose=True,
    memory=ConversationBufferWindowMemory(k=2),
)
engine = pyttsx3.init()

# Listen to the microphone, recognize the speech and answer it
def listen():
    r = sr.Recognizer()
    with sr.Microphone() as source:
        print('Calibrating...')
        r.adjust_for_ambient_noise(source, duration=10)
        # Optional parameter to adjust microphone sensitivity
        # r.energy_threshold = 200
        r.pause_threshold = 0.5
        print('Okay, go ahead!')
        while True:
            text = ''
            print('Listening...')
            try:
                audio = r.listen(source, timeout=10)
                print('Recognizing...')
                # Run speech recognition with the offline Whisper engine
                text = r.recognize_whisper(audio)
                print(text)
            except Exception as e:
                unrecognized_speech_text = f"Sorry, I didn't catch that. Error: {e}"
                text = unrecognized_speech_text
                print(text)
            # Generate the reply with the language model
            response_text = chatgpt_chain.predict(human_input=text)
            print(response_text)
            # Convert the reply to speech and play it
            engine.say(response_text)
            engine.runAndWait()

listen()
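The `ConversationBufferWindowMemory(k=2)` used above keeps only the last two exchanges and feeds them into the `{history}` slot of the prompt. The idea can be sketched with a bounded deque; `WindowMemory` below is an illustrative stand-in, not LangChain's actual implementation.

```python
from collections import deque


class WindowMemory:
    """Keep only the k most recent (human, ai) exchanges."""

    def __init__(self, k: int = 2):
        # A deque with maxlen drops the oldest exchange automatically.
        self.turns = deque(maxlen=k)

    def save(self, human: str, ai: str) -> None:
        self.turns.append((human, ai))

    def history(self) -> str:
        """Render the remembered exchanges as text for the prompt."""
        return "\n".join(f"Human: {h}\nAI: {a}" for h, a in self.turns)


memory = WindowMemory(k=2)
memory.save("a", "1")
memory.save("b", "2")
memory.save("c", "3")  # the first exchange is dropped here
print(memory.history())
```

A small window keeps the prompt short, which suits a voice assistant, at the cost of the model forgetting anything said more than two turns ago.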
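When I speak English it answers in English, and when I speak Chinese it answers in Chinese, but homophones are recognized poorly. I do not yet know how to improve homophone recognition.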
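One thing that may be worth trying, without any guarantee that it fixes homophones: `recognize_whisper` is called above with its defaults, so a larger model, a fixed language, and a short context hint could be passed instead. The sketch below assumes that `recognize_whisper` accepts `model` and `language` and forwards extra keyword arguments such as `initial_prompt` to Whisper's `transcribe`; `whisper_options` and `recognize` are hypothetical helpers.

```python
def whisper_options(model: str = "medium", language: str = "zh", hint: str = "") -> dict:
    """Build keyword arguments for Recognizer.recognize_whisper."""
    opts = {"model": model, "language": language}
    if hint:
        # Assumption: forwarded to Whisper's transcribe() as context text
        opts["initial_prompt"] = hint
    return opts


def recognize(recognizer, audio, **kwargs):
    """Call recognize_whisper on the given recognizer with the options built above."""
    return recognizer.recognize_whisper(audio, **whisper_options(**kwargs))


print(whisper_options(hint="太行,王屋"))
```

In the `listen` loop this would replace `r.recognize_whisper(audio)` with `recognize(r, audio, hint="...")`, where the hint contains names or terms that are expected to come up in the conversation.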