Building a Voice-Interactive Web Chat AI Application with Chainlit and DashScope

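Preface

This article walks through, hands-on, how to integrate DashScope with Chainlit to build a voice-interactive web chat AI application. The approach is to call two Alibaba Cloud model APIs: the SenseVoice model for speech recognition and the CosyVoice model for speech synthesis. Alibaba has open-sourced both models on GitHub, so you can also deploy SenseVoice and CosyVoice as local services yourself. Worth mentioning: Alibaba Cloud offers another pair of speech APIs, Paraformer for speech recognition and Sambert for speech synthesis. In my opinion they are not as good as SenseVoice and CosyVoice.

I fell into one pit along the way that cost me several days. On Alibaba Cloud, the Paraformer speech recognition API accepts a file path or a file stream directly, whereas the SenseVoice API accepts only a file URL. So I initially preferred Paraformer, but it kept failing to recognize my audio files. At first I suspected that some parameter was wrong, or that the audio file produced from my microphone was faulty. After reading more documentation, I found that the only supported sample rate is 16 kHz, and my audio was not in that format. I therefore switched to the SenseVoice API. Because it accepts only a URL and does not support uploading a local file directly, the audio file first has to be uploaded somewhere such as OSS to produce an accessible URL. I expected this to be slow, but in practice the speed is acceptable.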

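DashScope

DashScope is a model service product from Alibaba Cloud that aims to simplify the use and deployment of artificial intelligence (AI) models. It wraps mainstream large AI models in a standardized form and exposes them through APIs, so developers can easily call these models for inference, training, fine-tuning, and similar operations. The main features of DashScope include:

Rich model selection: DashScope offers many types of models covering natural language processing, computer vision, and other fields, to meet the needs of different scenarios.
Simple integration: through the Python/Java SDK or plain HTTP requests, developers can conveniently integrate AI models into their own applications.
Works out of the box: users can call the API directly to complete complex AI tasks without a deep understanding of the model internals.
Performance optimization: the platform draws on substantial engineering experience in inference optimization and efficient fine-tuning to improve the efficiency and quality of model usage.
Elastic foundation: it addresses slow inference, inefficient fine-tuning, and slow scaling, and provides more flexible service support.
Cost-effective: it offers reasonable billing modes, including a free trial quota, to help developers reduce costs.

In addition, DashScope supports advanced domain-specific features such as speech synthesis and image recognition, which suit a wide range of industries and application scenarios. With DashScope, developers get a capable toolset for speeding up the development and deployment of AI projects.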

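The SenseVoice Speech Recognition Model

SenseVoice is Alibaba's large speech recognition model and an important result of its work in speech technology. Its main features and capabilities are:

High-accuracy multilingual speech recognition: SenseVoice provides high-accuracy speech-to-text and supports recognition in multiple languages, which suits cross-border and cross-cultural communication.
Emotion recognition: beyond conventional speech recognition, SenseVoice can identify the speaker's emotional state, such as happy, sad, or angry, which is very useful for building more human-like dialogue systems.
Audio event detection: SenseVoice can detect specific events in audio, such as applause, laughter, and coughing, a capability with broad prospects in media analysis and intelligent monitoring.
Fast adaptation to different scenarios: through its training techniques and algorithmic optimization, SenseVoice adapts quickly to different environments and use cases and maintains high recognition accuracy even in noisy conditions.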
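Relationship to voice cloning: SenseVoice itself recognizes speech and does not generate it. Learning a specific speaker's voice from a few samples and producing similar-sounding output is the job of CosyVoice (see the next section), which pairs naturally with SenseVoice in personalized voice assistants and virtual anchors.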
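Open source: as a project of Alibaba's Tongyi Lab, SenseVoice has been open-sourced, which promotes technical exchange between academia and industry and gives more developers the chance to study and apply it.
Strong technical backing: SenseVoice builds on Alibaba's technical experience and wide range of application scenarios, which helps its performance and stability.

In summary, SenseVoice is a large speech recognition model that integrates several advanced capabilities and aims to improve human-computer interaction by providing high-quality speech processing.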

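The CosyVoice Speech Synthesis Model

CosyVoice is a multilingual speech synthesis model developed by Alibaba's Tongyi Lab. It combines large-scale pretrained language models with deep learning techniques to provide high-quality, natural, and fluent speech synthesis. Its main features and advantages are:

Multilingual support: CosyVoice supports speech synthesis in multiple languages, including Chinese, English, Japanese, Cantonese, and Korean, to meet the varied needs of users worldwide.
Natural, lifelike speech quality: trained on more than 150,000 hours of multilingual speech data, CosyVoice can generate speech that is nearly indistinguishable from a real person, with a high standard of both articulation and emotional expression.
Fast voice cloning: CosyVoice can learn and reproduce a specific timbre from a supplied audio sample in just a few seconds, so users can easily create personalized speech content.
Emotion and prosody control: CosyVoice allows fine adjustment of the emotional tone and rhythm of the synthesized speech, so it can better fit different application scenarios and content needs.
Open source: CosyVoice has been open-sourced, which gives developers access to the technology and promotes technical exchange between academia and industry.
Flexible deployment options: CosyVoice is available both as a cloud service and for local deployment, so users can choose the option that best fits their needs.
Documentation and support: to help users understand and use CosyVoice, the official project provides detailed tutorials and technical documentation, along with ongoing technical support.

In short, CosyVoice is a capable and flexible speech synthesis tool. It gives individual users a way to create high-quality speech content and gives enterprises and developers an important means of building speech-related applications and services.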

Quick Start

Create a folder, for example "chainlit_chat":

mkdir chainlit_chat

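Change into the chainlit_chat folder and create a Python virtual environment. Python must be installed beforehand, and Chainlit requires Python >= 3.8; installation is not covered here to keep the article short, so look it up if needed. The command is: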

python -m venv .venv

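This step avoids conflicts between third-party Python libraries; if you want the quick version, you can skip it.
.venv is the name of the virtual environment folder and can be customized.

Next, activate the virtual environment you created: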

#linux or mac
source .venv/bin/activate
#windows
.venv\Scripts\activate
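
To confirm that the active interpreter really is the one from the virtual environment and meets the version requirement mentioned above, a small check like the following can help. It is only a sketch; the function names `in_virtualenv` and `meets_min_version` are made up for illustration and are not part of Chainlit or DashScope.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when running inside a venv (prefix differs from the base prefix)."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def meets_min_version(version_info=sys.version_info, minimum=(3, 8)) -> bool:
    """Return True when the interpreter version is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
    print("Python >= 3.8:", meets_min_version())
```

Save it under any file name inside the project folder and run it with `python` after activation; both lines should report True before you go on to install the dependencies.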

Create `requirements.txt` in the project root with the following content:

chainlit
dashscope

Install the dependencies with the following command:

pip install -r requirements.txt

Once Chainlit has been run for the first time, the project root will contain the additional .chainlit and .files folders and a chainlit.md file.

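Writing the Code

This example uses only the APIs of DashScope (灵积), the Alibaba Cloud model service that hosts Tongyi Qianwen (Qwen).

Create a `.env` file in the project root to hold the environment variables, configured as follows: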

DASHSCOPE_API_KEY="sk-api_key"

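DASHSCOPE_API_KEY is the API key for Alibaba Cloud's DashScope service. The code uses the DashScope SDK, so there is no need to configure a base_url; it defaults to Alibaba Cloud's endpoint.
Alibaba Cloud model console: https://dashscope.console.aliyun.com/model
DashScope documentation: https://help.aliyun.com/zh/dashscope/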
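
A missing or placeholder key usually surfaces only as an authentication error deep inside an API call, so it can be worth failing fast at startup. The sketch below assumes the variable has already been loaded into the process environment; `get_dashscope_api_key` is a hypothetical helper, not an SDK function, and the placeholder check simply matches the sample value shown above.

```python
import os


def get_dashscope_api_key(env=None) -> str:
    """Return the API key from the environment, or raise a clear error."""
    env = os.environ if env is None else env
    key = env.get("DASHSCOPE_API_KEY", "").strip()
    # "sk-api_key" is the placeholder used in the sample .env above
    if not key or key == "sk-api_key":
        raise RuntimeError("DASHSCOPE_API_KEY is missing or still a placeholder; set it in .env")
    return key
```

Calling `get_dashscope_api_key()` once near the top of app.py would turn a confusing runtime failure into an immediate, readable message.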

Create app.py in the project root with the following code:

import re
import tempfile
import time
from http import HTTPStatus
from io import BytesIO

import chainlit as cl
import dashscope.file
import requests
from chainlit.element import ElementBased
from dashscope.audio.tts_v2 import SpeechSynthesizer
from dashscope.common.constants import FilePurpose



async def text_to_speech(text: str):
    synthesizer = SpeechSynthesizer(model='cosyvoice-v1', voice='longxiaochun')
    audio = synthesizer.call(text)
    return "", audio


async def speech_to_text(audio_file):
    result = dashscope.Files.upload(file_path=audio_file,
                                    purpose=FilePurpose.assistants)
    file_id = result.output['uploaded_files'][0]['file_id']
    file_res = dashscope.Files.get(file_id)
    task_response = dashscope.audio.asr.Transcription.async_call(
        model='sensevoice-v1',
        file_urls=[file_res.output['url']],
        language_hints=['zh', 'en'],
    )
    transcribe_response = dashscope.audio.asr.Transcription.wait(task=task_response.output.task_id)
    if transcribe_response.status_code == HTTPStatus.OK:
        transcription_url = transcribe_response.output["results"][0]["transcription_url"]
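        # Fetch the transcription result with a GET request
        response = requests.get(transcription_url, timeout=30)
        # Check whether the request succeeded
        if response.status_code == 200:
            # Parse the JSON response body into a dict
            data = response.json()
            print(data)
            print(data['transcripts'][0]['text'])
            text = data['transcripts'][0]['text']
            # Use a regular expression to extract the text between the tags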
            pattern = r'<\|Speech\|>(.*?)<\|/Speech\|>'
            match = re.search(pattern, text, re.DOTALL)

            if match:
                text = match.group(1).strip()
            else:
                text = ''
            return text
        else:
            print(f"Failed to retrieve data: {response.status_code}")
    return "fail"


@cl.step(type="tool", name="AI Q&A")
async def generate_text_answer(message: cl.Message):
    start_time = time.time()
    msg = cl.Message(content="")
    await msg.send()
    messages = [
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': message.content}
    ]
    print('content', message.content)
    stream = dashscope.Generation.call(
        model="qwen-plus",
        messages=messages,
        result_format='message',
        stream=True,
        incremental_output=True
    )
    # Measures only the time needed to start the streaming call, not the full reply
    print(f"Elapsed time: {time.time() - start_time} s")
    for part in stream:
        if token := part.output.choices[0].message.content or "":
            await msg.stream_token(token)
    return msg


@cl.on_message
async def on_message(message: cl.Message):
    msg = await generate_text_answer(message)
    await msg.update()


@cl.on_audio_chunk
async def on_audio_chunk(chunk: cl.AudioChunk):
    if chunk.isStart:
        buffer = BytesIO()
        buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
        # Initialize the session for a new audio stream
        cl.user_session.set("audio_buffer", buffer)
        cl.user_session.set("audio_mime_type", chunk.mimeType)

    # For now, write the chunks to a buffer and transcribe the whole audio at the end
    cl.user_session.get("audio_buffer").write(chunk.data)


@cl.on_audio_end
async def on_audio_end(elements: list[ElementBased]):
    # Get the audio buffer from the session
    audio_buffer: BytesIO = cl.user_session.get("audio_buffer")
    audio_mime_type: str = cl.user_session.get("audio_mime_type")
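    audio_buffer.seek(0)  # Move the file pointer back to the start
    # Write the buffered audio to a temporary file so it can be uploaded
    try:
        with tempfile.NamedTemporaryFile(delete=False, suffix=".webm") as tmpFile:
            tmpFile.write(audio_buffer.read())
            tmpFile_path = tmpFile.name
    except Exception as e:
        print(f"Error processing audio: {e}")
        return  # tmpFile_path is undefined here, so stop instead of continuing

    print('tmpFile_path', tmpFile_path)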
    transcription = await speech_to_text(tmpFile_path)
    input_audio_el = cl.Audio(
        mime=audio_mime_type, path=tmpFile_path, name="",
    )
    message = await cl.Message(
        author="You",
        type="user_message",
        content=transcription,
        elements=[input_audio_el, *elements]
    ).send()
    print('transcription', transcription)
    msg = await generate_text_answer(message)
    output_name, output_audio = await text_to_speech(msg.content)
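    # Assumption: with no explicit format, the tts_v2 SpeechSynthesizer returns MP3 audio, hence audio/mpeg
    output_audio_el = cl.Audio(name=output_name, auto_play=True, mime='audio/mpeg', content=output_audio)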
    msg.elements = [output_audio_el]
    await msg.update()

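Here I use Alibaba Cloud's DashScope SDK service, which is hosted in mainland China.
If you want to persist users' chat history, the audio files generated here also need to be saved, either locally or to OSS or similar storage.
As implemented, a voice question is answered with both text and speech, with the speech set to play automatically; a text question is answered with text only.
The code still has rough edges, for example incomplete exception handling; remember to harden it before deploying to production.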

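Code Walkthrough

This code is a chatbot application built on the chainlit framework. It combines several Alibaba Cloud services, namely speech synthesis (TTS), speech recognition (ASR), and text generation, to implement the full flow from voice input through text processing to voice output. Each part is explained below:

1. Import the required modules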

import re
import tempfile
import time
from http import HTTPStatus
from io import BytesIO

import chainlit as cl
import dashscope.file
import requests
from chainlit.element import ElementBased
from dashscope.audio.tts_v2 import SpeechSynthesizer
from dashscope.common.constants import FilePurpose

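These imports cover regular expressions, temporary files, time measurement, HTTP status codes, in-memory byte streams, HTTP requests, Chainlit elements, and the Alibaba Cloud DashScope modules.

2. Define the `text_to_speech` function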

async def text_to_speech(text: str):
    synthesizer = SpeechSynthesizer(model='cosyvoice-v1', voice='longxiaochun')
    audio = synthesizer.call(text)
    return "", audio

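A SpeechSynthesizer object calls the speech synthesis service to convert text to speech, using the CosyVoice model (cosyvoice-v1) with a specific voice (longxiaochun). The function returns an empty name together with the synthesized audio.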
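
Although `text_to_speech` is declared `async`, the synthesis call inside it appears to be an ordinary synchronous call, so while it runs the event loop cannot serve other users. One common remedy is to push the blocking work onto a thread. The sketch below illustrates the idea with a hypothetical stand-in, `blocking_synthesize`, instead of the real SDK call, so it runs without network access; it uses `run_in_executor`, which is available on Python 3.8.

```python
import asyncio
import time


def blocking_synthesize(text: str) -> bytes:
    """Hypothetical stand-in for a blocking SDK call such as synthesizer.call(text)."""
    time.sleep(0.05)  # simulate network latency
    return text.encode("utf-8")


async def synthesize_without_blocking(text: str) -> bytes:
    """Run the blocking call in the default thread pool so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, blocking_synthesize, text)


async def main() -> bytes:
    # Both requests make progress concurrently because the work runs in threads
    first, second = await asyncio.gather(
        synthesize_without_blocking("hello"),
        synthesize_without_blocking("world"),
    )
    return first + b" " + second
```

Chainlit also ships a `cl.make_async` helper intended for wrapping synchronous functions, which may be the more convenient choice inside a Chainlit app. The same consideration applies to the other blocking calls in this file, such as waiting for the transcription task and iterating over the generation stream.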

3. Define the `speech_to_text` function

async def speech_to_text(audio_file):
    # Upload the audio file to Alibaba Cloud
    result = dashscope.Files.upload(file_path=audio_file, purpose=FilePurpose.assistants)
    file_id = result.output['uploaded_files'][0]['file_id']
    file_res = dashscope.Files.get(file_id)
    # Call the speech-to-text service
    task_response = dashscope.audio.asr.Transcription.async_call(
        model='sensevoice-v1',
        file_urls=[file_res.output['url']],
        language_hints=['zh', 'en'],
    )
    transcribe_response = dashscope.audio.asr.Transcription.wait(task=task_response.output.task_id)
    if transcribe_response.status_code == HTTPStatus.OK:
        transcription_url = transcribe_response.output["results"][0]["transcription_url"]
        response = requests.get(transcription_url)
        if response.status_code == 200:
            data = response.json()
            text = data['transcripts'][0]['text']
            # Use a regular expression to extract the text between the tags
            pattern = r'<\|Speech\|>(.*?)<\|/Speech\|>'
            match = re.search(pattern, text, re.DOTALL)
            if match:
                text = match.group(1).strip()
            else:
                text = ''
            return text
        else:
            print(f"Failed to retrieve data: {response.status_code}")
    return "fail"

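This function uploads the audio file to Alibaba Cloud and calls the speech recognition service to convert the audio to text. If the transcript is retrieved successfully, it returns the recognized text (an empty string when no <|Speech|> tags are found); otherwise it returns "fail".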
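
The tag-stripping step can be tried in isolation. The sketch below reuses the same regular expression as the function above; the helper names and the sample strings are invented for illustration and are not real service output. The second helper is a variation that joins every tagged segment, in case a transcript contains more than one, whereas the original code keeps only the first.

```python
import re

# Same pattern as in speech_to_text: capture the text between the Speech tags
SPEECH_PATTERN = re.compile(r"<\|Speech\|>(.*?)<\|/Speech\|>", re.DOTALL)


def extract_speech_text(raw: str) -> str:
    """Return the first tagged segment, stripped, or an empty string if none is found."""
    match = SPEECH_PATTERN.search(raw)
    return match.group(1).strip() if match else ""


def extract_all_speech_text(raw: str) -> str:
    """Variation: join every tagged segment instead of keeping only the first."""
    return " ".join(segment.strip() for segment in SPEECH_PATTERN.findall(raw))


if __name__ == "__main__":
    sample = "<|Speech|> hello there <|/Speech|><|Speech|> how are you <|/Speech|>"
    print(extract_speech_text(sample))
    print(extract_all_speech_text(sample))
```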

4. Define the `generate_text_answer` function

@cl.step(type="tool", name="AI Q&A")
async def generate_text_answer(message: cl.Message):
    start_time = time.time()
    msg = cl.Message(content="")
    await msg.send()
    messages = [
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': message.content}
    ]
    stream = dashscope.Generation.call(
        model="qwen-plus",
        messages=messages,
        result_format='message',
        stream=True,
        incremental_output=True
    )
    for part in stream:
        if token := part.output.choices[0].message.content or "":
            await msg.stream_token(token)
    return msg

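The Alibaba Cloud text generation model qwen-plus generates a reply from the user's input. The function sends each generated fragment to the user as it arrives, which produces a streaming response.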
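
The loop can forward each part directly to `stream_token` because, as I understand the `incremental_output=True` setting, every streamed part carries only the newly generated text rather than the whole reply so far. The sketch below shows the difference between the two modes using plain strings; both function names are hypothetical and no SDK call is involved.

```python
from typing import Iterable, List


def join_incremental(chunks: Iterable[str]) -> str:
    """Incremental mode: each chunk holds only new text, so concatenation rebuilds the reply."""
    return "".join(chunk for chunk in chunks if chunk)


def deltas_from_cumulative(snapshots: Iterable[str]) -> List[str]:
    """Cumulative mode: each snapshot repeats earlier text, so keep only the new suffix."""
    deltas, seen = [], ""
    for snapshot in snapshots:
        deltas.append(snapshot[len(seen):])
        seen = snapshot
    return deltas


if __name__ == "__main__":
    print(join_incremental(["Hel", "", "lo"]))
    print(deltas_from_cumulative(["Hel", "Hello", "Hello!"]))
```

If the streaming call were made without incremental output, tokens would need the second treatment before being streamed; otherwise the user would see repeated text.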

5. Define the message handler

@cl.on_message
async def on_message(message: cl.Message):
    msg = await generate_text_answer(message)
    await msg.update()

When a user message arrives, generate_text_answer is called to produce the reply, and the message object is then updated.

6. Handle incoming audio data

@cl.on_audio_chunk
async def on_audio_chunk(chunk: cl.AudioChunk):
    if chunk.isStart:
        buffer = BytesIO()
        buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
        cl.user_session.set("audio_buffer", buffer)
        cl.user_session.set("audio_mime_type", chunk.mimeType)

    cl.user_session.get("audio_buffer").write(chunk.data)

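Each incoming audio chunk is written to an in-memory buffer. If the chunk marks the start of an audio stream, the buffer is initialized first.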
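
The buffering logic can be exercised without Chainlit by replacing the session and the chunk object with simple stand-ins. In the sketch below, `FakeChunk` is a hypothetical class that carries only the three fields the handler reads, and a plain dict plays the role of `cl.user_session`.

```python
from dataclasses import dataclass
from io import BytesIO
from typing import Dict, Iterable


@dataclass
class FakeChunk:
    """Hypothetical stand-in for the audio chunk, with the fields the handler uses."""
    isStart: bool
    mimeType: str
    data: bytes


def collect_chunks(chunks: Iterable[FakeChunk], session: Dict[str, object]) -> BytesIO:
    """Replay the handler logic: start a new buffer on the first chunk, then append."""
    for chunk in chunks:
        if chunk.isStart:
            buffer = BytesIO()
            buffer.name = f"input_audio.{chunk.mimeType.split('/')[1]}"
            session["audio_buffer"] = buffer
            session["audio_mime_type"] = chunk.mimeType
        session["audio_buffer"].write(chunk.data)
    return session["audio_buffer"]


if __name__ == "__main__":
    session = {}
    stream = [FakeChunk(True, "audio/webm", b"ab"), FakeChunk(False, "audio/webm", b"cd")]
    print(collect_chunks(stream, session).getvalue())
```

Note that the logic relies on the first chunk having `isStart` set; if it did not, the lookup of the buffer would fail, so a production handler may want to guard against that case.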

7. Finish audio processing

@cl.on_audio_end
async def on_audio_end(elements: list[ElementBased]):
    audio_buffer: BytesIO = cl.user_session.get("audio_buffer")
    audio_mime_type: str = cl.user_session.get("audio_mime_type")
    audio_buffer.seek(0)
    with tempfile.NamedTemporaryFile(delete=False, suffix=".webm") as tmpFile:
        tmpFile.write(audio_buffer.read())
        tmpFile_path = tmpFile.name
    transcription = await speech_to_text(tmpFile_path)
    input_audio_el = cl.Audio(mime=audio_mime_type, path=tmpFile_path, name="")
    message = await cl.Message(
        author="You",
        type="user_message",
        content=transcription,
        elements=[input_audio_el, *elements]
    ).send()
    msg = await generate_text_answer(message)
    output_name, output_audio = await text_to_speech(msg.content)
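    # Assumption: with no explicit format, the tts_v2 SpeechSynthesizer returns MP3 audio, hence audio/mpeg
    output_audio_el = cl.Audio(name=output_name, auto_play=True, mime='audio/mpeg', content=output_audio)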
    msg.elements = [output_audio_el]
    await msg.update()
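
When recording ends, this handler writes the buffered audio to a temporary file, transcribes it, echoes the transcript back as the user's message, generates a reply, and attaches the synthesized speech to it. The handler hard-codes the `.webm` suffix for the temporary file. A possible refinement, sketched below under the assumption that browsers may report other MIME types or append parameters such as `;codecs=opus`, is to derive the suffix from the recorded MIME type. The mapping and both helper names are illustrative only.

```python
import os
import tempfile

# Illustrative mapping only; extend it for the formats your browser actually sends
SUFFIX_BY_MIME = {
    "audio/webm": ".webm",
    "audio/wav": ".wav",
    "audio/mpeg": ".mp3",
    "audio/ogg": ".ogg",
}


def suffix_for(mime_type: str, default: str = ".webm") -> str:
    """Drop any ';codecs=...' parameter, then look up a file suffix for the MIME type."""
    base = mime_type.split(";", 1)[0].strip().lower()
    return SUFFIX_BY_MIME.get(base, default)


def write_temp_audio(data: bytes, mime_type: str) -> str:
    """Write the audio bytes to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix_for(mime_type)) as tmp:
        tmp.write(data)
        return tmp.name


if __name__ == "__main__":
    path = write_temp_audio(b"demo", "audio/webm;codecs=opus")
    print(path)
    os.remove(path)  # files created with delete=False must be cleaned up by the caller
```

Because the file is created with `delete=False`, it stays on disk until it is removed explicitly, which is worth remembering if the application runs for a long time.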