Using the F5-TTS Text-to-Speech Model and Wrapping It in an API

The F5-TTS text-to-speech model

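1. Introduction to F5-TTS

F5-TTS (A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching) is a text-to-speech (TTS) model released on October 8, 2024 by a team from Shanghai Jiao Tong University. It is built on a Diffusion Transformer with ConvNeXt V2 and aims to generate fluent, faithful speech while training and running inference faster. The project also provides a model called E2 TTS, based on a Flat-UNet Transformer, which is a closer reproduction of the model described in the E2 TTS paper. Pretrained models are available on Hugging Face and ModelScope.

In short, F5-TTS combines diffusion models with flow matching to achieve fast training, fast inference, and high-quality speech generation. It ships with a Gradio app and CLI tools, and the project documentation is complete enough to get started quickly.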

GitHub: https://github.com/SWivid/F5-TTS

Paper: https://arxiv.org/abs/2410.06885

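2. Model features

Fast training and inference: F5-TTS trains and runs inference faster than comparable models.

Fluent, realistic speech: flow matching produces speech that is more fluent, more natural, and more faithful to the input text.

Built on a Diffusion Transformer and ConvNeXt V2: the architecture is chosen to improve model performance.

Multi-style/multi-speaker generation: speech can be generated in multiple styles and with multiple speakers.

Gradio app: a graphical interface for inference and fine-tuning.

Voice chat: voice chat is supported through integration with the Qwen2.5-3B-Instruct model.

E2 TTS model included: a closer reproduction of the model from the paper, which helps researchers reproduce the paper's results.

Sway Sampling: an inference-time sampling strategy for flow steps that substantially improves performance.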

3. Installing and using F5-TTS

Environment setup

Create a virtual environment with conda

Create a conda environment with Python 3.10 (virtualenv also works):
conda create -n f5-tts python=3.10
conda activate f5-tts

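Install PyTorch and Torchaudio

Install PyTorch and Torchaudio, choosing the CUDA build that matches your GPU and driver (this example uses CUDA 11.8):
pip install torch==2.3.0+cu118 torchaudio==2.3.0+cu118 --extra-index-url https://download.pytorch.org/whl/cu118

Clone the project and install its dependencies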

git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS
pip install -e .

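4. Inference

The project offers several ways to run inference; two are covered here:

1. Gradio app (web interface)
Run the f5-tts_infer-gradio command to launch the Gradio app. It supports basic TTS, multi-style/multi-speaker generation, and voice chat based on Qwen2.5-3B-Instruct. Use --port and --host to set the port and host, and --share to generate a share link.

# Launch a Gradio app (web interface)
f5-tts_infer-gradio

# Specify the port/host
f5-tts_infer-gradio --port 7860 --host 0.0.0.0

# Launch a share link
f5-tts_infer-gradio --share

2. CLI inference
Run the f5-tts_infer-cli command for command-line inference. Specify the model name (--model), the reference audio path (--ref_audio), the reference text (--ref_text), and the text to generate (--gen_text). Parameters can also be supplied through a configuration file (-c). Multi-voice generation is supported.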
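
As a rough illustration of the CLI options described in this section, the following Python sketch assembles an f5-tts_infer-cli command and runs it with subprocess. The flag names come from the text above; the helper names build_infer_command and run_inference and the model name "F5-TTS" in the usage note are assumptions for illustration, and the accepted model names may differ between F5-TTS versions.

```python
import subprocess


def build_infer_command(model, ref_audio, ref_text, gen_text, config=None):
    """Build the argument list for f5-tts_infer-cli (hypothetical helper)."""
    cmd = [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,
        "--gen_text", gen_text,
    ]
    if config is not None:
        # -c points at a configuration file that supplies parameters
        cmd += ["-c", config]
    return cmd


def run_inference(*args, **kwargs):
    """Run the CLI and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_infer_command(*args, **kwargs), check=True)
```

A call might look like run_inference("F5-TTS", "ref.wav", "reference transcript", "text to speak"). Passing the arguments as a list avoids shell quoting problems when the reference or generated text contains spaces or quotation marks.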

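Downloading the model files may fail with connection timeouts or other network problems.

Model files on Hugging Face:

https://huggingface.co/SWivid/F5-TTS

Mirror site reachable from mainland China:

https://hf-mirror.com/

Method: huggingface-cli

huggingface-cli is the official command-line tool from Hugging Face and has full download support built in.

Install the dependency

pip install -U huggingface_hub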

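Set the environment variable

Linux

export HF_ENDPOINT=https://hf-mirror.com

The export above only lasts for the current shell session. To make the setting permanent, open /etc/environment in a text editor with administrator privileges, using nano or vim. For example, with nano:

sudo nano /etc/environment

Add the environment variable to the file as the following line:

HF_ENDPOINT="https://hf-mirror.com"

Save and close the file. In nano, press Ctrl+X, then Y to confirm, then Enter.
/etc/environment is read at login, so log out and log back in for the change to take effect. Running source /etc/environment in a shell sets the variable without exporting it to child processes such as huggingface-cli, so to use the mirror immediately in the current terminal, run the export command again:

export HF_ENDPOINT=https://hf-mirror.com

HF_ENDPOINT is now set permanently and is loaded automatically at every login.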

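Example: downloading a model

huggingface-cli download --resume-download SWivid/F5-TTS --local-dir /home/x1/F5-TTS/ckpts/

Example: downloading a dataset

huggingface-cli download --repo-type dataset --resume-download wikitext --local-dir wikitext

You can add --local-dir-use-symlinks False to disable symlinks, so that the download directory contains the real files. Recent versions of huggingface_hub deprecate this option, along with --resume-download, because they are no longer needed. See the hf-mirror.com site for a detailed explanation.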
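
The same download can also be scripted from Python. The sketch below is a minimal example, assuming the huggingface_hub package installed earlier; it sets HF_ENDPOINT before importing the library, because the library reads the endpoint when it is loaded. The function names and the default target directory are illustrative assumptions.

```python
import os

MIRROR = "https://hf-mirror.com"


def configure_mirror(endpoint=MIRROR):
    """Point huggingface_hub at a mirror; call before importing the library."""
    os.environ["HF_ENDPOINT"] = endpoint
    return endpoint


def download_model(repo_id="SWivid/F5-TTS", local_dir="ckpts"):
    """Download every file of a model repository into local_dir."""
    configure_mirror()
    # Imported here so that HF_ENDPOINT is already set when the library loads.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling download_model() returns the local path of the downloaded snapshot.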

5. Launching the Gradio app (web interface)

Command

f5-tts_infer-gradio --port 13066 --host 0.0.0.0


6. Writing an inference API service

import io
import os
import re
import tempfile

import soundfile as sf
from flask import Flask, request, jsonify, send_file

from f5_tts.infer.utils_infer import (
    infer_process,
    load_model,
    load_vocoder,
    preprocess_ref_audio_text,
    remove_silence_for_generated_wav,
)
from f5_tts.model import DiT

app = Flask(__name__)

# Paths to model and vocab files
MODEL_PATH = "/home/x1/F5-TTS/ckpts/F5TTS_Base/model_1200000.safetensors"
VOCAB_PATH = "/home/x1/F5-TTS/ckpts/F5TTS_Base/vocab.txt"

# Initialize TTS model and vocoder
F5TTS_ema_model = None
vocoder = load_vocoder()

def load_f5tts_model():
    global F5TTS_ema_model
    if F5TTS_ema_model is None:
        F5TTS_model_cfg = dict(dim=1024, depth=22, heads=16, ff_mult=2, text_dim=512, conv_layers=4)
        F5TTS_ema_model = load_model(DiT, F5TTS_model_cfg, MODEL_PATH, vocab_file=VOCAB_PATH)

load_f5tts_model()

def convert_to_chinese_date(text):
    """Convert digit sequences (such as those in dates) to Chinese numerals."""
    num_map = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
               "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}

    def number_to_chinese(match):
        number = match.group()
        if len(number) == 1:  # single digit
            return num_map[number]
        elif len(number) == 2 and number[0] != "0":  # two-digit number, 10-99
            if number.startswith("1"):  # 10-19 are read without a leading "一"
                return "十" + (num_map[number[1]] if number[1] != "0" else "")
            else:
                return num_map[number[0]] + "十" + (num_map[number[1]] if number[1] != "0" else "")
        else:
            # Three or more digits, or a leading zero such as "08": read digit by digit
            return "".join(num_map[digit] for digit in number)

    # Replace every run of digits (for example in "12月" or "10日") with its Chinese reading
    text = re.sub(r'\d+', number_to_chinese, text)
    return text

@app.route('/generateAudio', methods=['POST'])
def synthesize():
    # Validate and parse input
    if 'gen_text' not in request.form:
        return jsonify({"error": "Missing required parameter: 'gen_text'"}), 400

    gen_text = request.form['gen_text']
    ref_text = request.form.get('ref_text', '')
    ref_audio_path = None
    temp_paths = []  # temporary files to delete once the request is finished

    if 'ref_audio' in request.files:
        # Save uploaded reference audio file to a temporary location
        ref_audio = request.files['ref_audio']
        with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_audio_file:
            ref_audio_path = temp_audio_file.name
        ref_audio.save(ref_audio_path)
        temp_paths.append(ref_audio_path)
    elif 'ref_audio_path' in request.form:
        # Use reference audio path provided in the form
        ref_audio_path = request.form['ref_audio_path']
        if not os.path.exists(ref_audio_path):
            return jsonify({"error": f"File not found: {ref_audio_path}"}), 400

    if not ref_audio_path:
        return jsonify({"error": "Missing required parameter: 'ref_audio' or 'ref_audio_path'"}), 400

    try:
        # Convert dates in gen_text to Chinese format
        gen_text = convert_to_chinese_date(gen_text)

        # Preprocess reference audio and text
        ref_audio_data, ref_text = preprocess_ref_audio_text(ref_audio_path, ref_text)

        # Synthesize speech
        final_wave, final_sample_rate, _ = infer_process(
            ref_audio_data,
            ref_text,
            gen_text,
            F5TTS_ema_model,
            vocoder,
            cross_fade_duration=0.15,
            speed=1.0,
        )

        # Remove silences from generated audio (the helper edits the file in place)
        with tempfile.NamedTemporaryFile(delete=False, suffix=".wav") as temp_generated_audio:
            generated_path = temp_generated_audio.name
        temp_paths.append(generated_path)
        sf.write(generated_path, final_wave, final_sample_rate)
        remove_silence_for_generated_wav(generated_path)
        final_wave, _ = sf.read(generated_path)

        # Convert synthesized audio to bytes
        audio_buffer = io.BytesIO()
        sf.write(audio_buffer, final_wave, final_sample_rate, format='WAV')
        audio_buffer.seek(0)

        return send_file(
            audio_buffer,
            as_attachment=True,
            download_name="synthesized_audio.wav",
            mimetype="audio/wav"
        )

    except Exception as e:
        return jsonify({"error": str(e)}), 500

    finally:
        # Clean up the temporary files created for this request
        for path in temp_paths:
            if os.path.exists(path):
                os.remove(path)


if __name__ == '__main__':
    # debug=True would expose the Werkzeug debugger on 0.0.0.0 and load the model twice via the reloader
    app.run(host='0.0.0.0', port=13666, debug=False)
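
To show how a client might call the /generateAudio endpoint defined above, here is a minimal sketch using only the Python standard library. It sends gen_text, ref_text and a ref_audio file as multipart form data, matching the fields the Flask handler reads. The function names encode_multipart and synthesize, the base URL passed in, and the 300-second timeout are assumptions for illustration.

```python
import urllib.request
import uuid


def encode_multipart(fields, files, boundary=None):
    """Encode text fields and (filename, bytes) files as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    body = bytearray()
    for name, value in fields.items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        body += header.encode("utf-8") + value.encode("utf-8") + b"\r\n"
    for name, (filename, data) in files.items():
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\nContent-Type: audio/wav\r\n\r\n'
        )
        body += header.encode("utf-8") + data + b"\r\n"
    body += f"--{boundary}--\r\n".encode("utf-8")
    return bytes(body), f"multipart/form-data; boundary={boundary}"


def synthesize(base_url, gen_text, ref_text, ref_audio_path, timeout=300):
    """POST the form to /generateAudio and return the WAV bytes of the reply."""
    with open(ref_audio_path, "rb") as f:
        audio = f.read()
    body, content_type = encode_multipart(
        {"gen_text": gen_text, "ref_text": ref_text},
        {"ref_audio": ("ref_audio.wav", audio)},
    )
    req = urllib.request.Request(
        base_url.rstrip("/") + "/generateAudio",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx replies
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.read()
```

With the server running on port 13666, synthesize("http://127.0.0.1:13666", gen_text, ref_text, "ref.wav") would return the audio bytes, which can then be written to a .wav file.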

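Note that the model misreads dates written with Arabic numerals. A date-handling function such as convert_to_chinese_date above works around the problem.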
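
A more targeted date handler could look like the following sketch. It only rewrites full dates of the form YYYY年M月D日 (or 号), reading the year digit by digit and the month and day as quantities, and it tolerates leading zeros such as "01月". The function names and the exact pattern are assumptions for illustration, not part of F5-TTS.

```python
import re

DIGITS = "零一二三四五六七八九"
DATE_PATTERN = re.compile(r"(\d{4})年(\d{1,2})月(\d{1,2})([日号])")


def read_digits(number):
    """Read a number digit by digit, e.g. "2024" -> "二零二四"."""
    return "".join(DIGITS[int(d)] for d in number)


def read_small_number(number):
    """Read an integer from 0 to 99 as a quantity, e.g. "12" -> "十二"."""
    value = int(number)  # int() drops leading zeros, so "08" becomes 8
    if value < 10:
        return DIGITS[value]
    tens, ones = divmod(value, 10)
    prefix = "十" if tens == 1 else DIGITS[tens] + "十"
    return prefix + (DIGITS[ones] if ones else "")


def convert_dates(text):
    """Replace every YYYY年M月D日 date in text with its Chinese reading."""
    def replace(match):
        year, month, day, suffix = match.groups()
        return (
            f"{read_digits(year)}年"
            f"{read_small_number(month)}月"
            f"{read_small_number(day)}{suffix}"
        )

    return DATE_PATTERN.sub(replace, text)
```

Because the pattern matches only complete dates, other numbers in the text are left untouched and can be handled by a separate rule.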

7. Writing a speech synthesis API in Go

Create the file F5_TTSGenerateAudio.go

package service

import (
	"bytes"
	"fmt"
	"io"
	"mime/multipart"
	"net/http"
	"os"
	// Also import the project's own config package; its import path depends on your module name.
)

// GenerateTTSAudio sends the text to the Python TTS service and returns the WAV bytes.
func GenerateTTSAudio(name, generateText string) ([]byte, error) {
	url := config.Conf.TTSVoiceModelUrl + "/generateAudio"

	// Read the voice parameters from the configuration
	refAudioPath := config.Conf.TTSVoiceModelSets[name].ReferAudioPath
	refText := config.Conf.TTSVoiceModelSets[name].ReferText

	// The reference audio is required; an unknown voice name also ends up here
	if refAudioPath == "" {
		return nil, fmt.Errorf("reference audio path not provided")
	}
	fmt.Println("refAudioPath: ", refAudioPath)

	// Buffer that holds the multipart form data
	var requestBody bytes.Buffer
	writer := multipart.NewWriter(&requestBody)

	// Attach the reference audio file to the form
	file, err := os.Open(refAudioPath)
	if err != nil {
		return nil, fmt.Errorf("cannot open reference audio file: %w", err)
	}
	defer file.Close()

	fileWriter, err := writer.CreateFormFile("ref_audio", "ref_audio.wav")
	if err != nil {
		return nil, fmt.Errorf("cannot create form file: %w", err)
	}
	if _, err = io.Copy(fileWriter, file); err != nil {
		return nil, fmt.Errorf("cannot copy file into form: %w", err)
	}

	// Add the reference audio path to the form (if the API needs it)
	/*	if err := writer.WriteField("ref_audio_path", refAudioPath); err != nil {
			return nil, fmt.Errorf("cannot add reference audio path to form: %w", err)
		}
	*/
	// Add the reference text to the form
	if err := writer.WriteField("ref_text", refText); err != nil {
		return nil, fmt.Errorf("cannot add reference text to form: %w", err)
	}
	fmt.Println("ref_text: ", refText)
	// Add the text to synthesize to the form
	if err := writer.WriteField("gen_text", generateText); err != nil {
		return nil, fmt.Errorf("cannot add generation text to form: %w", err)
	}

	// Close the writer to finish the form
	if err := writer.Close(); err != nil {
		return nil, fmt.Errorf("cannot close writer: %w", err)
	}

	// Create the HTTP POST request
	request, err := http.NewRequest("POST", url, &requestBody)
	if err != nil {
		return nil, fmt.Errorf("cannot create HTTP request: %w", err)
	}
	request.Header.Set("Content-Type", writer.FormDataContentType())

	// Send the HTTP request (this client has no timeout; consider setting one in production)
	client := &http.Client{}
	response, err := client.Do(request)
	if err != nil {
		return nil, fmt.Errorf("cannot send HTTP request: %w", err)
	}
	defer response.Body.Close()

	// Check that the HTTP status is OK
	if response.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected HTTP status: %s", response.Status)
	}

	// Read the response body (the audio byte stream)
	audioData, err := io.ReadAll(response.Body)
	if err != nil {
		return nil, fmt.Errorf("cannot read response body: %w", err)
	}

	return audioData, nil
}

Use the Gin framework to wrap the TTS synthesis API, along with endpoints for adding a TTS voice and for listing the names of all TTS voice models

// Request body
type TTSRequest struct {
	Name         string `json:"name" binding:"required"`          // voice name
	GenerateText string `json:"generate_text" binding:"required"` // text to synthesize
}

// Response body
type TTSResponse struct {
	Message string `json:"message"` // status message
	Audio   []byte `json:"audio"`   // generated audio bytes (alternatively, return a file path or other identifier)
}

// TTSHandler is the endpoint that synthesizes audio.
func TTSHandler(c *gin.Context) {
	// Check whether the user is logged in
	_, isLogin := IsUserLoggedIn(c)
	if !isLogin {
		log.Println("user not logged in")
		c.JSON(http.StatusOK, gin.H{"code": 0, "message": "user not logged in"})
		return
	}
	var req TTSRequest

	// Parse the JSON request body
	if err := c.ShouldBindJSON(&req); err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": "invalid request parameters: " + err.Error()})
		return
	}

	// Call GenerateTTSAudio to generate the audio
	audioData, err := GenerateTTSAudio(req.Name, req.GenerateText)
	if err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to generate audio: " + err.Error()})
		return
	}

	// Return the audio with the matching content type
	c.Data(http.StatusOK, "audio/wav", audioData)
}

// SetTTSVoiceUploadFile registers a TTS voice from an uploaded reference audio file.
func SetTTSVoiceUploadFile(c *gin.Context) {
	// Check whether the user is logged in
	_, isLogin := IsUserLoggedIn(c)
	if !isLogin {
		log.Println("user not logged in")
		c.JSON(http.StatusOK, gin.H{"code": 0, "message": "user not logged in"})
		return
	}
	// Get the voice name
	voiceName := c.PostForm("voiceName")
	if voiceName == "" {
		c.JSON(http.StatusBadRequest, gin.H{"error": "voice name is required"})
		return
	}

	// Get the reference text
	referText := c.PostForm("referText")
	if referText == "" {
		c.JSON(http.StatusBadRequest, gin.H{"error": "reference text is required"})
		return
	}

	// Handle the uploaded file
	file, err := c.FormFile("referAudioFile")
	if err != nil {
		c.JSON(http.StatusBadRequest, gin.H{"error": "failed to read file: " + err.Error()})
		return
	}
	saveDir := "/psycheEpic/tts_refer_audio/"
	//saveDir := "./tts_refer_audio/"
	// Build the save path; filepath.Base strips any directory part from the client-supplied name
	savePath := filepath.Join(saveDir, filepath.Base(file.Filename))
	if err := c.SaveUploadedFile(file, savePath); err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to save file: " + err.Error()})
		return
	}

	// Call SetTTSVoiceModelConfig to write the voice into the configuration
	result, err := config.SetTTSVoiceModelConfig(voiceName, savePath, referText)
	if err != nil {
		os.Remove(savePath) // remove the saved file if the config update fails
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to update config: " + err.Error()})
		return
	}

	// Always send a response, including when the result code is not 200
	if result.Code != 200 {
		c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to update config: " + result.Message})
		return
	}
	c.JSON(http.StatusOK, gin.H{"code": 200, "message": "TTS voice set successfully"})
}

// QueryAllTTSVoiceModelNames returns the names of all TTS voice models.
func QueryAllTTSVoiceModelNames(ctx *gin.Context) {
	// Check whether the user is logged in
	_, isLogin := IsUserLoggedIn(ctx)
	if !isLogin {
		log.Println("user not logged in")
		ctx.JSON(http.StatusOK, gin.H{"code": 0, "message": "user not logged in"})
		return
	}
	// Reload configuration before querying
	if err := config.ReloadConfig(); err != nil {
		ctx.JSON(http.StatusInternalServerError, gin.H{"code": 0, "message": "cannot load config: " + err.Error()})
		return
	}
	// Collect the names (map keys) of all TTS voice models.
	// Starting from an empty slice makes the JSON output [] rather than null when there are none.
	voiceModels := config.Conf.TTSVoiceModelSets
	ttsVoiceModelNames := make([]string, 0, len(voiceModels))
	for name := range voiceModels {
		ttsVoiceModelNames = append(ttsVoiceModelNames, name)
	}
	ctx.JSON(http.StatusOK, gin.H{
		"code":    1,
		"message": "queried all TTS voice model names successfully",
		"names":   ttsVoiceModelNames,
	})
}

Register the route in the router

func InitRouter() *gin.Engine {
	router := gin.Default()

	// F5-TTS endpoint
	router.POST("/GenerateAudioTTS", service.TTSHandler)

	return router
}

Define the config package, including a function that sets the TTS voice model parameters and writes them to config.json at runtime

package config

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"os"
	"path/filepath"
	"time"

	"github.com/patrickmn/go-cache"
)

var (
	Conf           *AppConfig
	LoginCacheCode = cache.New(30*time.Minute, 15*time.Second)
)

type AppConfig struct {
	AppName string `json:"app_name"` // project name, e.g. no-name

	DBname   string `json:"db_name"`   // database name, e.g. test_schema
	DBserver string `json:"db_server"` // MySQL host

	Mode           string `json:"mode"`
	Mysql_UserName string `json:"mysql_username"` // MySQL user name, e.g. root
	Mysql_PWD      string `json:"mysql_pwd"`      // MySQL password, e.g. root
	MysqlPort      string `json:"mysql_port"`     // MySQL port

	// Fields for the remaining keys in config.json, and the TTSVoiceModelUrl
	// field read by GenerateTTSAudio, are not shown in this excerpt.
	TTSVoiceModelSets map[string]TTSVoiceModel `json:"tts_voice_model_sets"`
}

// TTSVoiceModel stores the reference audio path and reference text of one voice.
type TTSVoiceModel struct {
	ReferAudioPath string `json:"refer_audio_path"`
	ReferText      string `json:"refer_text"`
}

// InitConfig reads ./config.json and stores the result in Conf.
func InitConfig() *AppConfig {
	fmt.Println("reading config file...")
	conf := AppConfig{}
	Conf = &conf

	file, err := os.Open("./config.json")
	/*
		 var file *os.File
			var err error
			if runtime.GOOS == "linux" {
				file, err = os.Open("./config.json")
			} else {
				file, err = os.Open("src/config.json")
			}
	*/
	if err != nil {
		println("error is :", err.Error())
		return Conf
	}
	defer file.Close()

	if err = json.NewDecoder(file).Decode(&conf); err != nil {
		println("error is :", err.Error())
	}
	return Conf
}

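// ReloadConfig re-reads config.json so that a newly added voice model shows up
// even though an older configuration is still held in memory.
func ReloadConfig() error {
	configFile, err := os.Open("./config.json")
	if err != nil {
		return err
	}
	defer configFile.Close()

	// Decode into a fresh struct and swap the pointer, so entries removed from
	// the file do not linger in the in-memory map.
	fresh := AppConfig{}
	if err = json.NewDecoder(configFile).Decode(&fresh); err != nil {
		return err
	}
	Conf = &fresh
	return nil
}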

type ResponseResult struct {
	Code    int    `json:"code"`
	Message string `json:"message"`
}

// SetTTSVoiceModelConfig sets the parameters of a TTS voice model and writes
// them to config.json at runtime.
func SetTTSVoiceModelConfig(voiceName, audioPath, referText string) (*ResponseResult, error) {
	if voiceName == "" || audioPath == "" || referText == "" {
		return nil, fmt.Errorf("error 402: missing required parameters")
	}

	configFilePath := "./config.json"

	// Open the existing configuration file.
	configFile, err := os.Open(configFilePath)
	if err != nil {
		return nil, fmt.Errorf("error 500: cannot open config file: %w", err)
	}
	defer configFile.Close()

	decoder := json.NewDecoder(configFile)
	cfg := AppConfig{}
	if err = decoder.Decode(&cfg); err != nil {
		return nil, fmt.Errorf("error 401: cannot parse config: %w", err)
	}
	// Alternative: store the path relative to the current working directory
	//relativePath, err := filepath.Rel(".", audioPath)
	//if err != nil {
	//	return nil, fmt.Errorf("error 500: cannot build relative path: %w", err)
	//}
	// Make sure audioPath is stored as an absolute path
	absolutePath, err := filepath.Abs(audioPath)
	if err != nil {
		return nil, fmt.Errorf("error 500: cannot get absolute path: %w", err)
	}

	// Add or update the TTS voice model.
	newModel := TTSVoiceModel{
		ReferAudioPath: absolutePath,
		ReferText:      referText,
	}
	if cfg.TTSVoiceModelSets == nil {
		cfg.TTSVoiceModelSets = make(map[string]TTSVoiceModel)
	}
	cfg.TTSVoiceModelSets[voiceName] = newModel

	// Write the updated configuration back to the file.
	configBytes, err := json.MarshalIndent(cfg, "", "  ")
	if err != nil {
		return nil, fmt.Errorf("error 400: cannot format JSON: %w", err)
	}

	if err = ioutil.WriteFile(configFilePath, configBytes, 0644); err != nil {
		return nil, fmt.Errorf("error 500: cannot write config file: %w", err)
	}

	return &ResponseResult{Code: 200, Message: "config added successfully"}, nil
}
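
The read-modify-write flow of SetTTSVoiceModelConfig can be sketched in Python as follows. This is an illustrative variant rather than a translation of the Go code: it works on the raw JSON, so keys that are not modelled anywhere are preserved, and it writes to a temporary file before replacing config.json, so that a crash in the middle of a write does not leave a truncated file. The function name set_tts_voice is hypothetical.

```python
import json
import os
import tempfile


def set_tts_voice(config_path, voice_name, audio_path, refer_text):
    """Add or update one voice in the config file and return the new config."""
    if not (voice_name and audio_path and refer_text):
        raise ValueError("voice_name, audio_path and refer_text are required")

    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)

    # Create the voice map on first use, then add or overwrite the entry
    voices = config.setdefault("tts_voice_model_sets", {})
    voices[voice_name] = {
        "refer_audio_path": os.path.abspath(audio_path),
        "refer_text": refer_text,
    }

    # Write next to the target file, then swap it in with a single rename
    directory = os.path.dirname(os.path.abspath(config_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(config, tmp, ensure_ascii=False, indent=2)
        os.replace(tmp_path, config_path)
    except Exception:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return config
```

By contrast, decoding into a Go struct and marshalling it back keeps only the fields the struct declares, so AppConfig needs to list every key in config.json for the round trip to be lossless.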

config.json

{
  "app_name": "no-name",
  "db_name": "HuaSoul",
  "db_server": "192.168.191.101",
  "mode": "dev",
  "mysql_username": "mind",
  "mysql_pwd": "mind",
  "mysql_port": "13679",
  "mongo_username": "Mind",
  "mongo_pwd": "123456",
  "mongo_host": "192.168.191.101",
  "mongo_port": "27017",
  "port": "8089",
  "static_path": "/static",
  "timeout10": "10s",
  "redis_server": "192.168.191.101",
  "redis_port": "6379",
  "redis_pwd": "123456",
  "redis_db": "0",
  "tts_voice_model_sets": {
    "MaYun": {
      "refer_audio_path": "D:\\psycheEpic\\tts_refer_audio\\mayun4s.wav",
      "refer_text": "很多人因为看见而相信，但是我们这些人。"
    },
    "女声1": {
      "refer_audio_path": "/psycheEpic/tts_refer_audio/huainvren_1.wav",
      "refer_text": "什么意思啊？什么意思呀？他不播了是吧？"
    }
  }
}