Converting the original .wv1/.wv2 files of the WSJ0 dataset to wav files

news2026/2/13 21:30:35

Contents

- Preparation
- 1. Getting the WSJ0 dataset
- 2. Installing sph2pipe
- 3. Conversion code
- 4. Results

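I recently needed the wsj0-2mix dataset for speech separation experiments, but the copy of wsj0-2mix available from Hung-yi Lee's speech separation tutorial is only partial. After obtaining the complete WSJ0 dataset online, I found that the original audio files have the extension .wv1 or .wv2, while creating wsj0-2mix requires wav files, so they have to be converted with a tool. The tutorials I found online only produced a pile of empty folders and no converted wav files. Having solved the problem, I am sharing the method I used; following the steps below, the conversion completes correctly.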

Preparation

Platform: Windows
Tools:
- python
- sph2pipe
Dataset: WSJ0

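1. Getting the WSJ0 dataset

The dataset can be requested from the official website. If you need it, you can also send me a private message or contact me on QQ: 3280461976.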

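2. Installing sph2pipe

If the methods found online gave you nothing but empty folders, the reason is that sph2pipe was not installed. Follow the steps below.

Download sph2pipe (https://www.ldc.upenn.edu/language-resources/tools/sphere-conversion-tools) and choose version 2.5; version 2.1 no longer runs on Windows.

Configure the environment variable: add the directory that contains sph2pipe.exe to the Path system variable of your computer.
Open the folder you downloaded and you will find an exe file inside.

Verify that sph2pipe runs: open cmd or PowerShell, make sure the current directory contains sph2pipe.exe, and type sph2pipe. If the program responds with its usage message, it is working.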
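The same check can be done from Python before starting a long conversion. This is a minimal sketch and not part of the original procedure: the helper name `find_sph2pipe` and its `extra_dirs` parameter are made up for illustration, and it relies only on `shutil.which` from the standard library.

```python
import os
import shutil


def find_sph2pipe(extra_dirs=()):
    """Return the full path of the sph2pipe executable, or None if it is not found.

    extra_dirs is a hypothetical parameter: directories searched before PATH.
    """
    search_path = os.pathsep.join([*extra_dirs, os.environ.get("PATH", "")])
    return shutil.which("sph2pipe", path=search_path)


if __name__ == "__main__":
    exe = find_sph2pipe()
    if exe is None:
        print("sph2pipe was not found; check the Path system variable")
    else:
        print("sph2pipe found at", exe)
```

If the function returns None, the conversion script cannot work either, so it is worth fixing the Path entry first.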

Now sph2pipe can be used for the format conversion.
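Before converting the whole dataset, it can help to convert a single file first. The sketch below assumes sph2pipe is reachable through Path and uses the same `-f wav` option as the conversion script in this post; the helpers `build_cmd` and `convert_one` are hypothetical names, not part of sph2pipe.

```python
import subprocess


def build_cmd(src, dst, exe="sph2pipe"):
    """Build the argument list for: sph2pipe -f wav <src> <dst>."""
    return [exe, "-f", "wav", src, dst]


def convert_one(src, dst, exe="sph2pipe"):
    """Convert one file; return True only if sph2pipe ran and exited with code 0."""
    try:
        result = subprocess.run(build_cmd(src, dst, exe))
    except FileNotFoundError:
        # The executable itself could not be started (wrong path or not on Path).
        return False
    return result.returncode == 0
```

For example, `convert_one("01to030v.wv1", "01to030v.wav")` would try to convert one file in the current directory (the file name here is only a placeholder). Passing the arguments as a list avoids problems with spaces in paths.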

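3. Conversion code

Below is the conversion code. Here is exactly what you need to change; the third point is especially important!

root_dir: set it to the root directory of the wsj0 copy you downloaded; the code contains an example.
my_path: set it to the directory where the converted wav files should be saved.
cmd (the line that calls sph2pipe -f wav): replace E:\\sph2pipe_v2.5.tar\\sph2pipe_v2.5 with the directory where your sph2pipe.exe is located. Note that the code uses "\\" (double backslashes).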
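The double backslashes matter because a single `\` starts an escape sequence inside an ordinary Python string literal. A short sketch of two equivalent ways to write the same Windows path, using the example path from this post (`ntpath` is used here only so the snippet behaves the same on any operating system):

```python
import ntpath  # Windows path rules, available on every platform

# Two spellings of the same directory.
escaped = "E:\\sph2pipe_v2.5.tar\\sph2pipe_v2.5"  # backslashes doubled
raw = r"E:\sph2pipe_v2.5.tar\sph2pipe_v2.5"       # raw string, no doubling needed
assert escaped == raw

# Joining the directory and the executable name with Windows separators.
exe = ntpath.join(raw, "sph2pipe.exe")
print(exe)  # E:\sph2pipe_v2.5.tar\sph2pipe_v2.5\sph2pipe.exe
```

Either spelling works in the script; what must not appear is a single backslash in a normal (non-raw) string.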

"""
Example:
11-1.1/wsj0/si_tr_s/01t/01to030v.wv1 is converted to wav and
stored in YOUR_PATH/si_tr_s/01t/01to030v.wav
"""
import os
import subprocess

# Root directory of the wsj0 copy you downloaded, e.g. E:\\csr_1_comp_LDC93S6A\\csr_1_comp
root_dir = ""

# Collect the wsj0 data directory of every disc.
disc_dir = []
for list_disc in os.listdir(root_dir):
    # The "text" and "11-13.1" entries do not contain .wv files.
    if list_disc not in ["text", "11-13.1"]:
        disc_dir.append(os.path.join(root_dir, list_disc, "wsj0"))

# Directory where the converted files should be saved.
my_path = ""
os.makedirs(my_path, exist_ok=True)

for list_sub_data in disc_dir:
    for sub_data_dir in os.listdir(list_sub_data):
        # Only process folders whose names start with "si" or "sd".
        if not sub_data_dir.startswith(("si", "sd")):
            continue
        s_dir = os.path.join(my_path, sub_data_dir)
        os.makedirs(s_dir, exist_ok=True)
        datatype_dir = os.path.join(list_sub_data, sub_data_dir)
        for list_spk in os.listdir(datatype_dir):
            spk_dir = os.path.join(s_dir, list_spk)           # output folder of this speaker
            spk_dir_abs = os.path.join(datatype_dir, list_spk)  # source folder of this speaker
            os.makedirs(spk_dir, exist_ok=True)
            for wv_file in os.listdir(spk_dir_abs):
                name, ext = os.path.splitext(wv_file)
                if ext == ".wv1":
                    target_name = name + ".wav"
                elif ext == ".wv2":
                    target_name = name + "_1.wav"
                else:
                    continue
                speech_dir = os.path.join(spk_dir_abs, wv_file)
                target_dir = os.path.join(spk_dir, target_name)
                # Important! The path before "-f wav" must point to the sph2pipe.exe
                # described above. Only change this part to match where your
                # sph2pipe.exe is stored: E:\\sph2pipe_v2.5.tar\\sph2pipe_v2.5
                cmd = ["E:\\sph2pipe_v2.5.tar\\sph2pipe_v2.5\\sph2pipe", "-f", "wav", speech_dir, target_dir]
                # A list of arguments also works when a path contains spaces.
                result = subprocess.run(cmd)
                if result.returncode != 0:
                    print("conversion failed:", speech_dir)
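Since the failure described at the start of this post is a tree of empty folders, a quick check after the run can be useful. The sketch below is an addition and not part of the original script; `summarize_output` is a hypothetical helper that walks the output directory, counts the non-empty .wav files, and lists the leaf folders that contain none.

```python
import os


def summarize_output(out_root):
    """Return (wav_count, empty_dirs) for the directory tree under out_root.

    empty_dirs lists folders without subfolders that hold no non-empty .wav file.
    """
    wav_count = 0
    empty_dirs = []
    for dirpath, dirnames, filenames in os.walk(out_root):
        wavs = [
            f for f in filenames
            if f.lower().endswith(".wav")
            and os.path.getsize(os.path.join(dirpath, f)) > 0
        ]
        wav_count += len(wavs)
        if not dirnames and not wavs:
            empty_dirs.append(dirpath)
    return wav_count, empty_dirs
```

Calling `summarize_output(my_path)` after the conversion should return a wav count greater than zero and, ideally, an empty list of folders. A long list of empty folders suggests that sph2pipe was never actually executed.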