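Contents
- I. Install rpunct
- II. Usage
- III. Error when downloading the model
  - 1. Error details
  - 2. Cause
  - 3. Solution
- IV. Error at runtime
  - 1. Error details
  - 2. Cause
  - 3. Solution
- V. Changing the default cache path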
I. Install rpunct
pip install rpunct
Dependency versions:
langdetect==1.0.9
pandas==1.2.4
simpletransformers==0.61.4
six==1.16.0
torch==1.8.1
GitHub link: https://github.com/Felflare/rpunct
II. Usage
Testing with a string:
from rpunct import RestorePuncts
def main():
    # Instantiating RestorePuncts loads (and, on first use, downloads) the model
    rpunct = RestorePuncts()
    # Input is lowercase, unpunctuated English text
    text = rpunct.punctuate("""in 2018 cornell researchers built a high-powered detector that in combination with an algorithm-driven process called ptychography set a world record
by tripling the resolution of a state-of-the-art electron microscope as successful as it was that approach had a weakness it only worked with ultrathin samples that were
a few atoms thick anything thicker would cause the electrons to scatter in ways that could not be disentangled now a team again led by david muller the samuel b eckert
professor of engineering has bested its own record by a factor of two with an electron microscope pixel array detector empad that incorporates even more sophisticated
3d reconstruction algorithms the resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves""")
    print(text)
if __name__ == "__main__":
    main()
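Note: only English text is supported. The program must include the if __name__ == "__main__" guard, otherwise it raises an error (probably because worker processes re-import the script on Windows).
Reading from a file: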
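from rpunct import RestorePuncts
def main():
    rpunct = RestorePuncts()
    # Specify utf-8 explicitly; see the note below
    with open('./upload/text/eng.txt', 'r', encoding='utf-8') as f:
        text = f.read()
    output_text = rpunct.punctuate(text)
    print(output_text)
if __name__ == "__main__":
    main()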
Note: in testing, the output text came out garbled when utf-8 encoding was not specified.
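This kind of garbling is what you would expect when a UTF-8 file is decoded with a different codec; on Windows, open() falls back to the locale's code page when encoding is omitted. A minimal sketch of the effect, using a temporary file and cp1252 as a stand-in for whatever the local default happens to be. The sample sentence is made up for illustration:

```python
import os
import tempfile

# Hypothetical sample text; the non-ASCII characters are what get garbled.
sample = "Muller’s résumé"

# Write the text as UTF-8, the way the input file is assumed to be saved.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Decoding with the wrong codec (cp1252 here, standing in for a
    # non-UTF-8 locale default) silently produces mojibake.
    with open(path, "r", encoding="cp1252") as f:
        garbled = f.read()

    # Decoding with the codec the file was written in restores the text.
    with open(path, "r", encoding="utf-8") as f:
        restored = f.read()
finally:
    os.remove(path)

print(repr(garbled))
print(repr(restored))
```

Plain ASCII text reads the same under both codecs, so the problem only shows up once the file contains characters such as curly quotes or accented letters.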
Expected output:
In 2018, Cornell researchers built a high-powered detector that, in combination with an algorithm-driven process called Ptychography, set a world record by tripling the
resolution of a state-of-the-art electron microscope. As successful as it was, that approach had a weakness. It only worked with ultrathin samples that were a few atoms
thick. Anything thicker would cause the electrons to scatter in ways that could not be disentangled. Now, a team again led by David Muller, the Samuel B.
Eckert Professor of Engineering, has bested its own record by a factor of two with an Electron microscope pixel array detector empad that incorporates even more
sophisticated 3d reconstruction algorithms. The resolution is so fine-tuned the only blurring that remains is the thermal jiggling of the atoms themselves.
III. Error when downloading the model
1. Error details
OSError: We couldn't connect to 'https://huggingface.co' to load this file,
couldn't find it in the cached files and it looks like felflare/bert-restore-punctuation is not the path to a directory containing a file named config.json.
Checkout your internet connection or see how to run the library in offline mode at 'https://huggingface.co/docs/transformers/installation#offline-mode'.
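2. Cause
The target site (huggingface.co) is unreachable, so the model files cannot be downloaded.
3. Solution
Method 1:
Locate the punctuate.py file in the package installation directory, e.g. D:\Anaconda\envs\speech\Lib\site-packages\rpunct\punctuate.py. (Adjust the path to match your own installation.)
Alternatively, Ctrl + click on RestorePuncts in the line from rpunct import RestorePuncts to jump straight to the file.
Point the download endpoint at the following mirror: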
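__author__ = "Daulet N."
...
import os
# Set the mirror before the remaining imports run,
# since the endpoint is read when the Hugging Face libraries are imported.
os.environ['HF_ENDPOINT'] = 'https://hf-mirror.com'
import logging
...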
Reference (see the comment section): https://zhuanlan.zhihu.com/p/627688602
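If you would rather not edit the installed package, the same variable can probably be set at the top of your own script instead, as long as that happens before rpunct (and, through it, transformers) is imported. A minimal sketch under that assumption; the helper names use_hf_mirror and load_punctuator are hypothetical, and the mirror URL is the one given above:

```python
import os


def use_hf_mirror(endpoint="https://hf-mirror.com"):
    """Point Hugging Face downloads at a mirror (must run before rpunct is imported)."""
    os.environ["HF_ENDPOINT"] = endpoint
    return endpoint


def load_punctuator():
    """Import rpunct lazily so the environment variable is already in place."""
    from rpunct import RestorePuncts  # deliberately imported here, not at the top
    return RestorePuncts()


# Configure the mirror first; nothing from Hugging Face has been imported yet.
use_hf_mirror()
```

Calling load_punctuator() inside the if __name__ == "__main__" block then gives the same kind of RestorePuncts object used in the usage examples.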
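Method 2:
Load the model offline: download the model yourself, then fill in the path to the local copy.
Locate the punctuate.py file in the package installation directory, e.g. D:\Anaconda\envs\speech\Lib\site-packages\rpunct\punctuate.py. (Adjust the path to match your own installation.)
Modify punctuate.py as follows: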
from transformers import AutoModel
class RestorePuncts:
    def __init__(self, wrds_per_pred=250):
        ...
        self.model = AutoModel.from_pretrained("D:/AnacondaCLI/cache/huggingface/hub/models--felflare--bert-restore-punctuation/snapshots/954108a105ef1f89f08b71c25d6e33bb89cde724")
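Note: in testing, the model could be loaded offline this way, but the program still failed to run. It is unclear whether this is because the method is no longer supported after an update or because the path was filled in incorrectly. One plausible cause is that the rest of RestorePuncts expects self.model to be a simpletransformers NERModel (the class punctuate.py imports) rather than a plain transformers AutoModel.
For more details, see: https://huggingface.co/docs/transformers/installation#offline-mode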
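The local path used above follows the Hugging Face cache layout (models--<org>--<name>/snapshots/<revision>), so a small helper can locate the snapshot directory instead of hard-coding the hash. This is only a sketch: find_snapshot_dir and load_local_ner_model are hypothetical names, the "bert" model type is inferred from the model name, and handing the directory to NERModel assumes that RestorePuncts builds its model through simpletransformers, as the NERModel import mentioned in the runtime-error section suggests. The label list has to be the one already defined in punctuate.py, so it is left as a parameter here.

```python
from pathlib import Path


def find_snapshot_dir(hub_cache, repo_id):
    """Return the newest cached snapshot directory for repo_id, or None if absent."""
    repo_dir = Path(hub_cache) / ("models--" + repo_id.replace("/", "--"))
    snapshots = repo_dir / "snapshots"
    if not snapshots.is_dir():
        return None
    candidates = [p for p in snapshots.iterdir() if p.is_dir()]
    if not candidates:
        return None
    # Several revisions may be cached; pick the most recently modified one.
    return max(candidates, key=lambda p: p.stat().st_mtime)


def load_local_ner_model(snapshot_dir, labels, use_cuda=False):
    """Assumption: NERModel accepts a local directory in place of a hub model name."""
    from simpletransformers.ner import NERModel  # lazy import; needs simpletransformers
    return NERModel("bert", str(snapshot_dir), labels=labels, use_cuda=use_cuda)
```

Whether this actually resolves the failure described in the note has not been verified here; it is one thing to try before giving up on offline loading.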
IV. Error at runtime
1. Error details
ValueError: 'use_cuda' set to True when cuda is unavailable.Make sure CUDA is available or set use_cuda=False.
2. Cause
CUDA is not available. Either make sure CUDA is available, or set use_cuda=False.
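3. Solution
Locate the ner_model.py file in the package installation path, e.g. D:\Anaconda\envs\speech\Lib\site-packages\simpletransformers\ner\ner_model.py. (Adjust the path to match your own installation.)
Quick navigation:
Alternatively, Ctrl + click on RestorePuncts in the line from rpunct import RestorePuncts to jump to punctuate.py, then Ctrl + click on NERModel in the line from simpletransformers.ner import NERModel to jump again.
Press Ctrl + G and enter 114 to go straight to the relevant line (line 114 in the version used here).
Modify the parameter in ner_model.py: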
class NERModel:
    def __init__(
        ...
        # use_cuda=True,
        # Change the default to False
        use_cuda=False,
        ...
    ):
Reference: https://github.com/Felflare/rpunct/issues/1
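Hard-coding False disables the GPU even on machines that have one. A hedged alternative is to decide at runtime and pass the result wherever use_cuda is accepted. The helper name cuda_available is hypothetical; torch.cuda.is_available() is the standard PyTorch check:

```python
def cuda_available():
    """Return True only if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is certainly no usable CUDA backend.
        return False
    return bool(torch.cuda.is_available())


use_cuda = cuda_available()
print("use_cuda =", use_cuda)
```

The resulting value could then replace the literal False, for example as the use_cuda argument of the NERModel constructor shown above.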
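V. Changing the default cache path
Default cache path (Windows): C:\Users\username\.cache\huggingface\hub
Add an environment variable named HUGGINGFACE_HUB_CACHE
and set its value to the path of your choice.
For more details, see: https://huggingface.co/docs/transformers/installation#install-with-conda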
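As an alternative to a system-wide setting, the variable can also be set from Python for a single run, provided this happens before any Hugging Face library is imported. A minimal sketch; set_hub_cache is a hypothetical helper, and D:/hf_cache in the comment is a placeholder rather than a path from the original setup:

```python
import os
from pathlib import Path


def set_hub_cache(cache_dir):
    """Create cache_dir if needed and expose it through HUGGINGFACE_HUB_CACHE."""
    path = Path(cache_dir).expanduser()
    path.mkdir(parents=True, exist_ok=True)
    os.environ["HUGGINGFACE_HUB_CACHE"] = str(path)
    return path


# Example call (placeholder path), to be made before importing rpunct:
# set_hub_cache("D:/hf_cache")
```

Models that were already downloaded to the old location are not moved automatically, so they would either be downloaded again or need to be copied over by hand.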
If rpunct cannot be installed because of dependency version conflicts,
see: https://github.com/samwaterbury/rpunct