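Preface
Automatic punctuation restoration for text with deep learning has received little research attention so far: few open-source projects are known, and detailed write-ups are rarer still. Yet it matters a great deal for applications such as classical Chinese text recognition and speech recognition.
With that in mind, this article begins a look at PaddleNLP-based deep learning for automatic punctuation restoration. It starts with how to extract the general-purpose punctuation model from PaddleSpeech.
Getting started
I. Using punctuation restoration in PaddleSpeech
1. Command line (recommended)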
paddlespeech text --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
Usage:
paddlespeech text --help
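Parameters:
input (required): the raw text.
task: the subtask. Default: punc.
model: the text model type. Default: ernie_linear_p7_wudao.
lang: the model language. Default: zh.
config: configuration file for the text task; if unset, the default configuration shipped with the pretrained model is used. Default: None.
ckpt_path: model parameter file; if unset, the pretrained model is downloaded and used. Default: None.
punc_vocab: punctuation vocabulary file for the punctuation restoration task. Default: None.
device: device to run prediction on. Default: the default device of paddlepaddle on the current system.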
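To make the parameter list concrete, here is a minimal sketch that assembles the argument list for a `paddlespeech text` call. Only `--input` appears in the command above; spelling the other options as `--<parameter name>` is an assumption, and `build_punc_command` is a hypothetical helper, not part of PaddleSpeech. Check the result against `paddlespeech text --help` before relying on it.

```python
import shlex


def build_punc_command(text, task="punc", model="ernie_linear_p7_wudao",
                       lang="zh", config=None, ckpt_path=None,
                       punc_vocab=None, device=None):
    """Return the argv list for a `paddlespeech text` call (hypothetical helper)."""
    argv = ["paddlespeech", "text", "--input", text,
            "--task", task, "--model", model, "--lang", lang]
    # Optional parameters are passed only when set, so the CLI defaults apply.
    # Assumption: each flag is spelled "--<parameter name>".
    optional = {"config": config, "ckpt_path": ckpt_path,
                "punc_vocab": punc_vocab, "device": device}
    for name, value in optional.items():
        if value is not None:
            argv += [f"--{name}", str(value)]
    return argv


argv = build_punc_command("今天的天气真不错啊你下午有空吗我想约你一起去吃饭")
# Print a shell-quoted version of the command for inspection.
print(shlex.join(argv))
```

On a machine with PaddleSpeech installed, the list could be handed to `subprocess.run(argv, check=True)`; building a list rather than a string avoids shell-quoting problems with the input text.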
Output:
[2021-12-14 19:50:22,200] [ INFO] [log.py] [L57] - Text Result: 今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。
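When the command is driven from a script, the punctuated text has to be pulled out of a log line like the one above. A minimal sketch, assuming the result always follows the literal marker `Text Result:` (the only format shown in this article); `extract_text_result` is a hypothetical helper name.

```python
import re

# The log line shown above, used here as sample input.
LOG_LINE = ("[2021-12-14 19:50:22,200] [    INFO] [log.py] [L57] - "
            "Text Result: 今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。")


def extract_text_result(line):
    """Return the text after 'Text Result:', or None if the marker is absent."""
    match = re.search(r"Text Result:\s*(.*)$", line)
    return match.group(1).strip() if match else None


print(extract_text_result(LOG_LINE))
```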
2. Python API
import paddle
from paddlespeech.cli.text import TextExecutor

# Create the executor once; it wraps model loading and inference.
text_executor = TextExecutor()
# Restore punctuation with the default Chinese model.
result = text_executor(
    text='今天的天气真不错啊你下午有空吗我想约你一起去吃饭',
    task='punc',
    model='ernie_linear_p7_wudao',
    lang='zh',
    config=None,
    ckpt_path=None,
    punc_vocab=None,
    device=paddle.get_device())
print('Text Result: \n{}'.format(result))
Output:
Text Result: 今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。
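As shown above, both approaches require installing PaddleSpeech and its dependencies. Very little code is needed, but the dependency footprint is large.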
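For several inputs, the Python API call shown earlier can be wrapped so that one executor object is reused. The sketch below takes the executor as an argument, which also lets it run without the heavy PaddleSpeech dependency; `punctuate_all` and `fake_executor` are hypothetical names, and only the keyword arguments already used in this article are passed on.

```python
def punctuate_all(texts, executor, model="ernie_linear_p7_wudao",
                  lang="zh", device=None):
    """Run punctuation restoration on each text with one shared executor."""
    results = []
    for text in texts:
        kwargs = dict(text=text, task="punc", model=model, lang=lang)
        # Leave out `device` unless given, so the executor's own default applies.
        if device is not None:
            kwargs["device"] = device
        results.append(executor(**kwargs))
    return results


def fake_executor(text, **_):
    # Stand-in for TextExecutor so the sketch runs without PaddleSpeech:
    # it just appends a full stop instead of predicting punctuation.
    return text + "。"


print(punctuate_all(["你好", "下午有空吗"], fake_executor))
```

With PaddleSpeech installed, the stand-in would be replaced by `TextExecutor()` imported from `paddlespeech.cli.text`, as in the API example.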
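II. Extracting the core punctuation prediction code
1. Where the core code lives
As follows:
2. Extracting the code
Locate the code below and set it up on its own
3. Model files
According to model_alias.py, three punctuation prediction models are built in.
Download all three; they are used for prediction later. Download links are at the end of this article.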
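The three model names in the download list at the end of this article all follow the pattern `model-task-lang`. A minimal sketch that composes such a tag and checks it against those three names; treating the tag as exactly these three parts joined by hyphens is an assumption drawn from the names themselves, and `model_tag` is a hypothetical helper.

```python
# Tags copied from the download list at the end of this article.
KNOWN_TAGS = {
    "ernie_linear_p7_wudao-punc-zh",
    "ernie_linear_p3_wudao-punc-zh",
    "ernie_linear_p3_wudao_fast-punc-zh",
}


def model_tag(model="ernie_linear_p7_wudao", task="punc", lang="zh"):
    """Compose a tag as model-task-lang and verify it is one of the three."""
    tag = f"{model}-{task}-{lang}"
    if tag not in KNOWN_TAGS:
        raise ValueError(f"unknown punctuation model tag: {tag}")
    return tag


print(model_tag())
print(model_tag(model="ernie_linear_p3_wudao_fast"))
```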
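4. Code structure after extraction
After extraction, the core code consists of just three .py files; infer.py also needs a few small changes. The extracted code can be downloaded at the end of this article and compared against the PaddleSpeech source.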
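To give a feel for what the prediction step produces, here is a minimal sketch of the final merge, assuming the model labels each character with the punctuation mark that should follow it and that label 0 means "none". The `PUNC_LIST` table and the hand-written labels are illustrative only; they are not taken from the extracted code or from a real model run.

```python
# Illustrative label table (an assumption): index 0 means "no punctuation".
PUNC_LIST = ["", ",", "。", "?", "!"]


def merge_labels(chars, labels, punc_list=PUNC_LIST):
    """Interleave characters with the punctuation predicted after each one."""
    if len(chars) != len(labels):
        raise ValueError("chars and labels must have the same length")
    pieces = []
    for ch, label in zip(chars, labels):
        pieces.append(ch)
        pieces.append(punc_list[label])
    return "".join(pieces)


chars = list("你下午有空吗我想约你一起去吃饭")
labels = [0] * len(chars)
labels[5] = 3   # question mark after "吗"
labels[-1] = 2  # full stop after the last character
print(merge_labels(chars, labels))
```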
5. Testing the extracted code
Add the test model and test code, as follows:
Run test.py; the output is:
Text Result: 今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。
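Done!
That concludes this chapter. The main point was to extract the punctuation model and code from PaddleSpeech so they can run standalone, which makes them easier to integrate into third-party speech recognition systems or other projects.
Punctuation model downloads: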
ernie_linear_p7_wudao-punc-zh
ernie_linear_p3_wudao-punc-zh
ernie_linear_p3_wudao_fast-punc-zh
Extracted code download:
Download link