20230508: Batch-converting DOCX documents to TXT with python3 on Ubuntu 22.04
2023/5/8 16:27
On WIN10, refer to the article below; on Ubuntu 22.04 different packages are required!
https://blog.csdn.net/weixin_46255747/article/details/129961988
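Batch docx-to-txt conversion in Python
The docx documents go in the input directory.
The txt files produced from the docx documents go in the output directory.
This post has 3 steps:
1. Walk through all the docx files in the input directory.
2. Convert each docx file to a txt file.
3. Save the TXT files in the output directory.
0. Installing the python3 packages: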
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ ll *.py
-rwxr--r-- 1 rootroot rootroot 1245 5月 7 20:07 pdf2doc2.py*
-rwxr--r-- 1 rootroot rootroot 1245 5月 7 20:07 'pdf2doc2 - 副本.py'*
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ python pdf2doc2.py
Traceback (most recent call last):
File "pdf2doc2.py", line 3, in <module>
from pdf2docx import Converter
ImportError: No module named pdf2docx
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ python3 pdf2doc2.py
Traceback (most recent call last):
File "/home/rootroot/pdf2doc2.py", line 3, in <module>
from pdf2docx import Converter
ModuleNotFoundError: No module named 'pdf2docx'
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ pip install pdf2docx
Defaulting to user installation because normal site-packages is not writeable
Collecting pdf2docx
Downloading pdf2docx-0.5.6-py3-none-any.whl (148 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 148.4/148.4 KB 475.7 kB/s eta 0:00:00
Collecting opencv-python>=4.5
Downloading opencv_python-4.7.0.72-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (61.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.8/61.8 MB 7.8 MB/s eta 0:00:00
Collecting fonttools>=4.24.0
Downloading fonttools-4.39.3-py3-none-any.whl (1.0 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 14.9 MB/s eta 0:00:00
Requirement already satisfied: numpy>=1.17.2 in ./.local/lib/python3.10/site-packages (from pdf2docx) (1.23.5)
Collecting PyMuPDF>=1.19.0
Downloading PyMuPDF-1.22.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 14.1/14.1 MB 8.5 MB/s eta 0:00:00
Collecting python-docx>=0.8.10
Downloading python-docx-0.8.11.tar.gz (5.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 9.7 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting fire>=0.3.0
Downloading fire-0.5.0.tar.gz (88 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 88.3/88.3 KB 3.1 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Requirement already satisfied: six in /usr/lib/python3/dist-packages (from fire>=0.3.0->pdf2docx) (1.16.0)
Collecting termcolor
Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Collecting lxml>=2.3.2
Downloading lxml-4.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (7.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 5.6 MB/s eta 0:00:00
Building wheels for collected packages: fire, python-docx
Building wheel for fire (setup.py) ... done
Created wheel for fire: filename=fire-0.5.0-py2.py3-none-any.whl size=116951 sha256=94694033221a75c7c45708f1b1f670d3656b47aa32ecdc45d8c6442cdf8541ab
Stored in directory: /home/rootroot/.cache/pip/wheels/90/d4/f7/9404e5db0116bd4d43e5666eaa3e70ab53723e1e3ea40c9a95
Building wheel for python-docx (setup.py) ... done
Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size=184507 sha256=4bf244d8f5006e4c3bf7c9c5990a1731cfce7544749130e0f13997e02f44aa1d
Stored in directory: /home/rootroot/.cache/pip/wheels/80/27/06/837436d4c3bd989b957a91679966f207bfd71d358d63a8194d
Successfully built fire python-docx
Installing collected packages: termcolor, PyMuPDF, opencv-python, lxml, fonttools, python-docx, fire, pdf2docx
Successfully installed PyMuPDF-1.22.2 fire-0.5.0 fonttools-4.39.3 lxml-4.9.2 opencv-python-4.7.0.72 pdf2docx-0.5.6 python-docx-0.8.11 termcolor-2.3.0
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ pip install win32com
Defaulting to user installation because normal site-packages is not writeable
ERROR: Could not find a version that satisfies the requirement win32com (from versions: none)
ERROR: No matching distribution found for win32com
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ pip install pypiwin32
Defaulting to user installation because normal site-packages is not writeable
Collecting pypiwin32
Downloading pypiwin32-223-py3-none-any.whl (1.7 kB)
Downloading pypiwin32-219.zip (4.8 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.8/4.8 MB 4.0 MB/s eta 0:00:00
Preparing metadata (setup.py) ... error
error: subprocess-exited-with-error
× python setup.py egg_info did not run successfully.
│ exit code: 1
╰─> [7 lines of output]
Traceback (most recent call last):
File "<string>", line 2, in <module>
File "<pip-setuptools-caller>", line 34, in <module>
File "/tmp/pip-install-rkpzj2x6/pypiwin32_8a66222935a047f88d26fcc7255f3678/setup.py", line 121
print "Building pywin32", pywin32_version
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
SyntaxError: Missing parentheses in call to 'print'. Did you mean print(...)?
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed
× Encountered error while generating package metadata.
╰─> See above for output.
note: This is an issue with the package mentioned above, not pip.
hint: See above for details.
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ sudo pip install python-docx
[sudo] password for rootroot:
Collecting python-docx
Downloading python-docx-0.8.11.tar.gz (5.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.6/5.6 MB 3.7 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting lxml>=2.3.2
Downloading lxml-4.9.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.manylinux_2_24_x86_64.whl (7.1 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 7.1/7.1 MB 7.4 MB/s eta 0:00:00
Building wheels for collected packages: python-docx
Building wheel for python-docx (setup.py) ... done
Created wheel for python-docx: filename=python_docx-0.8.11-py3-none-any.whl size=184507 sha256=ed4fb4189610b03122beb8714709c37307ccfa92dc1b5ea1e4725240712f5d3c
Stored in directory: /root/.cache/pip/wheels/80/27/06/837436d4c3bd989b957a91679966f207bfd71d358d63a8194d
Successfully built python-docx
Installing collected packages: lxml, python-docx
Successfully installed lxml-4.9.2 python-docx-0.8.11
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ ll *.docx
-rwxr--r-- 1 rootroot rootroot 80786 5月 4 20:56 MIDE-599.google.docx*
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ import docx
Command 'import' not found, but can be installed with:
sudo apt install graphicsmagick-imagemagick-compat # version 1.4+really1.3.38-1ubuntu0.1, or
sudo apt install imagemagick-6.q16 # version 8:6.9.11.60+dfsg-1.3ubuntu0.22.04.3
sudo apt install imagemagick-6.q16hdri # version 8:6.9.11.60+dfsg-1.3ubuntu0.22.04.3
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ pyton3
Command 'pyton3' not found, did you mean:
command 'python3' from deb python3 (3.10.6-1~22.04)
Try: sudo apt install <deb name>
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ python3
Python 3.10.6 (main, Mar 10 2023, 10:55:28) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import docx
>>>
>>> doc = docx.Document('MIDE-599.google.docx')
>>>
>>> docText = '\n'.join([paragraph.text for paragraph in doc.paragraphs])
>>>
>>> print(docText)
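The session above shows that win32com and pypiwin32 cannot be installed on Ubuntu, while python-docx installs cleanly and is imported as docx. A minimal sketch for checking up front whether the needed modules are importable; the helper name missing_modules is hypothetical and only illustrates the idea:

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names from `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]


# "docx" is the import name of the python-docx package installed above.
required = ["docx"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable on", sys.platform)
```

Running this before the conversion scripts gives a clearer message than a ModuleNotFoundError in the middle of a batch.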
1. Walk through all the docx files in the input directory.
input2.py
import os

# Walk the input directory and print the path of every file found.
input_dir = 'input'
for root, dirs, files in os.walk(input_dir):
    for name in files:
        path = os.path.join(root, name)
        print(path)
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ python input2.py
input/SSNI-205.google.docx
input/TEK-072.google.docx
input/TEK-076.google.docx
input/OAE-101.google.docx
input/SIVR-001.google.docx
input/OAE-165.google.docx
input/SSNI-101.google.docx
input/SIVR-012 2.google.docx
input/SIVR-002.google.docx
input/SSNI-009.google.docx
input/SIVR-003.google.docx
input/SIVR-017 2.google.docx
input/SSNI-493.google.docx
input/SNIS-896.google.docx
input/SSNI-409.google.docx
input/SSNI-730.google.docx
input/SIVR-034 1.google.docx
input/SIVR-067 1.google.docx
input/OFJE-189.google.docx
input/SIVR-067 3.google.docx
input/SIVR-044 2.google.docx
input/SSNI-542.google.docx
input/SIVR-034 2.google.docx
input/SIVR-016 2.google.docx
input/SIVR-016 1.google.docx
input/SSNI-229.google.docx
input/SSNI-030.google.docx
input/SSNI-127.google.docx
input/SIVR-033 5.google.docx
input/SIVR-061 1.google.docx
input/SNIS-986.google.docx
input/SIVR-033 2.google.docx
input/SIVR-033 3.google.docx
input/SSNI-516.google.docx
input/SSNI-388.google.docx
input/SSNI-473.google.docx
input/SNIS-872.google.docx
input/SIVR-067 2.google.docx
input/OFJE-139 2.google.docx
input/SNIS-786.google.docx
input/SSNI-674.google.docx
input/SSNI-178.google.docx
input/TEK-083Ö»ÓÐÒôƵ.google.docx
input/SNIS-964.google2.docx
input/SSNI-644.google.docx
input/SSNI-301.google.docx
input/TEK-080.google.docx
input/SIVR-044 1.google.docx
input/SSNI-566.google.docx
input/TEK-071.google.docx
input/TEK-097.google.docx
input/SSNI-279.google.docx
input/SIVR-061 4.google.docx
input/SSNI-344.google.docx
input/SIVR-033 1.google.docx
input/SSNI-618.google.docx
input/SIVR-017 1.google.docx
input/MIDE-599.google.docx
input/SNIS-850 1.google.docx
input/SIVR-061 2.google.docx
input/SSNI-254.google.docx
input/pSSNI-473.google.docx
input/SSNI-589.google.docx
input/SIVR-015 1.google.docx
input/SSNI-432.google.docx
input/SSNI-152.google.docx
input/SIVR-061 3.google.docx
input/SNIS-800.google.docx
input/SSNI-322.google.docx
input/SSNI-077.google.docx
input/SNIS-919.google.docx
input/SSNI-452.google.docx
input/SIVR-033 6.google.docx
input/TEK-073.google.docx
input/TEK-081Ö»ÓÐÒôƵ.google.docx
input/OFJE-139 1.google.docx
input/SNIS-850 2.google.docx
input/SNIS-964.google.docx
input/SIVR-033 4.google.docx
input/SSNI-703.google.docx
input/SIVR-015 2.google.docx
input/TEK-067.google.docx
input/SSNI-054.google.docx
input/SIVR-012 1.google.docx
input/SIVR-017 3.google.docx
input/SIVR-034 3.google.docx
input/TEK-079Ö»ÓÐÒôƵ.google.docx
input/OFJE-236.google.docx
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
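The listing above prints every file under input, whatever its extension. As a sketch of a variant that returns only the docx files, assuming pathlib is acceptable in place of os.walk (the function name find_docx is made up for this example):

```python
from pathlib import Path


def find_docx(input_dir):
    """Return the .docx files under input_dir (recursively), sorted by path."""
    return sorted(
        p for p in Path(input_dir).rglob("*")
        if p.is_file() and p.suffix.lower() == ".docx"
    )
```

Calling find_docx("input") would give a list of Path objects that can be passed straight to the conversion step. Comparing the suffix in lower case also picks up files named .DOCX.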
2. Convert the docx file to a txt file.
docx3.py
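import docx

doc = docx.Document('MIDE-599.google.docx')
docText = '\n'.join([paragraph.text for paragraph in doc.paragraphs])
#print(docText)
# The file is opened in binary mode ("wb"), so write() needs bytes.
f = open("MIDE-599.google.txt", "wb")
#f.write(docText)           # fails with the TypeError shown in the traceback below
#f.write(docText.decode())  # wrong direction: a python3 str has no decode()
f.write(docText.encode())   # str -> bytes (UTF-8 by default)
f.close()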
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$ python3 docx3.py
Traceback (most recent call last):
File "/home/rootroot/docx3.py", line 8, in <module>
f.write(docText)
TypeError: a bytes-like object is required, not 'str'
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
rootroot@rootroot-adol-ADOLBOOK-I421UAY-ADOL14UA:~$
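The traceback above comes from writing a str to a file opened with "wb". A small sketch of the two ways around it, using hypothetical helper names save_text and save_bytes: either open the file in text mode and let Python encode, or keep binary mode and encode explicitly.

```python
def save_text(path, text):
    """Text mode: pass a str, Python encodes it as UTF-8 while writing."""
    # newline="\n" keeps line endings unchanged on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)


def save_bytes(path, text):
    """Binary mode: the str must be encoded to bytes before writing."""
    with open(path, "wb") as f:
        f.write(text.encode("utf-8"))
```

Both functions produce the same bytes on disk; the text-mode version just makes the encoding explicit in the open() call.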
3. Save the TXT files in the output directory.
input4.py
import os
import docx

input_dir = 'input'
output_dir = 'output'
os.makedirs(output_dir, exist_ok=True)  # the original assumed output/ already existed

for root, dirs, files in os.walk(input_dir):
    for name in files:
        portion = os.path.splitext(name)
        if portion[1] == ".docx":
            path_docx = os.path.join(root, name)
            #doc = docx.Document('path_docx')  # wrong: passes the literal string, not the variable
            doc = docx.Document(path_docx)
            docText = '\n'.join([paragraph.text for paragraph in doc.paragraphs])
            newname = portion[0] + ".txt"
            path = os.path.join(output_dir, newname)
            #print(path)
            # Binary mode, so encode the str to bytes (UTF-8) before writing.
            with open(path, "wb") as f:
                f.write(docText.encode())
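The three steps can also be wrapped in functions. The following is only a sketch under a few assumptions: the names docx_to_text and batch_convert are invented here, python-docx is imported lazily so the extraction step can be swapped out, and a file that fails to open is skipped instead of aborting the run (a policy the original script does not have).

```python
from pathlib import Path


def docx_to_text(path):
    """Extract the paragraph text of one docx file with python-docx."""
    import docx  # provided by the python-docx package
    doc = docx.Document(str(path))
    return "\n".join(p.text for p in doc.paragraphs)


def batch_convert(input_dir, output_dir, extract_text=docx_to_text):
    """Convert every .docx under input_dir to a .txt file in output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(Path(input_dir).rglob("*.docx")):
        try:
            text = extract_text(src)
        except Exception as exc:  # assumed policy: report and skip unreadable files
            print(f"skipped {src}: {exc}")
            continue
        dst = out / (src.stem + ".txt")
        dst.write_text(text, encoding="utf-8")
        written.append(dst)
    return written
```

A call such as batch_convert("input", "output") would mirror input4.py. Note that doc.paragraphs only covers body paragraphs, so text inside tables is not included in either version.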
On Ubuntu 22.04 the TXT files are written as UTF-8, whereas on WIN10 the default is ANSI.
The two sets of files cannot be compared directly with BeyondCompare 3.5!
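One way to make the two sets of files comparable is to re-encode one of them. A hedged sketch, assuming that "ANSI" on a Simplified Chinese WIN10 system means code page 936 (Python codec "gbk"); the function name reencode and its parameters are hypothetical:

```python
from pathlib import Path


def reencode(src, dst, src_encoding="utf-8", dst_encoding="gbk"):
    """Rewrite a text file in another encoding, leaving line endings untouched."""
    text = Path(src).read_bytes().decode(src_encoding)
    # Characters that do not exist in the target encoding become "?".
    Path(dst).write_bytes(text.encode(dst_encoding, errors="replace"))
```

Swapping the two encoding arguments converts in the other direction, from the WIN10 files to UTF-8.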
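References:
ubuntu python docx txt
ubuntu python batch docx txt
python ubuntu traverse directory
ubuntu python docx
ubuntu python traverse
how to traverse the files under a folder in python / traversing the files in a folder with python
python change extension
Changing file extensions in Python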
https://blog.csdn.net/weixin_44735393/article/details/119747619
Batch-changing file extensions in python
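import os

dir_path = '/home/下载/'  # directory containing the files
files = os.listdir(dir_path)  # list every file name in the directory
files.sort()  # sort by file name
#print('files', files)
# iterate over the files
for name in files:
    # split into name and extension; splitext keeps dots inside the name,
    # whereas name.split('.')[0] would drop everything after the first dot
    stem, ext = os.path.splitext(name)
    print(stem, ext)
    if ext == '.txt':  # only rename .txt files
        newname = stem + '.tif'  # swap the extension
        print(newname)
        # rename the file on disk
        os.rename(os.path.join(dir_path, name), os.path.join(dir_path, newname))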
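The same renaming can be sketched with pathlib, whose with_suffix replaces only the last extension, so a name like MIDE-599.google.txt keeps its middle part. The function name change_extension is hypothetical:

```python
from pathlib import Path


def change_extension(directory, old_ext, new_ext):
    """Rename every file in directory ending in old_ext so it ends in new_ext."""
    renamed = []
    # sorted() builds the full list first, so renaming does not disturb iteration.
    for p in sorted(Path(directory).iterdir()):
        if p.is_file() and p.suffix == old_ext:
            target = p.with_suffix(new_ext)
            p.rename(target)
            renamed.append(target.name)
    return renamed
```

For example, change_extension("some_dir", ".txt", ".tif") would return the list of new file names.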
http://bjst.net.cn/ask/show-392333.html
https://wenku.baidu.com/view/710331a94593daef5ef7ba0d4a7302768f996f55.html
Changing file extensions in Python
https://blog.csdn.net/faihung/article/details/90516180
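Resolved: TypeError: a bytes-like object is required, not 'str'
Approach
The problem is that python3.5 and Python2.7 differ in how socket return values are decoded:
In python, the bytes and str types can be converted into each other with encode() and decode().
str→bytes: the encode() method. Calling encode() on a str converts it to bytes.
bytes→str: the decode() method. Data read as a byte stream from the network or from disk is bytes; to turn bytes into str, call decode().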
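A short round trip illustrating the two conversions described above. The sample string is made up for the example, and UTF-8 is passed explicitly although it is already the python3 default:

```python
text = "docx 转 txt"            # str
data = text.encode("utf-8")     # str -> bytes
restored = data.decode("utf-8")  # bytes -> str

print(type(data).__name__, type(restored).__name__)
print(restored == text)
```

This is why docText.encode() is needed before writing to a file opened with "wb".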
https://www.zhangshengrong.com/p/281oqB7DNw/
How to read the contents of doc and docx documents with python on Ubuntu
sudo pip install python-docx
https://www.cnblogs.com/vulcat/p/12547027.html
Batch-replacing the contents of .doc files with python
https://blog.csdn.net/wx17343624830/article/details/127425605
Batch operations on Word documents with Python