20230507 Batch-converting DOCX documents to TXT with python3
2023/5/7 20:22
Windows 10 with Python 3.11
# -*- coding: gbk -*-
import os
from pdf2docx import Converter  # unused in this script
from win32com import client as wc  # win32com is provided by the pywin32 package


# Convert one DOCX file to plain text through Word automation
def DocxToTxt(inputFinallyPath, outputFinallyPath):
    wordhandle = wc.Dispatch("Word.Application")
    wordhandle.Visible = 0  # run in the background, no window
    wordhandle.DisplayAlerts = 0  # suppress alerts
    doc = wordhandle.Documents.Open(inputFinallyPath)
    # WdSaveFormat values: 4 = DOS text, 10 = filtered HTML, 16 = default docx, 17 = PDF
    doc.SaveAs(outputFinallyPath, 4)
    doc.Close()  # call the method explicitly


if __name__ == '__main__':
    # Input folder
    inputPath = r'D:\pythonproject\pdf_to_txt\input'
    # Output folder; an absolute path is safest
    outputPath = r'D:\pythonproject\pdf_to_txt\output'
    # List the files in the input folder
    pdfList = os.listdir(inputPath)
    # Convert and save them one by one
    pdf_num = 1
    for li in pdfList:
        print(li)
        inputFinallyPath = os.path.join(inputPath, li)
        li = li.replace('.docx', '.txt')
        outputFinallyPath = os.path.join(outputPath, li)
        DocxToTxt(inputFinallyPath, outputFinallyPath)
        print('第 %d 篇docx已转换为txt' % pdf_num)  # "docx number %d converted to txt"
        pdf_num = pdf_num + 1
    print('共计%d篇docx文章已完全转换为txt' % (pdf_num-1))  # "%d docx files converted in total"
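The loop above hands every entry of the input folder to Word, whatever its type. A minimal sketch of a stricter file selection follows, assuming only .docx files should be converted and that Word's temporary "~$" lock files should be skipped. The helper name plan_conversions is hypothetical and not part of the original script.

```python
from pathlib import Path


def plan_conversions(input_dir, output_dir):
    """Return (source, target) path pairs for every .docx file in input_dir."""
    pairs = []
    for src in sorted(Path(input_dir).iterdir()):
        # Skip sub-folders, non-DOCX files and Word's "~$" lock files
        if not src.is_file():
            continue
        if src.suffix.lower() != ".docx" or src.name.startswith("~$"):
            continue
        # Replace only the final extension, e.g. "SIVR-001.google.docx" -> "SIVR-001.google.txt"
        pairs.append((src, Path(output_dir) / (src.stem + ".txt")))
    return pairs
```

Each pair could then be passed to DocxToTxt as str(source) and str(target). Word generally expects absolute paths, so both folders should be given in absolute form.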
Used Google Translate to translate 88 Japanese DOCX subtitle files into Simplified Chinese!
Microsoft Windows [Version 10.0.19044.2728]
(c) Microsoft Corporation. All rights reserved.
C:\Users\QQ>python3
C:\Users\QQ>python
C:\Users\QQ>python
Python 3.11.3 (tags/v3.11.3:f3909b8, Apr 4 2023, 23:49:59) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> from pdf2docx import Converter
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pdf2docx'
>>>
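Rather than discovering missing modules one import at a time, the check can be scripted. The sketch below uses importlib.util.find_spec from the standard library. The helper missing_packages and the PIP_NAMES table are hypothetical names introduced for illustration. The table records that the win32com module is shipped by the pywin32 package, so the import name and the pip name differ.

```python
import importlib.util

# Import name -> pip package name (they differ for win32com)
PIP_NAMES = {"pdf2docx": "pdf2docx", "win32com": "pywin32"}


def missing_packages(import_names):
    """Return the pip package names for top-level modules that cannot be found."""
    missing = []
    for name in import_names:
        # find_spec returns None when a top-level module is not installed
        if importlib.util.find_spec(name) is None:
            missing.append(PIP_NAMES.get(name, name))
    return missing


if __name__ == "__main__":
    # Prints the packages that still need "pip install", if any
    print(missing_packages(["os", "pdf2docx", "win32com"]))
```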
C:\Users\QQ>pip install pdf2docx
Collecting pdf2docx
Downloading pdf2docx-0.5.6-py3-none-any.whl (148 kB)
---------------------------------------- 148.4/148.4 kB 368.3 kB/s eta 0:00:00
Collecting PyMuPDF>=1.19.0
Downloading PyMuPDF-1.22.2-cp311-cp311-win_amd64.whl (11.7 MB)
---------------------------------------- 11.7/11.7 MB 12.8 MB/s eta 0:00:00
Collecting python-docx>=0.8.10
Downloading python-docx-0.8.11.tar.gz (5.6 MB)
---------------------------------------- 5.6/5.6 MB 1.6 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting fonttools>=4.24.0
Downloading fonttools-4.39.3-py3-none-any.whl (1.0 MB)
---------------------------------------- 1.0/1.0 MB 12.8 MB/s eta 0:00:00
Collecting numpy>=1.17.2
Downloading numpy-1.24.3-cp311-cp311-win_amd64.whl (14.8 MB)
---------------------------------------- 14.8/14.8 MB 21.1 MB/s eta 0:00:00
Collecting opencv-python>=4.5
Downloading opencv_python-4.7.0.72-cp37-abi3-win_amd64.whl (38.2 MB)
---------------------------------------- 38.2/38.2 MB 12.6 MB/s eta 0:00:00
Collecting fire>=0.3.0
Downloading fire-0.5.0.tar.gz (88 kB)
---------------------------------------- 88.3/88.3 kB 4.9 MB/s eta 0:00:00
Preparing metadata (setup.py) ... done
Collecting six
Downloading six-1.16.0-py2.py3-none-any.whl (11 kB)
Collecting termcolor
Downloading termcolor-2.3.0-py3-none-any.whl (6.9 kB)
Collecting lxml>=2.3.2
Downloading lxml-4.9.2-cp311-cp311-win_amd64.whl (3.8 MB)
---------------------------------------- 3.8/3.8 MB 10.0 MB/s eta 0:00:00
Installing collected packages: termcolor, six, PyMuPDF, numpy, lxml, fonttools, python-docx, opencv-python, fire, pdf2docx
WARNING: The script f2py.exe is installed in 'C:\Users\QQ\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
WARNING: The scripts fonttools.exe, pyftmerge.exe, pyftsubset.exe and ttx.exe are installed in 'C:\Users\QQ\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
DEPRECATION: python-docx is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
Running setup.py install for python-docx ... done
DEPRECATION: fire is being installed using the legacy 'setup.py install' method, because it does not have a 'pyproject.toml' and the 'wheel' package is not installed. pip 23.1 will enforce this behaviour change. A possible replacement is to enable the '--use-pep517' option. Discussion can be found at https://github.com/pypa/pip/issues/8559
Running setup.py install for fire ... done
WARNING: The script pdf2docx.exe is installed in 'C:\Users\QQ\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\LocalCache\local-packages\Python311\Scripts' which is not on PATH.
Consider adding this directory to PATH or, if you prefer to suppress this warning, use --no-warn-script-location.
Successfully installed PyMuPDF-1.22.2 fire-0.5.0 fonttools-4.39.3 lxml-4.9.2 numpy-1.24.3 opencv-python-4.7.0.72 pdf2docx-0.5.6 python-docx-0.8.11 six-1.16.0 termcolor-2.3.0
[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: C:\Users\QQ\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
C:\Users\QQ>
C:\Users\QQ>pip install win32com
ERROR: Could not find a version that satisfies the requirement win32com (from versions: none)
ERROR: No matching distribution found for win32com
[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: C:\Users\QQ\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
C:\Users\QQ>
C:\Users\QQ>pip install pypwin32
ERROR: Could not find a version that satisfies the requirement pypwin32 (from versions: none)
ERROR: No matching distribution found for pypwin32
[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: C:\Users\QQ\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
C:\Users\QQ>
C:\Users\QQ>pip install pypiwin32
Collecting pypiwin32
Downloading pypiwin32-223-py3-none-any.whl (1.7 kB)
Collecting pywin32>=223
Downloading pywin32-306-cp311-cp311-win_amd64.whl (9.2 MB)
---------------------------------------- 9.2/9.2 MB 895.2 kB/s eta 0:00:00
Installing collected packages: pywin32, pypiwin32
Successfully installed pypiwin32-223 pywin32-306
[notice] A new release of pip available: 22.3.1 -> 23.1.2
[notice] To update, run: C:\Users\QQ\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.11_qbz5n2kfra8p0\python.exe -m pip install --upgrade pip
C:\Users\QQ>
C:\Users\QQ>
C:\Users\QQ>d:
D:\>dir *.pty
Volume in drive D is DATA
Volume Serial Number is 547F-1046
Directory of D:\
File Not Found
D:\>dir *.py
Volume in drive D is DATA
Volume Serial Number is 547F-1046
Directory of D:\
2023/05/07 19:55 1,221 pdf2doc2.py
1 File(s) 1,221 bytes
0 Dir(s) 195,912,142,848 bytes free
D:\>python pdf2doc2.py
SyntaxError: Non-UTF-8 code starting with '\xd5' in file D:\pdf2doc2.py on line 4, but no encoding declared; see https://peps.python.org/pep-0263/ for details
D:\>
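The error means the script file contains bytes that are not valid UTF-8 and carries no encoding declaration. Python 3 assumes UTF-8 source by default, so a file saved as GBK with Chinese comments fails to parse. Declaring the encoding on the first or second line is one fix. Re-saving the file as UTF-8 is another. The sketch below illustrates the second route on in-memory bytes. The helper names is_utf8 and gbk_to_utf8 are hypothetical, and it assumes the file really is GBK-encoded.

```python
def is_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def gbk_to_utf8(data: bytes) -> bytes:
    """Re-encode GBK bytes as UTF-8 (assumes the input really is GBK)."""
    return data.decode("gbk").encode("utf-8")


if __name__ == "__main__":
    # A Chinese comment line as it would be stored in a GBK-encoded script
    raw = "# 读取docx文件文本内容\n".encode("gbk")
    print(is_utf8(raw))               # the GBK bytes are not valid UTF-8
    print(is_utf8(gbk_to_utf8(raw)))  # after re-encoding they are
```

For a real file, the same functions could be applied to the result of Path(...).read_bytes() and the converted bytes written back with write_bytes().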
D:\>python pdf2doc2.py
File "D:\pdf2doc2.py", line 36
print('共计%d篇docx文章已完全转换为txt' pdf_num-1))
^
SyntaxError: unmatched ')'
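The traceback shows the final print call with the `%` operator and the opening parenthesis of `(pdf_num-1)` missing, which leaves one closing parenthesis unmatched. A small sketch of the corrected expression follows, next to an f-string variant that avoids the operator altogether. The function names are hypothetical and exist only to make the two forms comparable.

```python
def summary_percent(pdf_num):
    # "%" applies the value in parentheses to the %d placeholder
    return '共计%d篇docx文章已完全转换为txt' % (pdf_num - 1)


def summary_fstring(pdf_num):
    # Same text built with an f-string, so there is no operator to forget
    return f'共计{pdf_num - 1}篇docx文章已完全转换为txt'


if __name__ == "__main__":
    # pdf_num is one past the last converted file, as in the script
    print(summary_percent(89))
    print(summary_fstring(89))
```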
D:\>python pdf2doc2.py
MIDE-599.google.docx
第 1 篇docx已转换为txt
OAE-101.google.docx
第 2 篇docx已转换为txt
OAE-165.google.docx
第 3 篇docx已转换为txt
OFJE-139 1.google.docx
第 4 篇docx已转换为txt
OFJE-139 2.google.docx
第 5 篇docx已转换为txt
OFJE-189.google.docx
第 6 篇docx已转换为txt
OFJE-236.google.docx
第 7 篇docx已转换为txt
pSSNI-473.google.docx
第 8 篇docx已转换为txt
SIVR-001.google.docx
第 9 篇docx已转换为txt
SIVR-002.google.docx
第 10 篇docx已转换为txt
SIVR-003.google.docx
第 11 篇docx已转换为txt
SIVR-012 1.google.docx
第 12 篇docx已转换为txt
SIVR-012 2.google.docx
第 13 篇docx已转换为txt
SIVR-015 1.google.docx
第 14 篇docx已转换为txt
SIVR-015 2.google.docx
第 15 篇docx已转换为txt
SIVR-016 1.google.docx
第 16 篇docx已转换为txt
SIVR-016 2.google.docx
第 17 篇docx已转换为txt
SIVR-017 1.google.docx
第 18 篇docx已转换为txt
SIVR-017 2.google.docx
第 19 篇docx已转换为txt
SIVR-017 3.google.docx
第 20 篇docx已转换为txt
SIVR-033 1.google.docx
第 21 篇docx已转换为txt
SIVR-033 2.google.docx
第 22 篇docx已转换为txt
SIVR-033 3.google.docx
第 23 篇docx已转换为txt
SIVR-033 4.google.docx
第 24 篇docx已转换为txt
SIVR-033 5.google.docx
第 25 篇docx已转换为txt
SIVR-033 6.google.docx
第 26 篇docx已转换为txt
SIVR-034 1.google.docx
第 27 篇docx已转换为txt
SIVR-034 2.google.docx
第 28 篇docx已转换为txt
SIVR-034 3.google.docx
第 29 篇docx已转换为txt
SIVR-044 1.google.docx
第 30 篇docx已转换为txt
SIVR-044 2.google.docx
第 31 篇docx已转换为txt
SIVR-061 1.google.docx
第 32 篇docx已转换为txt
SIVR-061 2.google.docx
第 33 篇docx已转换为txt
SIVR-061 3.google.docx
第 34 篇docx已转换为txt
SIVR-061 4.google.docx
第 35 篇docx已转换为txt
SIVR-067 1.google.docx
第 36 篇docx已转换为txt
SIVR-067 2.google.docx
第 37 篇docx已转换为txt
SIVR-067 3.google.docx
第 38 篇docx已转换为txt
SNIS-786.google.docx
第 39 篇docx已转换为txt
SNIS-800.google.docx
第 40 篇docx已转换为txt
SNIS-850 1.google.docx
第 41 篇docx已转换为txt
SNIS-850 2.google.docx
第 42 篇docx已转换为txt
SNIS-872.google.docx
第 43 篇docx已转换为txt
SNIS-896.google.docx
第 44 篇docx已转换为txt
SNIS-919.google.docx
第 45 篇docx已转换为txt
SNIS-964.google.docx
第 46 篇docx已转换为txt
SNIS-964.google2.docx
第 47 篇docx已转换为txt
SNIS-986.google.docx
第 48 篇docx已转换为txt
SSNI-009.google.docx
第 49 篇docx已转换为txt
SSNI-030.google.docx
第 50 篇docx已转换为txt
SSNI-054.google.docx
第 51 篇docx已转换为txt
SSNI-077.google.docx
第 52 篇docx已转换为txt
SSNI-101.google.docx
第 53 篇docx已转换为txt
SSNI-127.google.docx
第 54 篇docx已转换为txt
SSNI-152.google.docx
第 55 篇docx已转换为txt
SSNI-178.google.docx
第 56 篇docx已转换为txt
SSNI-205.google.docx
第 57 篇docx已转换为txt
SSNI-229.google.docx
第 58 篇docx已转换为txt
SSNI-254.google.docx
第 59 篇docx已转换为txt
SSNI-279.google.docx
第 60 篇docx已转换为txt
SSNI-301.google.docx
第 61 篇docx已转换为txt
SSNI-322.google.docx
第 62 篇docx已转换为txt
SSNI-344.google.docx
第 63 篇docx已转换为txt
SSNI-388.google.docx
第 64 篇docx已转换为txt
SSNI-409.google.docx
第 65 篇docx已转换为txt
SSNI-432.google.docx
第 66 篇docx已转换为txt
SSNI-452.google.docx
第 67 篇docx已转换为txt
SSNI-473.google.docx
第 68 篇docx已转换为txt
SSNI-493.google.docx
第 69 篇docx已转换为txt
SSNI-516.google.docx
第 70 篇docx已转换为txt
SSNI-542.google.docx
第 71 篇docx已转换为txt
SSNI-566.google.docx
第 72 篇docx已转换为txt
SSNI-589.google.docx
第 73 篇docx已转换为txt
SSNI-618.google.docx
第 74 篇docx已转换为txt
SSNI-644.google.docx
第 75 篇docx已转换为txt
SSNI-674.google.docx
第 76 篇docx已转换为txt
SSNI-703.google.docx
第 77 篇docx已转换为txt
SSNI-730.google.docx
第 78 篇docx已转换为txt
TEK-067.google.docx
第 79 篇docx已转换为txt
TEK-071.google.docx
第 80 篇docx已转换为txt
TEK-072.google.docx
第 81 篇docx已转换为txt
TEK-073.google.docx
第 82 篇docx已转换为txt
TEK-076.google.docx
第 83 篇docx已转换为txt
TEK-079只有音频.google.docx
第 84 篇docx已转换为txt
TEK-080.google.docx
第 85 篇docx已转换为txt
TEK-081只有音频.google.docx
第 86 篇docx已转换为txt
TEK-083只有音频.google.docx
第 87 篇docx已转换为txt
TEK-097.google.docx
第 88 篇docx已转换为txt
D:\>
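The log reports one line per file, but it does not prove that each TXT file was actually written. A minimal sketch of a follow-up check is given here, assuming each output file has the same name as its input with the extension replaced by .txt. The helper name missing_outputs is hypothetical.

```python
from pathlib import Path


def missing_outputs(input_dir, output_dir):
    """Return the names of .docx inputs that have no matching .txt in output_dir."""
    missing = []
    for src in sorted(Path(input_dir).glob("*.docx")):
        target = Path(output_dir) / (src.stem + ".txt")
        if not target.is_file():
            missing.append(src.name)
    return missing
```

Calling it with the script's input and output folders would return an empty list when every DOCX file has a TXT counterpart.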
References:
python batch convert DOCX TXT
https://blog.csdn.net/weixin_46255747/article/details/129961988
Batch-converting docx to txt with Python
ModuleNotFoundError: No module named 'pdf2docx'
python win32com pip install
https://blog.csdn.net/qq_45662588/article/details/130315080
Fixing the win32com installation on Python 3.9
https://blog.csdn.net/longe20111104/article/details/129754624
Fix for the "pip install win32com" error
pip install pypiwin32
SyntaxError: Non-UTF-8 code starting with '\xd5' in file D:\pdf2doc2.py on line 4, but no encoding d
https://blog.csdn.net/coco_apple/article/details/113437552
SyntaxError: Non-UTF-8 code starting with '\xd5' in file
# -*- coding: gbk -*-