搜索了一下,大致有这些库能将PDF转txt
1. PyPDF/PyPDF2(截止2024.03.28这两个已经合并成了一个)pypdf · PyPI
2. pdfplumber GitHub - jsvine/pdfplumber: Plumb a PDF for detailed information about each char, rectangle, line, et cetera — and easily extract text and tables.
2. PyMuPDF PyMuPDF · PyPI
3. PDFMiner (有5年没更新了,不建议使用)GitHub - euske/pdfminer: Python PDF Parser (Not actively maintained). Check out pdfminer.six.
4. pdftotext (Mac系统没安装成功,故未试用) GitHub - jalan/pdftotext: Simple PDF text extraction
要转txt的PDF有一页内容如下:
其中PyPDF和pdfplumber的代码很相似都用extract_text, PyMuPDF则用get_text:
import pdfplumber
from pypdf import PdfReader
import fitz # PyMuPDF
fname = "26.pdf"
with pdfplumber.open(fname) as pdf:
print(len(pdf.pages))
for page in pdf.pages:
text = page.extract_text()#提取文本
print(text)
with open('1.txt', 'w') as f:
f.write(text)
pdf = PdfReader(fname)
print(len(pdf.pages))
for page in pdf.pages:
text = page.extract_text()
print(text)
with open('2.txt', 'w') as f:
f.write(text)
with fitz.open(fname) as pdf:
text = chr(12).join([page.get_text() for page in pdf])
with open('3.txt', 'w') as f:
f.write(text)
执行结果如下(从左到右分别是pdfplumber/PyPDF/PyMuPDF)
对比发现:
1. pdfplumber未能正确处理分栏
2. PyPDF 未能正确识别换行
综上,选择PyMuPDF用来提取PDF中的文字,做成脚本(pdf2txt.py)内容如下:
#!/usr/bin/env python
"""PDF转txt
Usage::
>>> python pdf2txt.py <pdf>
"""
import os
import sys
from functools import partial
from pathlib import Path
# pip install PyMuPDF
import fitz # type:ignore[import-untyped]
def _get_text(page, remove_header_footer):
clip = None
if remove_header_footer:
height = 50 # 假设页眉页脚的高度为50
rect = page.rect
clip = fitz.Rect(0, height, rect.width, rect.height - height)
return page.get_text(clip=clip)
def pdf2text(fname: str, remove_header_footer=True) -> str:
"""提取PDF文本内容
:param fname: 文件路径
:param remove_header_footer: 是否去除页眉页脚
"""
if "~" in fname:
fname = os.path.expanduser(fname)
get_text = partial(_get_text, remove_header_footer=remove_header_footer)
with fitz.open(fname) as doc: # open document
text = chr(12).join(get_text(page) for page in doc)
return text
def main() -> None:
if not sys.argv[1:]:
if "PYCHARM_HOSTED" not in os.environ:
print(__doc__)
return
fname = input("请输入PDF文件路径:")
else:
fname = sys.argv[1]
text = pdf2text(fname)
new_name = Path(fname).stem + ".txt"
size = Path(new_name).write_bytes(text.encode())
print(f"Save to {new_name} with {size=}")
if __name__ == "__main__": # pragma: no cover
main()