1. Common Python PDF libraries

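Name	Characteristics
PyPDF2	No longer maintained under this name (development continues as pypdf); the PyPDF4 fork has not been updated in a long time. Reads existing PDFs but cannot generate new content
pdfrw	Reads existing PDFs but cannot generate new content; can be combined with ReportLab for writing
ReportLab	Open-source edition of a commercial product; writes PDFs but cannot read them
pikepdf	Reads existing PDFs but cannot generate new content (no text extraction, see 2.3)
pdfplumber	Reads only; cannot write
PyMuPDF	Reads and writes; AGPL-licensed
borb	Pure-Python library; reads and writes; AGPL-licensed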

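The first few libraries lean toward either reading or writing. PyMuPDF and borb do both, but both are released under the AGPL, which is unfriendly to commercial development.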

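Before introducing them, we test how accurately each library extracts text from an existing PDF. pdfrw is skipped for now because I could not find a text-extraction API for it, and ReportLab is skipped because it cannot read PDFs.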

2. Read test

The test file is yz.pdf; every example below extracts the text of its fifth page (index 4).

2.1 PyPDF2: example and result

#!/usr/bin/python
from PyPDF2 import PdfReader
pdf = PdfReader("yz.pdf")
page = pdf.pages[4]
print(page.extract_text())

The content is read correctly, but the layout is lost: each character ends up on its own line.

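When an extractor returns one character per line, the text can be stitched back together in post-processing. The following is a minimal sketch and not part of PyPDF2: `collapse_single_char_lines` is a hypothetical helper, and it assumes that every run of consecutive one-character lines was originally a single line, so genuine line breaks inside such a run are lost.

```python
def collapse_single_char_lines(text: str) -> str:
    """Join runs of one-character lines into a single line.

    Hypothetical helper (not a PyPDF2 API). Assumes consecutive
    single-character lines were originally one line of text.
    """
    merged = []  # finished output lines
    run = []     # characters of the current single-character run
    for line in text.splitlines():
        stripped = line.strip()
        if len(stripped) == 1:
            run.append(stripped)
            continue
        # A blank or multi-character line ends the current run.
        if run:
            merged.append("".join(run))
            run = []
        if stripped:
            merged.append(stripped)
    if run:
        merged.append("".join(run))
    return "\n".join(merged)


# Made-up sample input, shaped like the one-character-per-line output.
sample = "1\n8\n9\n7\n年\n\nhello world\n牌\n匾"
print(collapse_single_char_lines(sample))
```

With the sample above, the blank line separates two runs, so the digits and the two trailing characters are rejoined independently while the multi-character line passes through unchanged.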

2.2 PyPDF4: example and result

from PyPDF4 import PdfFileReader

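# Open the file in a with-block so the handle is closed afterwards
with open('yz.pdf', 'rb') as pdf:
    reader = PdfFileReader(pdf)
    page = reader.getPage(4)  # page 5 (zero-based index)
    print(page.extractText().strip())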

For this file, PyPDF4 prints raw content-stream data rather than readable text.

2.3 pikepdf

The official pikepdf documentation has this to say:

If you guessed that the content streams were the place to look for text inside a PDF – you’d be correct. 
Unfortunately, extracting the text is fairly difficult because content stream actually specifies as a font 
and glyph numbers to use. Sometimes, there is a 1:1 transparent mapping between Unicode numbers and 
glyph numbers, and dump of the content stream will show the text. In general, you cannot rely on there 
being a transparent mapping; in fact, it is perfectly legal for a font to specify no Unicode mapping 
at all, or to use an unconventional mapping (when a PDF contains a subsetted font for example).

We strongly recommend against trying to scrape text from the content stream.

pikepdf does not currently implement text extraction. We recommend pdfminer.six, a read-only 
text extraction tool. If you wish to write PDFs containing text, consider reportlab.

In short, pikepdf does not implement text extraction; its documentation recommends pdfminer.six for reading text and reportlab for writing it.

2.4 pdfplumber: example and result

import pdfplumber

with pdfplumber.open("yz.pdf") as pdf:
    page = pdf.pages[4]
    chars = page.chars
    content = ''
    for char in chars:
        content += char['text']
    print(content)

pdfplumber exposes the page as individual characters; the example above simply concatenates them in the order they are returned.
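Plain concatenation drops all line breaks. As a rough sketch of one way to recover them, the characters can be grouped by their vertical position. The helper name `chars_to_lines` and the `tolerance` value are assumptions made for illustration; the `text`, `top` and `x0` keys are the ones pdfplumber's character dictionaries carry. pdfplumber also offers `page.extract_text()`, which does its own grouping and is probably the better choice in practice.

```python
from typing import Dict, List


def chars_to_lines(chars: List[Dict], tolerance: float = 2.0) -> List[str]:
    """Group character dicts into text lines by vertical position.

    Hypothetical helper. Assumes each dict has "text", "top" and "x0"
    keys; `tolerance` (in PDF points) is an assumed value.
    """
    lines = []  # list of (top_of_first_char, [chars]) pairs
    for ch in sorted(chars, key=lambda c: (c["top"], c["x0"])):
        if lines and abs(ch["top"] - lines[-1][0]) <= tolerance:
            lines[-1][1].append(ch)          # same visual line
        else:
            lines.append((ch["top"], [ch]))  # start a new line
    # Within each line, order characters left to right.
    return [
        "".join(c["text"] for c in sorted(row, key=lambda c: c["x0"]))
        for _, row in lines
    ]


# Made-up character dicts standing in for page.chars.
demo = [
    {"text": "b", "top": 10.2, "x0": 20.0},
    {"text": "a", "top": 10.0, "x0": 10.0},
    {"text": "c", "top": 30.0, "x0": 10.0},
]
print(chars_to_lines(demo))
```

With a real page, the call would be `chars_to_lines(page.chars)` inside the `with pdfplumber.open(...)` block.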

2.5 PyMuPDF: example and result

import fitz
doc = fitz.open("yz.pdf")
page = doc.load_page(4)
text = page.get_text("text")
print(text)

This is the cleanest extraction result of all the libraries tested so far:

$ python e6.py
1897年，在这里，什么都没有发生。
——科罗拉多州伍迪克里克小旅馆墙壁上的牌匾

2.6 borb: example and result

The following is the official example code:

import typing
from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit import SimpleTextExtraction

def main():
    # read the Document
    doc: typing.Optional[Document] = None
    l: SimpleTextExtraction = SimpleTextExtraction()
    with open("yz.pdf", "rb") as in_file_handle:
        doc = PDF.loads(in_file_handle, [l])

    # check whether we have read a Document
    assert doc is not None

    # print the text on page 5 (index 4)
    print(l.get_text()[4])

if __name__ == "__main__":
    main()

  # Error raised while processing fonts
  File "/home/eva/.local/lib/python3.11/site-packages/borb/pdf/canvas/font/composite_font/font_type_0.py", line 86, in character_identifier_to_unicode
    assert encoding_name in ["Identity", "Identity-H"]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
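Because one library can raise on a PDF that others handle without trouble, a side-by-side comparison is easier when each extractor runs in isolation. The sketch below is only an illustration of that idea: `run_extractors` and the stand-in callables are hypothetical, and in a real run each callable would wrap one of the library snippets from this section.

```python
from typing import Callable, Dict


def run_extractors(extractors: Dict[str, Callable[[], str]]) -> Dict[str, str]:
    """Run each extractor and record its text or the error it raised."""
    results = {}
    for name, extract in extractors.items():
        try:
            results[name] = extract()
        except Exception as exc:  # e.g. an AssertionError from font handling
            results[name] = f"<failed: {type(exc).__name__}>"
    return results


def broken() -> str:
    # Stand-in for a library that fails on this PDF's font encoding.
    raise AssertionError("unsupported encoding")


report = run_extractors({"ok": lambda: "some text", "broken": broken})
print(report)
```

Catching `Exception` broadly is deliberate here, since the goal is a comparison report rather than error handling for production code.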

Given these test results, the demonstrations that follow use pdfplumber together with ReportLab.
