Document Parsing Using Large Language Models (With Code)

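Motivation

For years, regular expressions have been my go-to tool for parsing documents, and I am sure the same is true for many other technical people and industries.

Although regular expressions are powerful and work well in some cases, they often struggle with the complexity and variability of real-world documents.

Large language models, on the other hand, offer a more powerful and flexible approach that can handle many kinds of document structure and content.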

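General Workflow of the System

It is always good to have a clear understanding of the main components of the system being built. To keep things simple, let's focus on a research paper processing scenario.

The workflow has three main components: input, processing, and output.
First, the documents (in this case, scientific research papers in PDF format) are submitted for processing.
The first module of the processing component extracts the raw data from each PDF and combines it with a prompt containing the instructions that tell the large language model how to extract the data.
The large language model then uses the prompt to extract all the metadata.
For each PDF, the final result is saved in JSON format and can be used for further analysis.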

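But why bother using an LLM instead of regular expressions?

Regular expressions (regex) have significant limitations when dealing with the complexity of research paper structures. Some of them are listed below:

1. Flexibility in document structure

Regex needs a specific pattern for each document structure and fails when a given document deviates from the expected format.
LLMs can understand and adapt to a variety of document structures, and can identify the relevant information wherever it sits in the document.

2. Context understanding

Regex matches patterns without any understanding of context or meaning.
LLMs have a nuanced understanding of the meaning of each document, which allows them to extract relevant information more accurately.

3. Maintenance and scalability

Regex requires continuous updates as document formats change. Adding support for a new type of information typically means writing an entirely new regular expression.
LLMs can be adapted to new document types with minimal changes to the initial prompt, which makes them more scalable.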
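
To make the first limitation concrete, here is a minimal sketch of a regex that extracts a publication year from one header layout and silently misses it in another. The pattern, the helper name extract_year, and both sample headers are made up for illustration; they are not taken from the papers used in this article.

```python
import re

# Hypothetical pattern: expects the year to appear as "Published: YYYY".
YEAR_PATTERN = re.compile(r"Published:\s*(\d{4})")


def extract_year(header: str):
    """Return the publication year as a string, or None when the pattern does not match."""
    match = YEAR_PATTERN.search(header)
    return match.group(1) if match else None


# Two made-up headers carrying the same information in different layouts.
header_a = "A Sample Paper\nPublished: 2017"
header_b = "A Sample Paper\n(2017) Sample Conference"

year_a = extract_year(header_a)  # the layout the pattern was written for
year_b = extract_year(header_b)  # same information, different layout
print(year_a, year_b)
```

Supporting the second layout means writing and maintaining another pattern, which is the maintenance cost described in point 3.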

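Building the Document Parsing Workflow

The reasons above are enough to justify using LLMs to parse complex documents such as research papers.

The documents used for illustration are:

Attention Is All You Need, from the arXiv website
Performance Study of YOLOv5 and Faster R-CNN for Autonomous Navigation around Non-Cooperative Targets, also from the arXiv website

This section walks through all the steps for building a real-world document parsing system with large language models, and I believe it has the potential to change how you think about AI and its capabilities.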

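If you prefer video, I will be waiting for you on the other side.

Full video tutorial

Structure of the code

The code is structured as follows: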

project
   |
   |---Extract_Metadata_With_Large_Language_Models.ipynb
   |
  data
   |
   |---- extracted_metadata/
   |---- 1706.03762v7.pdf
   |---- 2301.09056v1.pdf
   |---- prompts
           |
           |------ scientific_papers_prompt.txt

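The project folder is the root folder and contains the data folder and the notebook
The data folder holds the two papers above, along with two folders: extracted_metadata and prompts
extracted_metadata is currently empty and will contain the json files
The prompts folder holds the prompt in text format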

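Metadata to extract

We first need a clear idea of the attributes to extract. For simplicity, we focus on six attributes in our scenario.

Paper Title
Publication Year
Authors
Author Contact
Abstract
Summary Abstract

These attributes are then used to define the prompt. Successful parsing of the documents relies on a prompt that clearly explains what each attribute means and in which format the final result should be returned.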

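Scientific research paper:
---
{document}
---

You are an expert in analyzing scientific research papers. Please carefully read the research paper provided above and extract the following key information:

Extract these six (6) properties from the research paper:
- Paper Title: The full title of the research paper
- Publication Year: The year the paper was published
- Authors: The full names of all authors of the paper
- Author Contact: A list of dictionaries, where each dictionary contains the following keys for each author:
  - Name: The full name of the author
  - Institution: The institutional affiliation of the author
  - Email: The email address of the author (if provided)
- Abstract: The full text of the paper's abstract
- Summary Abstract: A concise summary of the abstract in 2-3 sentences, highlighting the key points

Guidelines:
- The extracted information should be factual and accurate to the document.
- Be extremely concise, except for the Abstract, which should be copied in full.
- The extracted entities should be self-contained and easily understood without the rest of the paper.
- If any property is missing from the paper, please leave the field empty rather than guessing.
- For the Summary Abstract, focus on the main objectives, methods, and key findings of the research.
- For Author Contact, create an entry for each author, even if some information is missing. If an email or institution is not provided for an author, leave that field empty in the dictionary.

Answer in JSON format. The JSON should contain 6 keys: "PaperTitle", "PublicationYear", "Authors", "AuthorContact", "Abstract", and "SummaryAbstract". The "AuthorContact" should be a list of dictionaries as described above.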

Six main things happen in the prompt. Let's break them down.

1. Document placeholder

Scientific research paper:
---
{document}
---

The {} notation marks where the full text of the document will be inserted for analysis.
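
As a minimal sketch of how such a placeholder could be filled, the snippet below substitutes the paper text into the template. The helper name fill_prompt and the sample strings are hypothetical; the code later in this article takes a different route and sends the prompt as the system message and the paper text as the user message.

```python
def fill_prompt(template: str, document_text: str) -> str:
    """Insert the paper text where the {document} placeholder sits."""
    # str.replace is used instead of str.format so that any other braces
    # in the template (for example a JSON sample) are left untouched.
    return template.replace("{document}", document_text)


template = "Scientific research paper:\n---\n{document}\n---\n"
filled = fill_prompt(template, "Full text of the paper goes here.")
print(filled)
```

Either approach gives the model both the instructions and the document; the substitution variant simply keeps them in a single message.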

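2. Role assignment

To perform the task better, the model is assigned a role, defined below, which sets the context and instructs the AI to act as an expert in scientific research paper analysis.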

You are an expert in analyzing scientific research papers.

3. Extraction instructions

This section specifies the pieces of information that should be extracted from the document.

Extract these six (6) properties from the research paper:

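4. Attribute definitions

Each of the attributes above is defined here, with details on what information to include and how it should be formatted. For example, Author Contact is a list of dictionaries containing additional details.

5. Guidelines

The guidelines tell the AI which rules to follow during extraction, such as staying accurate and how to handle missing information.

6. Expected output format

This is the final step. It specifies the exact format to use when answering, which is json.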

Answer in JSON format. The JSON should contain 6 keys: ...
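
Since the prompt fixes the six output keys, a small check on the parsed reply can flag incomplete answers early. This is a sketch only: the helper name missing_keys and the sample dictionary are hypothetical, and the article's own code does not include such a check.

```python
# The six keys requested in the prompt's output-format section.
EXPECTED_KEYS = {
    "PaperTitle", "PublicationYear", "Authors",
    "AuthorContact", "Abstract", "SummaryAbstract",
}


def missing_keys(metadata: dict) -> set:
    """Return the expected keys that are absent from the parsed model output."""
    return EXPECTED_KEYS - set(metadata)


# Made-up partial reply, used only to show the check in action.
sample = {"PaperTitle": "A Sample Paper", "Authors": ["A. Author"]}
print(sorted(missing_keys(sample)))
```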

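Libraries

Great, now let's start by installing the necessary libraries.

Our document parsing system is built on several libraries. The main ones for each component are listed below:

PDF processing: pdfminer.six, PyPDF2, and poppler-utils for handling various PDF formats and structures.
Text extraction: unstructured and its dependency packages (unstructured-inference, unstructured-pytesseract) for intelligent content extraction from documents.
OCR capabilities: tesseract-ocr for recognizing text in images or scanned documents.
Image handling: pillow-heif for image processing tasks.
AI integration: the openai library for leveraging GPT models in our information extraction process.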

%%bash

# Python packages
pip -qqq install pdfminer.six
pip -qqq install pillow-heif==0.3.2
pip -qqq install matplotlib
pip -qqq install unstructured-inference
pip -qqq install unstructured-pytesseract
pip -qqq install tesseract-ocr
pip -qqq install unstructured
pip -qqq install openai
pip -qqq install PyPDF2

# System packages: refresh the package index first, then install
sudo apt-get update
apt install -V tesseract-ocr
apt install -V libtesseract-dev
apt-get install -V poppler-utils

Once the installation succeeds, the imports are as follows:

import os
import re
import json
import openai
from pathlib import Path
from openai import OpenAI
from PyPDF2 import PdfReader
from google.colab import userdata
from unstructured.partition.pdf import partition_pdf
from tenacity import retry, wait_random_exponential, stop_after_attempt

Setting up credentials

Before diving into the core functionality, we need to set up our environment with the necessary API credentials.

OPENAI_API_KEY = userdata.get('OPEN_AI_KEY')
model_ID = userdata.get('GPT_MODEL')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

client = OpenAI(api_key = OPENAI_API_KEY)

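Here, we use the userdata.get() function to securely access the credentials in Google Colab.
We retrieve the specific GPT model ID we want to use, which is gpt-4o in our use case.

Keeping the credentials in Colab secrets and exposing the key through an environment variable avoids hard-coding them in the notebook, while keeping the choice of model flexible.

It is also a better way to manage API keys and models, especially when working across different environments or multiple projects.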
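
Outside Google Colab, userdata is not available, so the same settings could be read from environment variables instead. The sketch below assumes that approach; the helper name load_setting is hypothetical, and the variable names OPENAI_API_KEY and GPT_MODEL are assumptions about how the environment was configured.

```python
import os


def load_setting(name: str, default=None):
    """Read a setting from the environment, failing loudly if it is missing."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError(f"Missing required setting: {name}")
    return value


# Hypothetical usage: the API key must be exported beforehand, while the
# model ID falls back to gpt-4o, the model used in this article.
# api_key = load_setting("OPENAI_API_KEY")
model_id = load_setting("GPT_MODEL", default="gpt-4o")
print(model_id)
```

Failing with a clear error when the key is missing tends to be easier to debug than an authentication error raised later by the API client.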

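Workflow implementation

We now have all the resources needed to build the end-to-end workflow efficiently. It is time to start the technical implementation of each workflow component, beginning with the data processing helper functions.

Data processing

The first step in our workflow is to preprocess the PDF files and extract their text content, which is done by the extract_text_from_pdf function.

It takes a PDF file as input and returns its content as raw text data.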

def extract_text_from_pdf(pdf_path: str):
    """
    Extract text content from a PDF file using the unstructured library.
    """
    elements = partition_pdf(pdf_path, strategy="hi_res")
    return "\n".join([str(element) for element in elements])

Prompt reader

The prompt is stored in a separate .txt file and loaded with the following function.

def read_prompt(prompt_path: str):
    """
    Read the prompt for research paper parsing from a text file.
    """
    with open(prompt_path, "r") as f:
        return f.read()

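Metadata extraction

This function is really the core of our workflow. It uses the OpenAI API to process the content of a given PDF file.

Without the @retry decorator, we may run into the Error Code 429 - Rate limit reached for requests issue. It mainly occurs when we hit the rate limit during processing. Instead of failing right away, we want the function to keep retrying until it succeeds, here for up to 10 attempts.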

@retry(wait=wait_random_exponential(min=1, max=120), stop=stop_after_attempt(10))
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

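By using completion_with_backoff in our extract_metadata function:

It waits between 1 and 120 seconds before re-running a failed API call.
The wait time tends to increase with each retry but always stays within the 1 to 120 second range.
This process is known as exponential backoff, and it helps manage API rate limits as well as temporary issues.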
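
To illustrate the idea behind these three points, here is a simplified, standalone sketch of a randomized exponential backoff delay. It is not tenacity's actual implementation: the function name backoff_delay and the doubling rule are assumptions chosen to mirror the 1 to 120 second range and the 10 attempts used above.

```python
import random


def backoff_delay(attempt: int, min_wait: float = 1.0, max_wait: float = 120.0) -> float:
    """Pick a random delay whose upper bound doubles with each attempt, capped at max_wait."""
    # The upper bound grows as 2, 4, 8, ... seconds and is clamped to [min_wait, max_wait].
    upper = max(min_wait, min(max_wait, 2.0 ** attempt))
    # Randomizing the delay keeps many clients from retrying at the same instant.
    return random.uniform(min_wait, upper)


for attempt in range(1, 11):  # 10 attempts, matching stop_after_attempt(10)
    print(f"attempt {attempt}: sleeping {backoff_delay(attempt):.1f}s")
```

The delays are random, so the printed values differ from run to run; only the bounds are fixed.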

def extract_metadata(content: str, prompt_path: str, model_id: str):
    """
    Use GPT model to extract metadata from the research paper content based on the given prompt.
    """
    prompt_data = read_prompt(prompt_path)

    try:
        response = completion_with_backoff(
            model=model_id,
            messages=[
                {"role": "system", "content": prompt_data},
                {"role": "user", "content": content}
            ],
            temperature=0.2,
        )

        response_content = response.choices[0].message.content
        # Process and return the extracted metadata
        # ...
    except Exception as e:
        print(f"Error calling OpenAI API: {e}")
        return {}

By sending the paper content along with the prompt, the gpt-4o model extracts the structured information specified in the prompt.
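
The processing step is elided in the function above. One possible way to turn the reply text into a dictionary is sketched below, assuming the model returns JSON that may or may not be wrapped in a Markdown code fence. The helper name parse_llm_json is hypothetical and this is not necessarily what the original notebook does.

```python
import json

FENCE = "`" * 3  # Markdown code fence marker that models sometimes wrap JSON in


def parse_llm_json(response_content: str) -> dict:
    """Parse the model's reply into a dict, tolerating a code fence around it."""
    text = response_content.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag).
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence if there is one.
        text = text.rsplit(FENCE, 1)[0]
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Mirror extract_metadata, which returns an empty dict on failure.
        return {}


reply = FENCE + 'json\n{"PaperTitle": "A Sample Paper"}\n' + FENCE
print(parse_llm_json(reply))
```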

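Putting it all together

With all the logic in place, we can use the process_research_paper function to run the whole pipeline end to end on a single PDF file, from extracting the expected metadata to saving the final result in .json format.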

def process_research_paper(pdf_path: str, prompt: str,
                           output_folder: str, model_id: str):
    """
    Process a single research paper through the entire pipeline.
    """
    print(f"Processing research paper: {pdf_path}")

    try:
        # Step 1: Extract text content from the PDF
        content = extract_text_from_pdf(pdf_path)

        # Step 2: Extract metadata using GPT model
        metadata = extract_metadata(content, prompt, model_id)

        # Step 3: Save the result as a JSON file
        output_filename = Path(pdf_path).stem + '.json'
        output_path = os.path.join(output_folder, output_filename)

        with open(output_path, 'w') as f:
            json.dump(metadata, f, indent=2)
        print(f"Saved metadata to {output_path}")

    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")

Here is an example of applying that logic to a single document:

# Example for a single document

pdf_path = "./data/1706.03762v7.pdf"
prompt_path =  "./data/prompts/scientific_papers_prompt.txt"
output_folder = "./data/extracted_metadata"

process_research_paper(pdf_path, prompt_path, output_folder, model_ID)

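Processing steps of the PDF document (Image by Author)

From the image above, we can see that the resulting .json is saved in the ./data/extracted_metadata/ folder as 1706.03762v7.json, which is exactly the PDF's name with a different extension.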
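
The output name comes from the Path(pdf_path).stem call in process_research_paper. The short sketch below isolates that step to show that the dot inside the arXiv identifier survives.

```python
import os
from pathlib import Path

pdf_path = "./data/1706.03762v7.pdf"
output_folder = "./data/extracted_metadata"

# Path.stem drops only the final suffix (".pdf"), so the dot inside
# the arXiv identifier is preserved.
output_filename = Path(pdf_path).stem + ".json"
output_path = os.path.join(output_folder, output_filename)
print(output_filename)
```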

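The content of the json file is shown below, along with the research paper in which the target attributes to extract are highlighted:

Original paper with the target attributes to extract (Image by Author)

From the json data, we can see that all the attributes were extracted successfully. It is also nice that, since the institution of Illia Polosukhin is not provided in the paper, the AI left that field empty.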

{
  "PaperTitle": "Attention Is All You Need",
  "PublicationYear": "2017",
  "Authors": [
    "Ashish Vaswani",
    "Noam Shazeer",
    "Niki Parmar",
    "Jakob Uszkoreit",
    "Llion Jones",
    "Aidan N. Gomez",
    "Lukasz Kaiser",
    "Illia Polosukhin"
  ],
  "AuthorContact": [
    {
      "Name": "Ashish Vaswani",
      "Institution": "Google Brain",
      "Email": "avaswani@google.com"
    },
    {
      "Name": "Noam Shazeer",
      "Institution": "Google Brain",
      "Email": "noam@google.com"
    },
    {
      "Name": "Niki Parmar",
      "Institution": "Google Research",
      "Email": "nikip@google.com"
    },
    {
      "Name": "Jakob Uszkoreit",
      "Institution": "Google Research",
      "Email": "usz@google.com"
    },
    {
      "Name": "Llion Jones",
      "Institution": "Google Research",
      "Email": "llion@google.com"
    },
    {
      "Name": "Aidan N. Gomez",
      "Institution": "University of Toronto",
      "Email": "aidan@cs.toronto.edu"
    },
    {
      "Name": "Lukasz Kaiser",
      "Institution": "Google Brain",
      "Email": "lukaszkaiser@google.com"
    },
    {
      "Name": "Illia Polosukhin",
      "Institution": "",
      "Email": "illia.polosukhin@gmail.com"
    }
  ],
  "Abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.",
  "SummaryAbstract": "The paper introduces the Transformer, a novel network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions. The Transformer achieves superior performance on machine translation tasks, setting new state-of-the-art BLEU scores while being more parallelizable and requiring less training time. Additionally, it generalizes well to other tasks such as English constituency parsing."
}

In addition, the value of the extra Summary Abstract attribute is shown below. It summarizes the initial abstract well while respecting the two-to-three sentence limit given in the prompt.

The paper introduces the Transformer, a novel network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions.
The Transformer achieves superior performance on machine translation tasks, setting new state-of-the-art BLEU scores while being more parallelizable and requiring less training time.
Additionally, it generalizes well to other tasks such as English constituency parsing.

Now that the pipeline works for a single document, we can implement the logic to run it on all the documents in a given folder, which is done with the process_directory function.

It processes each file and saves the result to the same extracted_metadata folder.

# Parse documents from a folder
def process_directory(prompt_path: str, directory_path: str, output_folder: str, model_id: str):
    """
    Process all PDF files in the given directory.
    """

    # Iterate through all files in the directory
    for filename in os.listdir(directory_path):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(directory_path, filename)
            process_research_paper(pdf_path, prompt_path, output_folder, model_id)

Here is how to call the function with the right arguments.

# Define paths
prompt_path = "./data/prompts/scientific_papers_prompt.txt"
directory_path = "./data"
output_folder = "./data/extracted_metadata"

process_directory(prompt_path, directory_path, output_folder, model_ID)

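A successful run displays the following messages, and we can see that each research paper has been processed.

Processing steps of the research papers (Image by Author)

As with the previous paper, the content of the final json file for the YOLOv5 paper is shown below.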

{
  "PaperTitle": "Performance Study of YOLOv5 and Faster R-CNN for Autonomous Navigation around Non-Cooperative Targets",
  "PublicationYear": "2022",
  "Authors": [
    "Trupti Mahendrakar",
    "Andrew Ekblad",
    "Nathan Fischer",
    "Ryan T. White",
    "Markus Wilde",
    "Brian Kish",
    "Isaac Silver"
  ],
  "AuthorContact": [
    {
      "Name": "Trupti Mahendrakar",
      "Institution": "Florida Institute of Technology",
      "Email": "tmahendrakar2020@my.fit.edu"
    },
    {
      "Name": "Andrew Ekblad",
      "Institution": "Florida Institute of Technology",
      "Email": "aekblad2019@my.fit.edu"
    },
    {
      "Name": "Nathan Fischer",
      "Institution": "Florida Institute of Technology",
      "Email": "nfischer2018@my.fit.edu"
    },
    {
      "Name": "Ryan T. White",
      "Institution": "Florida Institute of Technology",
      "Email": "rwhite@my.fit.edu"
    },
    {
      "Name": "Markus Wilde",
      "Institution": "Florida Institute of Technology",
      "Email": "mwilde@fit.edu"
    },
    {
      "Name": "Brian Kish",
      "Institution": "Florida Institute of Technology",
      "Email": "bkish@fit.edu"
    },
    {
      "Name": "Isaac Silver",
      "Institution": "Energy Management Aerospace",
      "Email": "isaac@energymanagementaero.com"
    }
  ],
  "Abstract": "Autonomous navigation and path-planning around non-cooperative space objects is an enabling technology for on-orbit servicing and space debris removal systems. The navigation task includes the determination of target object motion, the identification of target object features suitable for grasping, and the identification of collision hazards and other keep-out zones. Given this knowledge, chaser spacecraft can be guided towards capture locations without damaging the target object or without unduly the operations of a servicing target by covering up solar arrays or communication antennas. One way to autonomously achieve target identification, characterization and feature recognition is by use of artificial intelligence algorithms. This paper discusses how the combination of cameras and machine learning algorithms can achieve the relative navigation task. The performance of two deep learning-based object detection algorithms, Faster Region-based Convolutional Neural Networks (R-CNN) and You Only Look Once (YOLOv5), is tested using experimental data obtained in formation flight simulations in the ORION Lab at Florida Institute of Technology. The simulation scenarios vary the yaw motion of the target object, the chaser approach trajectory, and the lighting conditions in order to test the algorithms in a wide range of realistic and performance limiting situations. The data analyzed include the mean average precision metrics in order to compare the performance of the object detectors. The paper discusses the path to implementing the feature recognition algorithms and towards integrating them into the spacecraft Guidance Navigation and Control system.",
  "SummaryAbstract": "This paper evaluates the performance of two deep learning-based object detection algorithms, YOLOv5 and Faster R-CNN, for autonomous navigation around non-cooperative space objects. Experimental data from formation flight simulations were used to test the algorithms under various conditions. The study found that while Faster R-CNN is more accurate, YOLOv5 offers significantly faster inference times, making it more suitable for real-time applications."
}

The AI produced the following summary of the initial abstract and, once again, it looks great!

This paper evaluates the performance of two deep learning-based object detection algorithms, YOLOv5 and Faster R-CNN, for autonomous navigation around non-cooperative space objects. 
Experimental data from formation flight simulations were used to test the algorithms under various conditions. 
The study found that while Faster R-CNN is more accurate, YOLOv5 offers significantly faster inference times, making it more suitable for real-time applications.
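
Because each paper ends up as its own JSON file, the results can be loaded back for further analysis. The sketch below shows one way to do that; the helper name load_extracted_metadata is hypothetical and not part of the original notebook.

```python
import json
from pathlib import Path


def load_extracted_metadata(folder: str) -> list:
    """Load every .json file in the folder into a list of dicts, sorted by file name."""
    records = []
    for json_file in sorted(Path(folder).glob("*.json")):
        with open(json_file, "r", encoding="utf-8") as f:
            records.append(json.load(f))
    return records


# Hypothetical usage with the folder from this article:
# papers = load_extracted_metadata("./data/extracted_metadata")
# titles = [paper.get("PaperTitle", "") for paper in papers]
```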