Document Parsing with Large Language Models

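Motivation

For years, regular expressions were my go-to tool for parsing documents, and I believe the same is true for many technical people and industries. Although regular expressions are powerful in some situations, they often lack the flexibility needed to handle the complexity and variety of real-world documents.

Large language models (LLMs), on the other hand, offer a more powerful and flexible approach that can handle many kinds of document structure and content.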

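Document Processing Workflow with LLMs

A typical document parsing workflow is outlined below. To keep things simple, we use research paper processing as the running example.

The workflow has three main components: input, processing, and output.
First, the documents, here research papers in PDF format, are submitted for processing.
The first module of the processing component extracts the raw text from each PDF and combines it with a prompt containing the instructions that tell the LLM what to extract.
The LLM then uses the prompt to extract all the metadata.
For each PDF, the final result is saved in JSON format and can be used for further analysis.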

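Advantages of LLMs over Regular Expressions

Regular expressions (regex) have significant limitations when dealing with the structural complexity of research papers. The two approaches are compared in more detail below.

1. Flexibility with document structure

Regex needs a specific pattern for each document structure and fails when a document deviates from the expected format.
LLMs adapt to a wide range of document structures and can identify the relevant information wherever it appears in the document.

2. Contextual understanding

Regex matches patterns without any understanding of context or meaning.
LLMs have a more nuanced grasp of what each document means, which lets them extract the relevant information more accurately.

3. Maintenance and scalability

Regex has to be updated whenever document formats change, and supporting a new type of information means writing a new regular expression.
LLMs can be adapted to new document types with only minimal changes to the initial prompt, which makes them more scalable.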
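
To make the first point concrete, here is a small sketch of how a pattern written for one layout misses another. The two header strings are illustrative stand-ins, not text copied from the experiment PDFs, and EMAIL_PATTERN and find_emails are hypothetical names chosen for this example.

```python
import re

# Hypothetical pattern that expects one plain address per author.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_emails(text: str) -> list[str]:
    """Return every substring that looks like a plain email address."""
    return EMAIL_PATTERN.findall(text)


# Illustrative header lines (assumed formats, not taken from the PDFs).
plain_header = "Jane Doe jane.doe@example.org, John Roe john@example.org"
grouped_header = "Jane Doe, John Roe {jane.doe,john}@example.org"

print(find_emails(plain_header))    # both addresses are found
print(find_emails(grouped_header))  # nothing is found: the brace group breaks the pattern
```

Supporting the grouped form would take a second pattern plus logic to expand the group, which is the kind of maintenance cost described in point 3. With an LLM, the prompt only has to ask for each author's email address.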

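Building the Document Parsing Workflow

These advantages make a strong case for using LLMs to parse complex documents such as research papers.

The experiments use two papers:

"Attention Is All You Need", from arXiv (1706.03762)
"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", from arXiv (1810.04805)

This section walks through all the steps needed to build a real-world document parsing system with LLMs. The code is written for a Google Colab notebook; to run it locally, replace the Colab-specific credential handling.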

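Code Structure

project
   |
   |---- Extract_Metadata_With_Large_Language_Models.ipynb
   |
   |---- data
           |
           |---- extracted_metadata/
           |---- 1706.03762v7.pdf
           |---- 1810.04805.pdf
           |---- prompts
                   |
                   |------ scientific_papers_prompt.txt

The project folder is the root folder and contains the data folder and the notebook.
The data folder contains two folders, extracted_metadata and prompts, along with the two papers.
The extracted_metadata folder is empty for now and will hold the json files.
The prompts folder holds the prompt in plain-text format.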

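Metadata to Extract

We first need a clear idea of which attributes to extract. To keep things simple, we focus on six attributes for our scenario:

Paper Title
Publication Year
Authors
Author Contact
Abstract
Summary Abstract (a condensed version of the abstract)

These attributes are then used to define the prompt. Successful parsing depends on a prompt that clearly explains what each attribute means and in which format the final result should be returned.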

Scientific research paper:
---
{document}
---

You are an expert in analyzing scientific research papers. Please carefully read the provided research paper above and extract the following key information:

Extract these six (6) properties from the research paper:
- Paper Title: The full title of the research paper
- Publication Year: The year the paper was published
- Authors: The full names of all authors of the paper
- Author Contact: A list of dictionaries, where each dictionary contains the following keys for each author:
  - Name: The full name of the author
  - Institution: The institutional affiliation of the author
  - Email: The email address of the author (if provided)
- Abstract: The full text of the paper's abstract
- Summary Abstract: A concise summary of the abstract in 2-3 sentences, highlighting the key points

Guidelines:
- The extracted information should be factual and accurate to the document.
- Be extremely concise, except for the Abstract which should be copied in full.
- The extracted entities should be self-contained and easily understood without the rest of the paper.
- If any property is missing from the paper, please leave the field empty rather than guessing.
- For the Summary Abstract, focus on the main objectives, methods, and key findings of the research.
- For Author Contact, create an entry for each author, even if some information is missing. If an email or institution is not provided for an author, leave that field empty in the dictionary.

Answer in JSON format. The JSON should contain 6 keys: "PaperTitle", "PublicationYear", "Authors", "AuthorContact", "Abstract", and "SummaryAbstract". The "AuthorContact" should be a list of dictionaries as described above.

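The prompt has six parts, each explained below.

1. Document placeholder

Scientific research paper:
---
{document}
---

The placeholder is written with the {} notation and marks where the full text of the document is inserted for analysis.

2. Role assignment

The model is given a role so that it performs the task better. The following line sets the context and instructs the model to act as an expert in analyzing scientific research papers.

You are an expert in analyzing scientific research papers.

3. Extraction instruction

This part specifies which pieces of information should be extracted from the document.

Extract these six (6) properties from the research paper:

4. Property definitions

Each of the properties above is defined here, with details about what to include and how to format it. For example, Author Contact is a list of dictionaries holding the name, institution, and email of each author.

5. Guidelines

The guidelines tell the model which rules to follow during extraction, such as staying accurate to the document and how to handle missing information.

6. Expected output format

This is the final part. It specifies the exact format of the answer, which is JSON.

Answer in JSON format. The JSON should contain 6 keys: ...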
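
The first and last parts of the prompt can be sketched in a few lines: filling the {document} placeholder with the paper text, and checking that an answer contains the six requested keys. This is a minimal sketch rather than part of the original notebook; fill_prompt, missing_keys, and EXPECTED_KEYS are hypothetical names, and the template and answer strings are shortened examples.

```python
import json

# The six keys requested at the end of the prompt.
EXPECTED_KEYS = {
    "PaperTitle", "PublicationYear", "Authors",
    "AuthorContact", "Abstract", "SummaryAbstract",
}


def fill_prompt(template: str, document: str) -> str:
    """Insert the paper text at the {document} placeholder.

    str.replace is used instead of str.format so that any other braces
    in the template are left untouched.
    """
    return template.replace("{document}", document)


def missing_keys(answer: str) -> set[str]:
    """Parse the model's JSON answer and report which expected keys are absent."""
    parsed = json.loads(answer)
    return EXPECTED_KEYS - set(parsed)


# Shortened stand-ins for the real prompt and a model answer.
template = "Scientific research paper:\n---\n{document}\n---\nAnswer in JSON format."
print(fill_prompt(template, "Attention Is All You Need ..."))
print(missing_keys('{"PaperTitle": "Attention Is All You Need", "Authors": []}'))
```

A non-empty result from missing_keys is a cheap signal that the answer should be inspected or the request repeated before the file is saved.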

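Installing the Required Libraries

Let's start by installing the required libraries. Our document parsing system is built from several libraries; the main ones for each component are:

PDF processing: pdfminer.six, PyPDF2, and poppler-utils handle various PDF formats and structures.
Text extraction: unstructured and its companion packages (unstructured-inference, unstructured-pytesseract) extract content from documents.
OCR: tesseract-ocr recognizes text in images and scanned documents.
Image processing: pillow-heif handles image processing tasks.
AI integration: the openai library gives access to GPT models during information extraction.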

%%bash

pip -qqq install pdfminer.six
pip -qqq install pillow-heif==0.3.2
pip -qqq install matplotlib
pip -qqq install unstructured-inference
pip -qqq install unstructured-pytesseract
pip -qqq install tesseract-ocr
pip -qqq install unstructured
pip -qqq install openai
pip -qqq install PyPDF2

# Refresh the package index first (Colab runs as root; add sudo elsewhere)
apt-get update
apt-get install -y -V tesseract-ocr
apt-get install -y -V libtesseract-dev
apt-get install -y -V poppler-utils
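
Because the OCR and PDF steps rely on system programs rather than Python packages, it can help to confirm that they are on the PATH before going further. The check below is a small sketch, not part of the original notebook; check_system_tools and REQUIRED_TOOLS are hypothetical names, and the list of tools is an assumption based on the packages installed above.

```python
import shutil

# Command-line programs expected after the apt installs:
# tesseract comes from tesseract-ocr, pdftotext from poppler-utils.
REQUIRED_TOOLS = ["tesseract", "pdftotext"]


def check_system_tools(tools=REQUIRED_TOOLS) -> dict[str, bool]:
    """Map each tool name to True if an executable with that name is on the PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


for tool, found in check_system_tools().items():
    print(f"{tool}: {'found' if found else 'MISSING'}")
```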

Once the installation succeeds, import the libraries as follows:

import os
import re
import json
import openai
from pathlib import Path
from openai import OpenAI
from PyPDF2 import PdfReader
from google.colab import userdata
from unstructured.partition.pdf import partition_pdf
from tenacity import retry, wait_random_exponential, stop_after_attempt

Setting Up Credentials

Before getting into the core functionality, we need to set up the environment with the necessary API credentials.

OPENAI_API_KEY = userdata.get('OPEN_AI_KEY')
model_ID = userdata.get('GPT_MODEL')
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

client = OpenAI(api_key = OPENAI_API_KEY)

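Here, we use the userdata.get() function to access the credentials securely in Google Colab.
We also retrieve the ID of the GPT model to use, which is gpt-4o in our use case.

Storing the credentials as secrets and environment variables instead of hard-coding them keeps the API key out of the notebook while leaving the choice of model flexible. It is also a better way to manage API keys and models, especially when working across different environments or multiple projects.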
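
Outside Colab, userdata is not available, so the same values can be read from environment variables instead. The sketch below is one way to do that under stated assumptions: load_credentials is a hypothetical helper, GPT_MODEL is an assumed variable name, and gpt-4o is used as the fallback model because it is the one used in this article.

```python
import os


def load_credentials(env=os.environ, default_model="gpt-4o"):
    """Read the API key and model ID from environment variables.

    OPENAI_API_KEY and GPT_MODEL are the variable names assumed by this sketch.
    """
    api_key = env.get("OPENAI_API_KEY")
    if not api_key:
        raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")
    return api_key, env.get("GPT_MODEL", default_model)


# Usage outside Colab (commented out so the sketch runs without a key):
# OPENAI_API_KEY, model_ID = load_credentials()
```

Passing the environment as a parameter keeps the helper easy to test with a plain dictionary.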

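Implementing the Workflow

We now have everything needed to build the end-to-end workflow. It is time to implement each workflow component, starting with the data processing helper function.

1. Data processing

The first step of the workflow is to preprocess the PDF file and extract its text content, which is done by the extract_text_from_pdf function.

It takes the path of a PDF file as input and returns the file's content as raw text.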

def extract_text_from_pdf(pdf_path: str):
    """
    Extract text content from a PDF file using the unstructured library.
    """
    elements = partition_pdf(pdf_path, strategy="hi_res")
    return "\n".join([str(element) for element in elements])

2. Reading the prompt

The prompt is stored in a separate .txt file and loaded with the following function.

def read_prompt(prompt_path: str):
    """
    Read the prompt for research paper parsing from a text file.
    """
    with open(prompt_path, "r") as f:
        return f.read()

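3. Metadata extraction

This function is the core of the workflow. It uses the OpenAI API to process the content of a given PDF file.

Without the @retry decorator, we may run into Error Code 429 - Rate limit reached for requests, which happens when we exceed the API's rate limit during processing. Rather than failing at the first such error, we want the function to keep retrying, here for up to 10 attempts.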

@retry(wait=wait_random_exponential(min=1, max=120), stop=stop_after_attempt(10))
def completion_with_backoff(**kwargs):
    return client.chat.completions.create(**kwargs)

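Calling the API through completion_with_backoff inside extract_metadata has the following effect:

When a call fails, the function waits a random interval of between 1 and 120 seconds and then reruns it, giving up after 10 attempts.
The upper limit of that interval grows with each retry but never exceeds 120 seconds.
This technique is called exponential backoff, and it is useful for handling API rate limits as well as other temporary failures.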
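
The waiting schedule can be illustrated with a simplified model. This is not tenacity's own code, and the exact schedule may differ between tenacity versions; backoff_upper_bound and sample_wait are hypothetical helpers that assume the upper limit doubles on each retry and is clamped to the range of 1 to 120 seconds.

```python
import random


def backoff_upper_bound(attempt: int, min_wait: float = 1, max_wait: float = 120) -> float:
    """Largest wait allowed before retry number `attempt` (1-based).

    The bound doubles with every attempt and is clamped to [min_wait, max_wait].
    """
    return max(min_wait, min(2 ** (attempt - 1), max_wait))


def sample_wait(attempt: int, min_wait: float = 1, max_wait: float = 120) -> float:
    """Draw a random wait between min_wait and the current upper bound."""
    return random.uniform(min_wait, backoff_upper_bound(attempt, min_wait, max_wait))


# Ten attempts in total means at most nine waits between them.
for attempt in range(1, 10):
    bound = backoff_upper_bound(attempt)
    print(f"retry {attempt}: wait between 1 and {bound} s, e.g. {sample_wait(attempt):.1f} s")
```

The randomness spreads out retries from concurrent callers so that they do not all hit the API again at the same moment.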

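def extract_metadata(content: str, prompt_path: str, model_id: str) -> dict:
    """
    Use GPT model to extract metadata from the research paper content based on the given prompt.
    """
    # Insert the paper text at the {document} placeholder of the prompt template
    prompt_data = read_prompt(prompt_path).replace("{document}", content)

    try:
        response = completion_with_backoff(
            model=model_id,
            messages=[{"role": "user", "content": prompt_data}],
            temperature=0.2,
        )

        response_content = response.choices[0].message.content
        # Strip a Markdown code fence (```json ... ```) if the model added one,
        # then parse the JSON answer into a dictionary
        cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", response_content.strip())
        return json.loads(cleaned)
    except Exception as e:
        # Covers both API failures and answers that are not valid JSON
        print(f"Error extracting metadata: {e}")
        return {}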

By sending the paper content together with the prompt, the gpt-4o model extracts the structured information specified in the prompt.

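Putting It All Together

With all the logic in place, the process_research_paper function runs the pipeline end to end for a single PDF file, from extracting the expected metadata to saving the final result in .json format.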

def process_research_paper(pdf_path: str, prompt: str,
                           output_folder: str, model_id: str):
    """
    Process a single research paper through the entire pipeline.
    """
    print(f"Processing research paper: {pdf_path}")

    try:
        # Step 1: Extract text content from the PDF
        content = extract_text_from_pdf(pdf_path)

        # Step 2: Extract metadata using GPT model
        metadata = extract_metadata(content, prompt, model_id)

        # Step 3: Save the result as a JSON file
        output_filename = Path(pdf_path).stem + '.json'
        output_path = os.path.join(output_folder, output_filename)

        with open(output_path, 'w') as f:
            json.dump(metadata, f, indent=2)
        print(f"Saved metadata to {output_path}")

    except Exception as e:
        print(f"Error processing {pdf_path}: {e}")

Here is an example of applying the logic to a single document:


# Example for a single document

pdf_path = "./data/1706.03762v7.pdf"
prompt_path =  "./data/prompts/scientific_papers_prompt.txt"
output_folder = "./data/extracted_metadata"

process_research_paper(pdf_path, prompt_path, output_folder, model_ID)

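The generated .json file is saved in the ./data/extracted_metadata/ folder under the name 1706.03762v7.json, the same name as the PDF but with a different extension.

The content of the json file is shown below. All the target attributes were extracted successfully. Better still, the paper does not give an institution for Illia Polosukhin, and the model left that field empty rather than guessing.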

{
  "PaperTitle": "Attention Is All You Need",
  "PublicationYear": "2017",
  "Authors": [
    "Ashish Vaswani",
    "Noam Shazeer",
    "Niki Parmar",
    "Jakob Uszkoreit",
    "Llion Jones",
    "Aidan N. Gomez",
    "Lukasz Kaiser",
    "Illia Polosukhin"
  ],
  "AuthorContact": [
    {
      "Name": "Ashish Vaswani",
      "Institution": "Google Brain",
      "Email": "avaswani@google.com"
    },
    {
      "Name": "Noam Shazeer",
      "Institution": "Google Brain",
      "Email": "noam@google.com"
    },
    {
      "Name": "Niki Parmar",
      "Institution": "Google Research",
      "Email": "nikip@google.com"
    },
    {
      "Name": "Jakob Uszkoreit",
      "Institution": "Google Research",
      "Email": "usz@google.com"
    },
    {
      "Name": "Llion Jones",
      "Institution": "Google Research",
      "Email": "llion@google.com"
    },
    {
      "Name": "Aidan N. Gomez",
      "Institution": "University of Toronto",
      "Email": "aidan@cs.toronto.edu"
    },
    {
      "Name": "Lukasz Kaiser",
      "Institution": "Google Brain",
      "Email": "lukaszkaiser@google.com"
    },
    {
      "Name": "Illia Polosukhin",
      "Institution": "",
      "Email": "illia.polosukhin@gmail.com"
    }
  ],
  "Abstract": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.",
  "SummaryAbstract": "The paper introduces the Transformer, a novel network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions. The Transformer achieves superior performance on machine translation tasks, setting new state-of-the-art BLEU scores while being more parallelizable and requiring less training time. Additionally, it generalizes well to other tasks such as English constituency parsing."
}

In addition, the value of the extra Summary Abstract attribute is shown below. It condenses the original abstract while staying within the two-to-three-sentence limit set in the prompt.

The paper introduces the Transformer, a novel network architecture based solely on attention mechanisms, eliminating the need for recurrence and convolutions.
The Transformer achieves superior performance on machine translation tasks, setting new state-of-the-art BLEU scores while being more parallelizable and requiring less training time.
Additionally, it generalizes well to other tasks such as English constituency parsing.

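Now that the pipeline works for a single document, we can add the logic to run it on every document in a given folder. This is done by the process_directory function, which processes each PDF file and saves its result to the same extracted_metadata folder.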

# Parse documents from a folder
def process_directory(prompt_path: str, directory_path: str, output_folder: str, model_id: str):
    """
    Process all PDF files in the given directory.
    """

    # Iterate through all files in the directory
    for filename in os.listdir(directory_path):
        if filename.lower().endswith('.pdf'):
            pdf_path = os.path.join(directory_path, filename)
            process_research_paper(pdf_path, prompt_path, output_folder, model_id)

Here is how to call the function with the right arguments.

# Define paths
prompt_path = "./data/prompts/scientific_papers_prompt.txt"
directory_path = "./data"
output_folder = "./data/extracted_metadata"

process_directory(prompt_path, directory_path, output_folder, model_ID)

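When processing succeeds, the function prints a "Saved metadata to ..." message for each research paper, confirming that every paper has been processed.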
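
As a quick check on the results, the saved files can be read back and summarized. The helper below is a minimal sketch and not part of the original notebook; load_metadata is a hypothetical name, and the folder path assumes the layout used in this article.

```python
import json
from pathlib import Path


def load_metadata(folder: str) -> dict[str, dict]:
    """Load every .json file in `folder`, keyed by file name without extension."""
    results = {}
    for path in sorted(Path(folder).glob("*.json")):
        with open(path, encoding="utf-8") as f:
            results[path.stem] = json.load(f)
    return results


# Usage, assuming the folder layout used in this article:
for name, metadata in load_metadata("./data/extracted_metadata").items():
    print(name, "->", metadata.get("PaperTitle", "<no title extracted>"))
```

A paper whose extraction failed is saved as an empty dictionary by the pipeline, so it shows up here with the placeholder title.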

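Conclusion

This article gave a brief overview of applying LLMs to metadata extraction from complex documents. The extracted json data can be stored in a non-relational database for further analysis. LLMs and regular expressions each have strengths and weaknesses for content extraction, and each should be chosen according to the use case.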