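Elasticsearch: Parse PDF text and table data with Azure AI Document Intelligence

Author: James Williams, Elastic

Learn how to use Azure AI Document Intelligence to parse PDF documents that contain text and table data.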

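Azure AI Document Intelligence is a powerful tool for extracting structured data from PDFs. It extracts both text and table data effectively. The extracted data can then be indexed into Elastic Cloud Serverless to support RAG (Retrieval Augmented Generation).

In this blog, we demonstrate Azure AI Document Intelligence by ingesting the four most recent Elastic N.V. quarterly reports. The PDFs range from 43 to 196 pages, and each one contains both text and table data. We will test retrieval of the table data with the following prompt: Compare/contrast subscription revenue for Q2-2025, Q1-2025, Q4-2024 and Q3-2024?

This prompt is tricky because it needs context from four different PDFs, and each PDF presents the relevant information in tabular form.

Let's walk through an end-to-end reference example that consists of two main parts: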

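Python notebook

Downloads four quarters of Elastic N.V. 10-Q filings as PDFs
Uses Azure AI Document Intelligence to parse the text and table data in each PDF file
Outputs the text and table data to JSON files
Ingests the JSON files into Elastic Cloud Serverless

Elastic Cloud Serverless

Creates vector embeddings for the PDF text and table data
Serves vector search database queries for RAG
Provides a pre-configured OpenAI connector for LLM integration
Provides an A/B testing interface for chatting with the 10-Q filings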

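Prerequisites

The code blocks in this notebook require API keys for Azure AI Document Intelligence and Elasticsearch. The best starting point for Azure AI Document Intelligence is to create a Document Intelligence resource. For Elastic Cloud Serverless, refer to the getting started guide. You need Python 3.9+ to run these code blocks.

Create a .env file

Put the secrets for Azure AI Document Intelligence and Elastic Cloud Serverless in a .env file.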

AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT=YOUR_AZURE_RESOURCE_ENDPOINT
AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY=YOUR_AZURE_RESOURCE_API_KEY

ES_URL=YOUR_ES_URL
ES_API_KEY=YOUR_ES_API_KEY

Install Python packages

!pip install elasticsearch python-dotenv tqdm azure-core azure-ai-documentintelligence requests httpx

Create input and output folders

import os

input_folder_pdf = "./pdf"
output_folder_pdf = "./json"

folders = [input_folder_pdf, output_folder_pdf]

def create_folders_if_not_exist(folders):
    for folder in folders:
        os.makedirs(folder, exist_ok=True)
        print(f"Folder '{folder}' created or already exists.")

create_folders_if_not_exist(folders)

Download PDF files

Download the four most recent Elastic 10-Q quarterly reports. If you already have the PDF files, you can place them in the './pdf' folder instead.

import os
import requests

def download_pdf(url, directory='./pdf', filename=None):
    if not os.path.exists(directory):
        os.makedirs(directory)
    
    response = requests.get(url)
    if response.status_code == 200:
        if filename is None:
            filename = url.split('/')[-1]
        filepath = os.path.join(directory, filename)
        with open(filepath, 'wb') as file:
            file.write(response.content)
        print(f"Downloaded {filepath}")
    else:
        print(f"Failed to download file from {url}")

print("Downloading 4 recent 10-Q reports for Elastic NV.")
base_url = 'https://s201.q4cdn.com/217177842/files/doc_financials'
download_pdf(f'{base_url}/2025/q2/e5aa7a0a-6f56-468d-a5bd-661792773d71.pdf', filename='elastic-10Q-Q2-2025.pdf')
download_pdf(f'{base_url}/2025/q1/18656e06-8107-4423-8e2b-6f2945438053.pdf', filename='elastic-10Q-Q1-2025.pdf')
download_pdf(f'{base_url}/2024/q4/9949f03b-09fb-4941-b105-62a304dc1411.pdf', filename='elastic-10Q-Q4-2024.pdf')
download_pdf(f'{base_url}/2024/q3/7e60e3bd-ff50-4ae8-ab12-5b3ae19420e6.pdf', filename='elastic-10Q-Q3-2024.pdf')
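
The download helper above writes whatever body comes back with a 200 status, so an error page could end up saved with a .pdf name. As a minimal sketch, the check below looks for the `%PDF-` marker that PDF files normally start with. The helper names `looks_like_pdf` and `non_pdf_files` are hypothetical and not part of the original notebook.

```python
import os


def looks_like_pdf(filepath):
    """Return True if the file starts with the usual PDF marker '%PDF-'."""
    with open(filepath, "rb") as f:
        return f.read(5) == b"%PDF-"


def non_pdf_files(directory="./pdf"):
    """List *.pdf files in the directory whose content does not look like a PDF."""
    if not os.path.isdir(directory):
        return []
    return sorted(
        name
        for name in os.listdir(directory)
        if name.endswith(".pdf")
        and not looks_like_pdf(os.path.join(directory, name))
    )


# An empty list means every *.pdf file in ./pdf passed the check.
print(non_pdf_files())
```

Any file name printed here is worth downloading again before it is sent to Azure AI Document Intelligence.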

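Parse PDFs with Azure AI Document Intelligence

There is a lot going on in the code blocks that parse the PDF files. Here is a quick summary:

1. Set up the Azure AI Document Intelligence imports and environment variables
2. Parse PDF paragraphs using AnalyzeResult
3. Parse PDF tables using AnalyzeResult
4. Combine the PDF paragraph and table data
5. Bring it all together by running steps 1-4 on every PDF file and storing the results as JSON

Set up the Azure AI Document Intelligence imports and environment variables

The most important import is AnalyzeResult. This class represents the result of a document analysis and contains detailed information about the document. The details we care about are the pages, paragraphs, and tables.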

import os
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.ai.documentintelligence.models import AnalyzeResult
from azure.ai.documentintelligence.models import AnalyzeDocumentRequest
import json
from dotenv import load_dotenv
from tqdm import tqdm

load_dotenv()

AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT =  os.getenv('AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT')
AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY = os.getenv('AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY')
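
Because `os.getenv` returns `None` for a variable that is not set, a typo in the .env file only shows up later as a confusing client error. The following is a minimal sketch of an early check. `missing_env_vars` is a hypothetical helper, and the example environment is made up for illustration.

```python
import os

# The four variables this notebook expects to find in the .env file.
REQUIRED_VARS = [
    "AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT",
    "AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY",
    "ES_URL",
    "ES_API_KEY",
]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


# Made-up environment: both Azure values are absent and ES_API_KEY is empty.
example_env = {"ES_URL": "https://example.invalid", "ES_API_KEY": ""}
print(missing_env_vars(REQUIRED_VARS, example_env))
```

In the notebook, calling `missing_env_vars(REQUIRED_VARS)` right after `load_dotenv()` and stopping when the list is non-empty would surface a missing key before any API call is made.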

Parse PDF paragraphs using AnalyzeResult

Extract the paragraph text from each page. Do not extract table data.

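def parse_paragraphs(analyze_result):
    # Collect the offsets of all table cell spans up front. Table cell text
    # can also show up in analyze_result.paragraphs, so this lets us skip it.
    table_offsets = [
        span.offset
        for table in (analyze_result.tables or [])
        for cell in table.cells
        for span in (cell.spans or [])
    ]
    page_content = {}

    for paragraph in analyze_result.paragraphs or []:
        for span in paragraph.spans:
            if span.offset not in table_offsets:
                for region in paragraph.bounding_regions:
                    page_number = region.page_number
                    if page_number not in page_content:
                        page_content[page_number] = []
                    page_content[page_number].append({
                        "content_text": paragraph.content
                    })
    return page_content, table_offsets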

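Parse PDF tables using AnalyzeResult

Extract the table content from each page. Do not extract paragraph text. The most interesting side effect of this technique is that the table data does not need to be transformed. LLMs know how to read text that looks like "Cell [0, 1]: table data…".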

def parse_tables(analyze_result, table_offsets):
    page_content = {}

    for table in analyze_result.tables or []:
        if not table.bounding_regions:
            continue
        # A table may span several pages; file it under the page where it starts.
        page_number = table.bounding_regions[0].page_number

        table_data = []
        for cell in table.cells:
            for span in cell.spans:
                table_offsets.append(span.offset)
            table_data.append(f"Cell [{cell.row_index}, {cell.column_index}]: {cell.content}")

        if page_number not in page_content:
            page_content[page_number] = []

        page_content[page_number].append({
            "content_text": "\n".join(table_data)
        })

    return page_content
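
To make the "Cell [row, column]: content" format concrete, here is a small sketch that applies the same formatting to stand-in cell objects. The cell values are invented for illustration and are not taken from the filings. `SimpleNamespace` only mimics the `row_index`, `column_index`, and `content` attributes that the table parser reads.

```python
from types import SimpleNamespace


def format_cells(cells):
    """Render cells as one 'Cell [row, column]: content' line each."""
    return "\n".join(
        f"Cell [{cell.row_index}, {cell.column_index}]: {cell.content}"
        for cell in cells
    )


# Invented values for illustration only.
cells = [
    SimpleNamespace(row_index=0, column_index=0, content="Revenue type"),
    SimpleNamespace(row_index=0, column_index=1, content="Amount"),
    SimpleNamespace(row_index=1, column_index=0, content="Subscription"),
    SimpleNamespace(row_index=1, column_index=1, content="100"),
]

print(format_cells(cells))
# Cell [0, 0]: Revenue type
# Cell [0, 1]: Amount
# Cell [1, 0]: Subscription
# Cell [1, 1]: 100
```

The row and column indices carry the table structure, which is why the text can be handed to the LLM without rebuilding the table.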

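Combine the PDF paragraph and table data

Pre-process the chunking at the page level to preserve context, so that we can easily verify RAG retrieval by hand. As you will see later, this pre-processing does not have a negative effect on the RAG output.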

def combine_paragraphs_tables(filepath, paragraph_content, table_content):
    page_content_concatenated = {}
    structured_data = []

    # Combine paragraph and table content
    for p_number in sorted(set(paragraph_content) | set(table_content)):
        concatenated_text = ""

        if p_number in paragraph_content:
            for content in paragraph_content[p_number]:
                concatenated_text += content["content_text"] + "\n"

        if p_number in table_content:
            for content in table_content[p_number]:
                concatenated_text += content["content_text"] + "\n"
        
        page_content_concatenated[p_number] = concatenated_text.strip()

    # Append a single item per page to the structured_data list
    for p_number, concatenated_text in page_content_concatenated.items():
        structured_data.append({
            "page_number": p_number,
            "content_text": concatenated_text,
            "pdf_file": os.path.basename(filepath)
        })

    return structured_data

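Bring it all together

Open each PDF in the ./pdf folder, parse the text and table data, and save the result as a JSON file whose entries contain the page_number, content_text, and pdf_file fields. The content_text field holds the paragraph and table data for each page.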

pdf_files = [
    os.path.join(input_folder_pdf, file)
    for file in os.listdir(input_folder_pdf)
    if file.endswith(".pdf")
]

document_intelligence_client = DocumentIntelligenceClient(
    endpoint=AZURE_AI_DOCUMENT_INTELLIGENCE_ENDPOINT, 
    credential=AzureKeyCredential(AZURE_AI_DOCUMENT_INTELLIGENCE_API_KEY),
    connection_timeout=600 
)

for filepath in tqdm(pdf_files, desc="Parsing PDF files"):
    with open(filepath, "rb") as file:
        poller = document_intelligence_client.begin_analyze_document("prebuilt-layout",
            AnalyzeDocumentRequest(bytes_source=file.read())
        )

        analyze_result: AnalyzeResult = poller.result()
        
        paragraph_content, table_offsets = parse_paragraphs(analyze_result)
        table_content = parse_tables(analyze_result, table_offsets)
        structured_data = combine_paragraphs_tables(filepath, paragraph_content, table_content)

        # Convert the structured data to JSON format
        json_output = json.dumps(structured_data, indent=4)
        
        # Get the filename without the ".pdf" extension
        filename_without_ext = os.path.splitext(os.path.basename(filepath))[0]
        # Write the JSON output to a file
        output_json_file = f"{output_folder_pdf}/{filename_without_ext}.json"

        with open(output_json_file, "w") as json_file:
            json_file.write(json_output)
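
Before indexing, it can help to confirm that the JSON output has the expected shape. The sketch below counts page records per PDF and checks that each record carries the three fields named above. `summarize_json_folder` is a hypothetical helper, not part of the original notebook.

```python
import json
import os
from collections import Counter

REQUIRED_FIELDS = {"page_number", "content_text", "pdf_file"}


def summarize_json_folder(folder="./json"):
    """Count page records per PDF and verify each record has the expected fields."""
    pages_per_pdf = Counter()
    if not os.path.isdir(folder):
        return pages_per_pdf
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(folder, name)) as f:
            records = json.load(f)
        for record in records:
            missing = REQUIRED_FIELDS - record.keys()
            if missing:
                raise ValueError(f"{name}: record is missing {sorted(missing)}")
            pages_per_pdf[record["pdf_file"]] += 1
    return pages_per_pdf


if __name__ == "__main__":
    for pdf_file, count in summarize_json_folder().items():
        print(f"{pdf_file}: {count} pages")
```

The per-file counts can be compared against the page counts of the source PDFs to spot a document that was only partially parsed.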

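Load the data into Elastic Cloud Serverless

The following code blocks handle:

Setting up the imports for the Elasticsearch client and environment variables
Creating an index in Elastic Cloud Serverless
Loading the JSON files from the ./json directory into the pdf-chat index

Set up the imports for the Elasticsearch client and environment variables

The most important import is Elasticsearch. This class is responsible for connecting to Elastic Cloud Serverless and for creating and populating the pdf-chat index.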

import json
from dotenv import load_dotenv
from elasticsearch import Elasticsearch
from tqdm import tqdm
import os

load_dotenv()

ES_URL = os.getenv('ES_URL')
ES_API_KEY = os.getenv('ES_API_KEY')

es = Elasticsearch(hosts=ES_URL,api_key=ES_API_KEY, request_timeout=300)

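Create an index in Elastic Cloud Serverless

This code block creates an index named "pdf-chat" with the following mappings:

page_content - for testing RAG with full-text search
page_content_sparse - for testing RAG with sparse vectors
page_content_dense - for testing RAG with dense vectors
page_number - useful for building citations
pdf_file - useful for building citations

Note the use of copy_to and semantic_text. The copy_to setting copies page_content into two semantic_text fields. Each semantic_text field is mapped to an ML inference endpoint, one for sparse vectors and one for dense vectors. The Elastic-provided ML inference automatically splits each page into chunks of 250 tokens with a 100-token overlap.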

index_name= "pdf-chat"
index_body = {
    "mappings": {
        "properties": {

            "page_content": 
                {"type": "text", 
                    "copy_to": ["page_content_sparse",
                                "page_content_dense"]},
            
            "page_content_sparse": 
                {"type": "semantic_text", 
                    "inference_id": ".elser-2-elasticsearch"},
            
            "page_content_dense": 
                {"type": "semantic_text", 
                    "inference_id": ".multilingual-e5-small-elasticsearch"},
            
            "page_number": {"type": "text"},
            
            "pdf_file": {
                "type": "text", "fields": {"keyword": {"type": "keyword"}}
            }
        }
    }
}

if es.indices.exists(index=index_name):
    es.indices.delete(index=index_name)
    print(f"Index '{index_name}' deleted successfully.")

response = es.indices.create(index=index_name, body=index_body)
if 'acknowledged' in response and response['acknowledged']:
    print(f"Index '{index_name}' created successfully.")
elif 'error' in response:
    print(f"Failed to create: '{index_name}'") 
    print(f"Error: {response['error']['reason']}")
else:
    print(f"Index '{index_name}' already exists.")
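
The chunking behavior described above (250-token chunks with a 100-token overlap) can be pictured as a sliding window. The sketch below is only an illustration of that idea, not Elastic's implementation: it treats list items as stand-ins for tokens, and `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(tokens, chunk_size=250, overlap=100):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # each window starts this far after the previous one
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


# A 600-"token" page becomes four chunks; the last one is shorter.
words = [f"w{i}" for i in range(600)]
chunks = chunk_with_overlap(words)
print([len(chunk) for chunk in chunks])
# [250, 250, 250, 150]
```

The overlap means that text near a chunk boundary, such as a table row split across two windows, still appears whole in at least one chunk.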

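Load the JSON files from the `./json` directory into the `pdf-chat` index

This process takes a few minutes because we need to:

Load 402 pages of PDF data
Create sparse text embeddings for each page_content chunk
Create dense text embeddings for each page_content chunk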

files = os.listdir(output_folder_pdf)
with tqdm(total=len(files), desc="Indexing PDF docs") as pbar_files:
    for file in files:
        with open(output_folder_pdf + "/" + file) as f:
            data = json.loads(f.read())
        
        with tqdm(total=len(data), desc=f"Processing {file}") as pbar_pages:
            for page in data:
                doc = {
                    "page_content": page['content_text'],
                    "page_number": page['page_number'],
                    "pdf_file": page['pdf_file']
                }
                doc_id = f"{page['pdf_file']}_{page['page_number']}"
                es.index(index=index_name, id=doc_id, document=doc)
                pbar_pages.update(1)
        
        pbar_files.update(1)
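
The loop above sends one request per page. An alternative, offered here only as a sketch, is to build bulk actions and pass them to the `helpers.bulk` function of the Elasticsearch Python client. `generate_actions` is a hypothetical generator, and the chunk size in the usage comment is an untested assumption. Since most of the time is likely spent creating embeddings, the speedup may be modest.

```python
import json
import os


def generate_actions(json_folder, index_name):
    """Yield one bulk action per page record found in the JSON folder."""
    for name in sorted(os.listdir(json_folder)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(json_folder, name)) as f:
            pages = json.load(f)
        for page in pages:
            yield {
                "_index": index_name,
                # Same FILENAME_PAGENUMBER ID as the indexing loop
                "_id": f"{page['pdf_file']}_{page['page_number']}",
                "_source": {
                    "page_content": page["content_text"],
                    "page_number": page["page_number"],
                    "pdf_file": page["pdf_file"],
                },
            }


# Assumed usage with the `es` client created earlier (not run here):
# from elasticsearch import helpers
# helpers.bulk(es, generate_actions("./json", "pdf-chat"), chunk_size=20)
```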

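There is one last code trick worth mentioning. We set the Elastic document ID using the naming convention FILENAME_PAGENUMBER. This makes it easy to see the PDF file and page number associated with a citation when verifying results in Playground.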
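
Going the other way, a citation's document ID can be split back into its file name and page number. This is a minimal sketch. `parse_doc_id` is a hypothetical helper, and the page number in the example is made up.

```python
def parse_doc_id(doc_id):
    """Split a FILENAME_PAGENUMBER document ID into (pdf_file, page_number)."""
    # Split on the last underscore so underscores in the file name are preserved.
    pdf_file, _, page = doc_id.rpartition("_")
    if not pdf_file or not page.isdigit():
        raise ValueError(f"Not a FILENAME_PAGENUMBER id: {doc_id!r}")
    return pdf_file, int(page)


print(parse_doc_id("elastic-10Q-Q2-2025.pdf_17"))
# ('elastic-10Q-Q2-2025.pdf', 17)
```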

Elastic Cloud Serverless

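Elastic Cloud Serverless is an excellent choice for prototyping a new Retrieval-Augmented Generation (RAG) system because it provides fully managed, scalable infrastructure without the complexity of manual cluster management. It supports sparse and dense vector search out of the box, so you can experiment efficiently with different retrieval strategies. With built-in semantic text embeddings, relevance ranking, and hybrid search capabilities, Elastic Cloud Serverless shortens the iteration cycle for search-driven applications.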
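
As a rough sketch of what querying one of these retrieval strategies might look like outside Playground, the helper below builds a `semantic` query against one of the semantic_text fields defined in the index. `build_semantic_query` is a hypothetical helper, and the default result size is an arbitrary assumption.

```python
def build_semantic_query(field, question, size=5):
    """Build search parameters for a semantic query on a semantic_text field."""
    return {
        "size": size,
        "query": {"semantic": {"field": field, "query": question}},
    }


params = build_semantic_query(
    "page_content_sparse",
    "Compare/contrast subscription revenue for Q2-2025, Q1-2025, Q4-2024 and Q3-2024?",
)
print(params)

# Assumed usage with the `es` client created earlier (not run here):
# response = es.search(index="pdf-chat", **params)
```

Swapping the field name for page_content_dense targets the dense vectors instead, which makes it straightforward to compare the two strategies on the same question.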

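With Azure AI Document Intelligence and a bit of Python code, we are ready to see whether we can get an LLM to answer questions grounded in real data. Let's open Playground and run some manual A/B tests with different query strategies.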