AI Model Deployment: Serving PDF-to-Markdown Conversion with Triton + Marker

Preface

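In knowledge-base scenarios, PDF documents often need to be parsed so that knowledge can be retrieved through RAG. This article introduces marker, an open-source PDF-to-Markdown tool, and turns it into a service with Triton Inference Server.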

Outline

PDF parsing in knowledge-base scenarios
Marker overview and installation
Marker quick start
Serving with Triton

PDF parsing in knowledge-base scenarios

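PDF documents usually contain varied formatting, images, tables, and other elements. Because RAG depends heavily on standardized, accurate data, converting a PDF directly to plain text tends to lose or scramble the way the content is organized. A better approach is to convert the PDF to Markdown, which preserves more of the content's structural information.
Take the paper "Attention Is All You Need" as an example. The original PDF looks like this: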

Figure: the PDF of "Attention Is All You Need"

Converting it to plain text gives:

Attention Is All You Need
AshishVaswani∗ NoamShazeer∗ NikiParmar∗ JakobUszkoreit∗
GoogleBrain GoogleBrain GoogleResearch GoogleResearch
avaswani@google.com noam@google.com nikip@google.com usz@google.com
7102
LlionJones∗ AidanN.Gomez∗ † ŁukaszKaiser∗
GoogleResearch UniversityofToronto GoogleBrain
llion@google.com aidan@cs.toronto.edu lukaszkaiser@google.com
ceD
IlliaPolosukhin∗ ‡
illia.polosukhin@gmail.com 6
]LC.sc[
Abstract
Thedominantsequencetransductionmodelsarebasedoncomplexrecurrentor
convolutionalneuralnetworksthatincludeanencoderandadecoder. Thebest
performing models also connect the encoder and decoder through an attention
5v26730.6071:viXra

Converting it to Markdown gives:

# Attention Is All You Need
| Ashish Vaswani∗ Google Brain   |                                                 |
|--------------------------------|-------------------------------------------------|
| avaswani@google.com            | Noam Shazeer∗ Google Brain                      |
| noam@google.com                | Niki Parmar∗                                    |
| Google Research                |                                                 |
| nikip@google.com               | Jakob Uszkoreit∗ Google Research usz@google.com |
Łukasz Kaiser∗
Google Brain lukaszkaiser@google.com Aidan N. Gomez∗ †
University of Toronto aidan@cs.toronto.edu Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
## Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

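Plain text cannot restore the continuity of wrapped lines, so each line stays cut off from the next, whereas Markdown reassembles them into a complete paragraph. When the PDF layout is slightly more complex, plain-text extraction also merges completely unrelated text from different positions on the page: the "7102" in the example comes from the publication date printed along the left margin of the paper, which actually reads 2017. Finally, Markdown captures hierarchical structure such as tables and headings, which plain text does not. Overall, the Markdown output is of higher quality.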
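One practical payoff of keeping headings is that a RAG pipeline can chunk the document by section instead of by arbitrary character counts. The following is a minimal sketch of that idea, not part of marker; the function name split_by_heading is made up for illustration, and it assumes ATX-style headings ("#", "##", ...) and does not special-case "#" lines inside code fences.

```python
import re
from typing import List, Tuple

# Matches ATX headings such as "# Title" or "## Abstract".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> List[Tuple[str, str]]:
    """Return (heading, body) pairs; text before the first heading gets an empty heading."""
    sections: List[Tuple[str, str]] = []
    title, body = "", []
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the section collected so far before starting a new one.
            if title or any(item.strip() for item in body):
                sections.append((title, "\n".join(body).strip()))
            title, body = match.group(2).strip(), []
        else:
            body.append(line)
    if title or any(item.strip() for item in body):
        sections.append((title, "\n".join(body).strip()))
    return sections


sample = (
    "# Attention Is All You Need\n"
    "Ashish Vaswani, Noam Shazeer, ...\n"
    "## Abstract\n"
    "The dominant sequence transduction models are based on ...\n"
)
for heading, text in split_by_heading(sample):
    print(heading, "->", len(text), "characters")
```

Each (heading, body) pair can then be embedded as one retrieval unit, with the heading kept as metadata.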

Marker overview and installation

marker is an open-source Python project on GitHub. It converts PDF to Markdown with a pipeline that combines several models, covering the following steps:

OCR text extraction
Page layout and reading-order detection
Per-block cleaning and formatting
Merging and post-processing

marker can be installed with pip:

pip install marker-pdf

After installation, the conversion tool marker_single is available on the environment's PATH:

$ which marker_single 
/home/anaconda3/envs/my_env/bin/marker_single
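The same check can be done from Python, which is handy at the top of a deployment script. This is only a small sketch using the standard library; the helper name find_cli is an assumption introduced here.

```python
import shutil
from typing import Optional


def find_cli(name: str = "marker_single") -> Optional[str]:
    """Return the absolute path of the executable, or None if it is not on PATH."""
    return shutil.which(name)


path = find_cli()
if path:
    print("marker_single found at", path)
else:
    print("marker_single not found on PATH; check that the right environment is active")
```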

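In addition, marker runs a number of models to recognize and process the PDF, so several model files have to be downloaded. By default marker downloads them on first use; here we download all the required models from HuggingFace in advance and place them in a vikp directory. The required model files are: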

[root@xxxx vikp]# ls -lt
总用量 0
drwxr-xr-x 2 root root 132 5月  14 16:37 surya_order
drwxr-xr-x 2 root root  10 5月  14 15:35 order_bench
drwxr-xr-x 2 root root  10 5月  14 15:32 publaynet_bench
drwxr-xr-x 2 root root  10 5月  14 15:32 rec_bench
drwxr-xr-x 2 root root 229 5月  14 15:31 surya_rec
drwxr-xr-x 2 root root  10 5月  14 15:28 doclaynet_bench
drwxr-xr-x 2 root root  98 5月  14 15:27 surya_det_math
drwxr-xr-x 2 root root  98 5月  14 15:26 surya_det2
drwxr-xr-x 2 root root 159 5月  14 15:20 pdf_postprocessor_t5
drwxr-xr-x 2 root root  98 5月  14 15:18 surya_layout2
drwxr-xr-x 2 root root 319 5月  14 15:17 texify

Taking surya_order as an example, each of these models can be found on HuggingFace.

Figure: preparing the marker model downloads
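The offline download can also be scripted. The sketch below uses snapshot_download from the huggingface_hub package and mirrors the vikp/<name> layout shown in the listing. Two things are assumptions rather than facts from this article: that every directory maps to a HuggingFace repository named vikp/<name>, and that the *_bench directories are not needed for inference, which is why they are left out here.

```python
from pathlib import Path
from typing import Dict

# Names taken from the directory listing; the *_bench entries are omitted (assumption).
MODEL_NAMES = [
    "surya_det2",
    "surya_layout2",
    "surya_order",
    "surya_rec",
    "surya_det_math",
    "texify",
    "pdf_postprocessor_t5",
]


def plan_downloads(root: str = "vikp", org: str = "vikp") -> Dict[str, Path]:
    """Map each assumed HuggingFace repo id to its local target directory."""
    return {f"{org}/{name}": Path(root) / name for name in MODEL_NAMES}


def download_all(root: str = "vikp") -> None:
    """Download every planned repository into root/<name>."""
    # Imported lazily so the plan can be inspected without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    for repo_id, target in plan_downloads(root).items():
        target.mkdir(parents=True, exist_ok=True)
        snapshot_download(repo_id=repo_id, local_dir=str(target))


for repo_id, target in plan_downloads().items():
    print(repo_id, "->", target)
# download_all()  # uncomment on a machine with network access
```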

Marker quick start

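Run marker with the marker_single command. It takes the path of a single PDF document as input and writes to an output directory. First switch to the parent directory so that the vikp folder sits directly under the current working directory: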

root@1fc83e178b80:/home/marker-pdf/1# marker_single 606addeff4c0070ce300ff0adc88eceb.pdf ./ --batch_multiplier 2 --max_pages 1 --langs Chinese
Loading detection model vikp/surya_det2 on device cuda with dtype torch.float16
Loading detection model vikp/surya_layout2 on device cuda with dtype torch.float16
Loading reading order model vikp/surya_order on device cuda with dtype torch.float16
Loaded texify model to cuda with torch.float16 dtype
Detecting bboxes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00,  2.22s/it]
Detecting bboxes: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.16it/s]
Finding reading order: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.44s/it]
Saved markdown to the ./606addeff4c0070ce300ff0adc88eceb folder

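The log shows marker loading its detection, layout, reading-order, and texify models onto the GPU and then writing the result to the 606addeff4c0070ce300ff0adc88eceb directory, which contains the extracted images, the Markdown file, and a configuration file.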

Figure: contents of the output directory
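For batch jobs it can be convenient to drive the same command from Python. The sketch below only reuses the flags shown in the command above (--batch_multiplier, --max_pages, --langs). The helper names are made up, and locating the result by globbing for *.md in a folder named after the PDF is an assumption based on the "Saved markdown to ..." log line.

```python
import subprocess
from pathlib import Path
from typing import List


def build_marker_cmd(pdf: str, out_dir: str, max_pages: int = 1,
                     langs: str = "Chinese", batch_multiplier: int = 2) -> List[str]:
    """Build the marker_single argument list used in this article."""
    return [
        "marker_single", pdf, out_dir,
        "--batch_multiplier", str(batch_multiplier),
        "--max_pages", str(max_pages),
        "--langs", langs,
    ]


def convert(pdf: str, out_dir: str = "./", **options) -> str:
    """Run marker_single and return the Markdown text it wrote."""
    subprocess.run(build_marker_cmd(pdf, out_dir, **options), check=True)
    # Assumption: results land in <out_dir>/<pdf stem>/ with one .md file.
    result_dir = Path(out_dir) / Path(pdf).stem
    md_files = sorted(result_dir.glob("*.md"))
    if not md_files:
        raise FileNotFoundError(f"no Markdown file found in {result_dir}")
    return md_files[0].read_text(encoding="utf-8")


print(" ".join(build_marker_cmd("606addeff4c0070ce300ff0adc88eceb.pdf", "./")))
```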

Opening the Markdown file shows the recognized content:

## 创业板投资风险提示

本次股票发行后拟在创业板市场上市,该市场具有较高的投资风险。创业板公司 具有创新投入大、新旧产业融合成功与否存在不确定性、尚处于成长期、经营风险高、 业绩不稳定、退市风险高等特点,投资者面临较大的市场风险。投资者应充分了解创 业板市场的投资风险及本公司所披露的风险因素,审慎作出投资决定。

湖北亨迪药业股份有限公司

![0_image_0.png](0_image_0.png)

![0_image_1.png](0_image_1.png)

Hubei Biocause Heilen Pharmaceutical Co., Ltd.
(荆门高新区·掇刀区杨湾路 122 号)

![0_image_2.png](0_image_2.png)

![0_image_3.png](0_image_3.png)

首次公开发行股票并在创业板上市 招股说明书 保荐人(主承销商)
(中国(上海)自由贸易试验区商城路 618 号)

Serving with Triton

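The command line cannot be called across machines or from other languages, so marker needs to be exposed as a service. marker is essentially a pipeline of torch models, which makes it a good fit for Triton Inference Server, with an HTTP API exposed for callers.
First pull the Triton base image nvcr.io/nvidia/tritonserver:23.10-py3 with Docker, pip install the marker-pdf package inside a container, and commit the result as a new image, for example tritonserver:marker-pdf-env.
Then set up the model directory layout. Triton keeps all models under a model_repository directory; create a marker-pdf directory under model_repository with the following structure: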

[root@zx-61 marker-pdf]# tree
.
├── 1
│   ├── model.py
│   ├── vikp
│   │   ├── doclaynet_bench
│   │   ├── order_bench
│   │   ├── pdf_postprocessor_t5
│   │   ├── publaynet_bench
│   │   ├── rec_bench
│   │   ├── surya_det2
│   │   ├── surya_det_math
│   │   ├── surya_layout2
│   │   ├── surya_order
│   │   ├── surya_rec
│   │   └── texify
└── config.pbtxt

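The 1 directory is the model version. It holds the backend logic model.py and the vikp directory with the required OCR models. config.pbtxt is the model's serving configuration, which defines the inputs and outputs, device resources, and so on: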

[root@zx-61 marker-pdf]# cat config.pbtxt
name: "marker-pdf"
backend: "python"

max_batch_size: 0
input [
    {
        name: "text"
        dims: [ -1 ]
        data_type: TYPE_STRING
    },
    {
        name: "max_pages"
        dims: [ 1 ]
        data_type: TYPE_INT64
    }
]
output [
    {
        name: "output"
        dims: [ -1 ]
        data_type: TYPE_STRING
    }
]

instance_group [
{
  count: 1
  kind: KIND_GPU
  gpus: [ 0 ]
}
]

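The inputs are text and max_pages. text is the PDF's binary content as a base64-encoded string, and max_pages is the maximum number of pages marker converts, so setting max_pages to 5 converts the first 5 pages. The output field is output, which returns the Markdown string directly, without images or other artifacts.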
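The encoding contract for the text input can be illustrated with a small round trip. This is a sketch with hypothetical helper names (encode_pdf, decode_pdf); the check for a leading "%PDF" marker is an extra sanity check added here, not something the service in this article performs.

```python
import base64


def encode_pdf(pdf_bytes: bytes) -> str:
    """Client side: turn raw PDF bytes into the string sent as the text input."""
    return base64.b64encode(pdf_bytes).decode("utf-8")


def decode_pdf(text: str) -> bytes:
    """Server side: recover the PDF bytes and reject obviously wrong payloads."""
    pdf_bytes = base64.b64decode(text, validate=True)
    if not pdf_bytes.startswith(b"%PDF"):
        raise ValueError("payload does not look like a PDF")
    return pdf_bytes


fake_pdf = b"%PDF-1.7\n% demo bytes only, not a real document\n"
payload = encode_pdf(fake_pdf)
print(len(fake_pdf), "bytes ->", len(payload), "base64 characters")
assert decode_pdf(payload) == fake_pdf
```

Base64 inflates the payload by roughly a third, which is worth keeping in mind for large PDFs sent over HTTP.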
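In essence this wraps the marker_single command as a service. marker_single calls convert_single_pdf from marker.convert; with a small modification, the input becomes the base64-encoded PDF bytes and only the Markdown content is returned as output. model.py is as follows: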

import os

# Do not let the CUDA caching allocator split free blocks larger than 32 MB
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'max_split_size_mb:32'
# Set the work (cache) directory
os.environ['TRANSFORMERS_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"
os.environ['HF_MODULES_CACHE'] = os.path.dirname(os.path.abspath(__file__)) + "/work/"
# os.environ["CUDA_VISIBLE_DEVICES"] = '0,1,2'

import gc
import json
import base64

import torch
import numpy as np
from marker.convert import convert_single_pdf
from marker.logger import configure_logging
from marker.models import load_all_models

import triton_python_backend_utils as pb_utils

gc.collect()


class TritonPythonModel:
    def initialize(self, args):
        device = "cuda" if args["model_instance_kind"] == "GPU" else "cpu"
        device_id = args["model_instance_device_id"]
        self.device = f"{device}:{device_id}"
        # output config
        self.model_config = json.loads(args['model_config'])
        output_config = pb_utils.get_output_config_by_name(self.model_config, "output")
        self.output_response_dtype = pb_utils.triton_string_to_numpy(output_config['data_type'])
        # load model
        configure_logging()
        self.model_lst = load_all_models(device=self.device, dtype=torch.float16)

    def execute(self, requests):
        responses = []
        for request in requests:
            text = pb_utils.get_input_tensor_by_name(request, "text").as_numpy().astype("S")
            text = np.char.decode(text, "utf-8").tolist()[0]
            max_pages = pb_utils.get_input_tensor_by_name(request, "max_pages").as_numpy()[0]
            fname = base64.b64decode(text)
            full_text, images, out_meta = convert_single_pdf(fname, self.model_lst, max_pages=max_pages,
                                                             langs=['Chinese'], batch_multiplier=2)
            response = np.char.encode(np.array(full_text))
            response_output_tensor = pb_utils.Tensor("output", response.astype(self.output_response_dtype))

            final_inference_response = pb_utils.InferenceResponse(output_tensors=[response_output_tensor])
            responses.append(final_inference_response)

        return responses

    def finalize(self):
        print('Cleaning up...')

The core inference call is:

full_text, images, out_meta = convert_single_pdf(fname, self.model_lst, max_pages=max_pages,
                                                             langs=['Chinese'], batch_multiplier=2)

Here full_text is the Markdown result.
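The code above passes the decoded bytes straight to convert_single_pdf. If the marker version you have installed only accepts a file path there (an assumption worth checking against your version), one workaround is to spill the bytes to a temporary file first. The helper below is a hypothetical sketch using only the standard library.

```python
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def pdf_tempfile(pdf_bytes: bytes) -> Iterator[str]:
    """Write the bytes to a temporary .pdf file and delete it afterwards."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(pdf_bytes)
        yield path
    finally:
        os.remove(path)


with pdf_tempfile(b"%PDF-1.7 demo bytes") as path:
    # Inside execute(), the call would become convert_single_pdf(path, ...).
    print(path, os.path.getsize(path), "bytes")
```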
Next, start the Triton service. Note that -w sets the container's working directory to the directory that contains vikp:

docker run --rm --gpus=all --shm-size=1g -p18999:8000 -p18998:8001 -p18997:8002 \
-e PYTHONIOENCODING=utf-8 -w /models/marker-pdf/1 \
-v /home/model_repository/:/models \
tritonserver:marker-pdf-env \
tritonserver --model-repository=/models --model-control-mode explicit --load-model marker-pdf --log-format ISO8601
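Loading the models takes a while, so it can help to wait for readiness before sending the first request. The sketch below polls the standard Triton HTTP readiness endpoints (/v2/health/ready for the server, /v2/models/<name>/ready for a model). The helper names, timeout values, and the example address are assumptions; substitute your own host and the port mapped to 8000.

```python
import time
import urllib.error
import urllib.request


def ready_url(base: str, model: str = "") -> str:
    """Readiness URL for the whole server, or for one model when a name is given."""
    base = base.rstrip("/")
    return f"{base}/v2/models/{model}/ready" if model else f"{base}/v2/health/ready"


def wait_until_ready(base: str, model: str, timeout_s: float = 300.0,
                     interval_s: float = 5.0) -> bool:
    """Poll until the model reports ready; return False if the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(ready_url(base, model), timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not up yet, or model still loading
        time.sleep(interval_s)
    return False


print(ready_url("http://10.2.13.31:18999", "marker-pdf"))
# wait_until_ready("http://10.2.13.31:18999", "marker-pdf")  # run against a live server
```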

Once it has started, test the service with a Python request:

import time
import base64
import requests
import json

data = base64.b64encode(open("/home/桌面/论文/1706.03762.pdf", "rb").read()).decode("utf-8")
url = "http://10.2.13.31:18999/v2/models/marker-pdf/infer"
raw_data = {
    "inputs": [{"name": "text", "datatype": "BYTES", "shape": [1], "data": [data]},
               {"name": "max_pages", "datatype": "INT64", "shape": [1], "data": [1]}],
    "outputs": [{"name": "output", "shape": [1]}]
}
t1 = time.time()
res = requests.post(url, json.dumps(raw_data, ensure_ascii=True), headers={"Content-Type": "application/json"},
                    timeout=2000)
t2 = time.time()
print(t2 - t1)

print(res.json()["outputs"][0]["data"][0])

The client first reads the PDF file and converts it to a base64-encoded string. The printed response is:

# Attention Is All You Need
| Ashish Vaswani∗ Google Brain   |                                                 |
|--------------------------------|-------------------------------------------------|
| avaswani@google.com            | Noam Shazeer∗ Google Brain                      |
| noam@google.com                | Niki Parmar∗                                    |
| Google Research                |                                                 |
| nikip@google.com               | Jakob Uszkoreit∗ Google Research usz@google.com |
Łukasz Kaiser∗
Google Brain lukaszkaiser@google.com Aidan N. Gomez∗ †
University of Toronto aidan@cs.toronto.edu Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
## Abstract
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
## 1 Introduction
Recurrent neural networks, long short-term memory [13] and gated recurrent [7] neural networks in particular, have been firmly established as state of the art approaches in sequence modeling and
∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
†Work performed while at Google Brain.
‡Work performed while at Google Research.
| Llion Jones∗     |
|------------------|
| Google Research  |
| llion@google.com |
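The test client indexes straight into res.json()["outputs"], which raises an unhelpful KeyError when the server returns an error body instead. A slightly more defensive way to read the response might look like the sketch below. The helper name extract_markdown is made up; it assumes the usual Triton HTTP convention of an "error" field on failure and an "outputs" list on success.

```python
from typing import Any, Dict


def extract_markdown(payload: Dict[str, Any], output_name: str = "output") -> str:
    """Pull the Markdown string out of an inference response, or raise a clear error."""
    if "error" in payload:
        raise RuntimeError(f"Triton returned an error: {payload['error']}")
    for tensor in payload.get("outputs", []):
        if tensor.get("name") == output_name and tensor.get("data"):
            return tensor["data"][0]
    raise KeyError(f"no output tensor named {output_name!r} in the response")


ok_response = {"outputs": [{"name": "output", "datatype": "BYTES",
                            "shape": [1], "data": ["# Attention Is All You Need"]}]}
print(extract_markdown(ok_response))
```

In the client above this would replace the final print with print(extract_markdown(res.json())).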
