【领域】百度OCR识别

news2026/2/14 5:50:44

一、定义

OCR（Optical Character Recognition，光学字符识别）是计算机视觉重要方向之一。传统定义的OCR一般面向扫描文档类对象，现在我们常说的OCR一般指场景文字识别（Scene Text Recognition，STR），主要面向自然场景。

二、特性

支持多种 OCR 相关前沿算法，在此基础上打造产业级特色模型PP-OCR、PP-Structure和PP-ChatOCRv2，并打通数据生产、模型训练、压缩、预测部署全流程。
在这里插入图片描述

三、任务

文本检测
文本识别
端到端文本识别
文档分析
PPOCR主要应用于图片中的文字、数字识别，PPstru主要适用于文档级别的页面识别

四、模型

PP-OCR中英文模型

定义

除输入输出外，PP-OCR核心框架包含了3个模块，分别是：文本检测模块、检测框矫正模块、文本识别模块。

文本检测模块：核心是一个基于DB检测算法训练的文本检测模型，检测出图像中的文字区域
检测框矫正模块：将检测到的文本框输入检测框矫正模块，在这一阶段，将四点表示的文本框矫正为矩形框，方便后续进行文本识别，另一方面会进行文本方向判断和校正，例如如果判断文本行是倒立的情况，则会进行转正，该功能通过训练一个文本方向分类器实现
文本识别模块：最后文本识别模块对矫正后的检测框进行文本识别，得到每个文本框内的文字内容，PP-OCR中使用的经典文本识别算法CRNN

PP-OCR模型分为mobile版（轻量版）和server版（通用版），其中mobile版模型主要基于轻量级骨干网络MobileNetV3进行优化，优化后模型（检测模型+文本方向分类模型+识别模型）大小仅8.1M，CPU上平均单张图像预测耗时350ms，T4 GPU上约110ms，裁剪量化后，可在精度不变的情况下进一步压缩到3.5M，便于端侧部署，在骁龙855上测试预测耗时仅260ms。更多的PP-OCR评估数据可参考benchmark。

代码使用

中英文与多语言使用

通过Python脚本使用PaddleOCR whl包，whl包会自动下载ppocr轻量级模型作为默认模型。
检测+方向分类器+识别全流程：

from paddleocr import PaddleOCR, draw_ocr

# Paddleocr目前支持的多语言语种可以通过修改lang参数进行切换
# 例如`ch`, `en`, `fr`, `german`, `korean`, `japan`
ocr = PaddleOCR(use_angle_cls=True, lang="ch")  # need to run only once to download and load model into memory
img_path = './imgs/11.jpg'
result = ocr.ocr(img_path, cls=True)
for idx in range(len(result)):
    res = result[idx]
    for line in res:
        print(line)

# 显示结果
from PIL import Image
result = result[0]
image = Image.open(img_path).convert('RGB')
boxes = [line[0] for line in result]
txts = [line[1][0] for line in result]
scores = [line[1][1] for line in result]
im_show = draw_ocr(image, boxes, txts, scores, font_path='./fonts/simfang.ttf')
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')

如果输入是PDF文件，那么可以参考下面代码进行可视化：

from paddleocr import PaddleOCR, draw_ocr

# Paddleocr目前支持的多语言语种可以通过修改lang参数进行切换
# 例如`ch`, `en`, `fr`, `german`, `korean`, `japan`
PAGE_NUM = 10 # 将识别页码前置作为全局，防止后续打开pdf的参数和前文识别参数不一致 / Set the recognition page number
pdf_path = 'default.pdf'
ocr = PaddleOCR(use_angle_cls=True, lang="ch", page_num=PAGE_NUM)  # need to run only once to download and load model into memory
# ocr = PaddleOCR(use_angle_cls=True, lang="ch", page_num=PAGE_NUM,use_gpu=0) # 如果需要使用GPU，请取消此行的注释 并注释上一行 / To Use GPU,uncomment this line and comment the above one.
result = ocr.ocr(pdf_path, cls=True)
for idx in range(len(result)):
    res = result[idx]
    if res == None: # 识别到空页就跳过，防止程序报错 / Skip when empty result detected to avoid TypeError:NoneType
        print(f"[DEBUG] Empty page {idx+1} detected, skip it.")
        continue
    for line in res:
        print(line)
# 显示结果
import fitz
from PIL import Image
import cv2
import numpy as np
imgs = []
with fitz.open(pdf_path) as pdf:
    for pg in range(0, PAGE_NUM):
        page = pdf[pg]
        mat = fitz.Matrix(2, 2)
        pm = page.get_pixmap(matrix=mat, alpha=False)
        # if width or height > 2000 pixels, don't enlarge the image
        if pm.width > 2000 or pm.height > 2000:
            pm = page.get_pixmap(matrix=fitz.Matrix(1, 1), alpha=False)
        img = Image.frombytes("RGB", [pm.width, pm.height], pm.samples)
        img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
        imgs.append(img)
for idx in range(len(result)):
    res = result[idx]
    if res == None:
        continue
    image = imgs[idx]
    boxes = [line[0] for line in res]
    txts = [line[1][0] for line in res]
    scores = [line[1][1] for line in res]
    im_show = draw_ocr(image, boxes, txts, scores, font_path='doc/fonts/simfang.ttf')
    im_show = Image.fromarray(im_show)
    im_show.save('result_page_{}.jpg'.format(idx))

要使用滑动窗口进行光学字符识别（OCR），可以使用以下代码片段：

from paddleocr import PaddleOCR
from PIL import Image, ImageDraw, ImageFont

# 初始化OCR引擎
ocr = PaddleOCR(use_angle_cls=True, lang="en")

img_path = "./very_large_image.jpg"
slice = {'horizontal_stride': 300, 'vertical_stride': 500, 'merge_x_thres': 50, 'merge_y_thres': 35}
results = ocr.ocr(img_path, cls=True, slice=slice)

# 加载图像
image = Image.open(img_path).convert("RGB")
draw = ImageDraw.Draw(image)
font = ImageFont.truetype("./doc/fonts/simfang.ttf", size=20)  # 根据需要调整大小

# 处理并绘制结果
for res in results:
    for line in res:
        box = [tuple(point) for point in line[0]]
        # 找出边界框
        box = [(min(point[0] for point in box), min(point[1] for point in box)),
               (max(point[0] for point in box), max(point[1] for point in box))]
        txt = line[1][0]
        draw.rectangle(box, outline="red", width=2)  # 绘制矩形
        draw.text((box[0][0], box[0][1] - 25), txt, fill="blue", font=font)  # 在矩形上方绘制文本

# 保存结果
image.save("result.jpg")

PP-Structure文档分析模型

定义

PP-Structure支持版面分析（layout analysis）、表格识别（table recognition）、文档视觉问答（DocVQA）三种子任务。
PP-Structure核心功能点如下：

支持对图片形式的文档进行版面分析，可以划分文字、标题、表格、图片以及列表5类区域（与Layout-Parser联合使用）
支持文字、标题、图片以及列表区域提取为文字字段（与PP-OCR联合使用）
支持表格区域进行结构化分析，最终结果输出Excel文件
支持Python whl包和命令行两种方式，简单易用
支持版面分析和表格结构化两类任务自定义训练
支持VQA任务-SER和RE

代码使用

图像方向分类+版面分析+表格识别

import os
import cv2
from paddleocr import PPStructure,draw_structure_result,save_structure_res

table_engine = PPStructure(show_log=True, image_orientation=True)

save_folder = './output'
img_path = 'ppstructure/docs/table/1.png'
img = cv2.imread(img_path)
result = table_engine(img)
save_structure_res(result, save_folder,os.path.basename(img_path).split('.')[0])

for line in result:
    line.pop('img')
    print(line)

from PIL import Image

font_path = 'doc/fonts/simfang.ttf' # PaddleOCR下提供字体包
image = Image.open(img_path).convert('RGB')
im_show = draw_structure_result(image, result,font_path=font_path)
im_show = Image.fromarray(im_show)
im_show.save('result.jpg')