Unlocking the Superpowers of PaddleOCR

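Optical character recognition (OCR) is a powerful technology that lets machines recognize and extract text from images or scanned documents. OCR is used across many fields, including document digitization, text extraction from images, and text-based data analysis. In this article, we explore how to use PaddleOCR, an advanced deep-learning-based OCR toolkit, for text detection and recognition tasks. We walk through the code step by step to show the whole process.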

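1. Prerequisites

Before we dive into the code, let's make sure everything needed to run the PaddleOCR library is in place. Make sure the following prerequisites are installed on your machine:

Python (3.6 or later)
The PaddleOCR library
Other required dependencies (for example, the PaddlePaddle framework, NumPy, and pandas)

You can install PaddleOCR with the following pip command: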

pip install paddleocr

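2. Setting Up PaddleOCR

Once Python and the required libraries are installed, we can set up PaddleOCR. PaddleOCR provides pretrained models for both text detection and text recognition, so you can use them directly.

Code overview

The code for text detection and recognition with PaddleOCR has the following main components:

Image preprocessing: load the input image and apply any necessary preprocessing steps, such as resizing or normalization.
Text detection: use the PaddleOCR text detection model to locate the bounding boxes of text regions in the input image.
Text recognition: for each detected bounding box, use the PaddleOCR text recognition model to extract the corresponding text.
Post-processing: organize the detected regions and the recognized text for further analysis or display.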
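The four stages above can be sketched in a few lines. This is a minimal illustration, not the code from the article's repository: it assumes the PaddleOCR 2.x Python API, where `PaddleOCR(...).ocr(path, cls=...)` returns one list per image and each entry has the form `[box, (text, score)]`. The helper names `run_ocr` and `flatten_result` are hypothetical.

```python
def run_ocr(image_path):
    """Detect and recognize text in one image (assumes the PaddleOCR 2.x API)."""
    # Imported lazily so the post-processing helper below works without PaddleOCR.
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=False, lang="en")  # loads pretrained det + rec models
    return ocr.ocr(image_path, cls=False)  # preprocessing, detection, recognition


def flatten_result(result):
    """Post-processing: turn the nested result into a flat list of dictionaries."""
    rows = []
    for page in result or []:
        # A page can be None when no text was found in the image.
        for box, (text, score) in page or []:
            rows.append({"text": text, "score": float(score), "box": box})
    return rows
```

Calling `flatten_result(run_ocr("sample.jpg"))` would then give one dictionary per detected text region, which is a convenient shape for building a table or comparing against ground truth.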

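3. Step-by-Step Implementation

Let's break the code down and explain each step in detail:

Text detection

The code below is part of a class named DecMain, which is designed for evaluating optical character recognition (OCR) against ground-truth data. It uses PaddleOCR to extract text from images and then computes metrics such as precision, recall, and character error rate (CER) to evaluate the OCR system's performance.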

class DecMain:
    def __init__(self, image_folder_path, label_file_path, output_file):
        self.image_folder_path = image_folder_path
        self.label_file_path = label_file_path
        self.output_file = output_file

    def run_dec(self):
        # Check and update the ground truth file
        CheckAndUpdateGroundTruth(self.label_file_path).check_and_update_ground_truth_file()

        # Run detection and recognition and collect the results in a DataFrame
        df = OcrToDf(image_folder=self.image_folder_path, label_file=self.label_file_path,
                     det=True, rec=True, cls=False).ocr_to_df()

        # Load the ground truth annotations
        ground_truth_data = ReadGroundTruthFile(self.label_file_path).read_ground_truth_file()

        # Get the extracted text as a list of dictionaries (representing the OCR results)
        ocr_results = df.to_dict(orient="records")

        # Calculate precision, recall, and CER
        precision, recall, total_samples = CalculateMetrics(ground_truth_data, ocr_results).calculate_precision_recall()

        # Write the OCR results and the metrics to the output sheet
        CreateSheet(dataframe=df, precision=precision, recall=recall, total_samples=total_samples,
                    file_name=self.output_file).create_sheet()

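Let's break down the code and explain each part:

class DecMain:
    def __init__(self, image_folder_path, label_file_path, output_file):
        self.image_folder_path = image_folder_path
        self.label_file_path = label_file_path
        self.output_file = output_file

The DecMain class has an __init__ method that initializes the object with the following parameters:

image_folder_path: path to the folder containing the input images for OCR.
label_file_path: path to the ground-truth label file containing the actual text content of the images.
output_file: name of the output file where the evaluation results are saved.

def run_dec(self):
    # Check and update the ground truth file
    CheckAndUpdateGroundTruth(self.label_file_path).check_and_update_ground_truth_file()

The run_dec method runs the OCR evaluation process. It first uses the CheckAndUpdateGroundTruth class to check and update the ground-truth label file.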

df = OcrToDf(image_folder=self.image_folder_path, label_file=self.label_file_path, det=True, rec=True, cls=False).ocr_to_df()

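The OcrToDf class converts the OCR results into a pandas DataFrame (`df`). It takes the following parameters:

image_folder: path to the folder containing the input images for OCR.
label_file: path to the ground-truth label file.
det=True and rec=True: indicate that the DataFrame will contain both text detection and text recognition results.
cls=False: presumably disables the text-angle classification step, mirroring the PaddleOCR option of the same name.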

ground_truth_data = ReadGroundTruthFile(self.label_file_path).read_ground_truth_file()

The ReadGroundTruthFile class reads the ground-truth label file and loads its contents into the ground_truth_data variable.

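# Get the extracted text as a list of dictionaries (representing the OCR results)
ocr_results = df.to_dict(orient="records")

The OCR results in the DataFrame df are converted into a list of dictionaries (ocr_results), one per DataFrame row, so that each dictionary represents the OCR result for a single image.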

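# Calculate precision, recall, and CER
precision, recall, total_samples = CalculateMetrics(ground_truth_data, ocr_results).calculate_precision_recall()

The CalculateMetrics class computes the OCR evaluation metrics: precision, recall, and the total number of samples evaluated. It takes the ground-truth data and the OCR results as input.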
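The article does not show how CalculateMetrics defines precision and recall, so the following is only one plausible reading: a predicted text counts as correct when it exactly matches a ground-truth transcription for the same image. The function name and the input shape (a dictionary mapping image names to lists of strings) are assumptions made for this sketch.

```python
from collections import Counter


def precision_recall(ground_truth, predictions):
    """Exact-match precision and recall over text regions (illustrative definition).

    Both arguments map an image name to a list of text strings.
    """
    true_positives = predicted_total = truth_total = 0
    for name, truth_texts in ground_truth.items():
        predicted_texts = predictions.get(name, [])
        # Multiset intersection: each ground-truth string can be matched only once.
        overlap = Counter(truth_texts) & Counter(predicted_texts)
        true_positives += sum(overlap.values())
        predicted_total += len(predicted_texts)
        truth_total += len(truth_texts)
    precision = true_positives / predicted_total if predicted_total else 0.0
    recall = true_positives / truth_total if truth_total else 0.0
    return precision, recall, len(ground_truth)
```

Exact matching is strict: a single misread character makes the whole region count as wrong, which is why a character-level metric such as CER is often reported alongside it.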

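CreateSheet(dataframe=df, precision=precision, recall=recall, total_samples=total_samples,
            file_name=self.output_file).create_sheet()

The CreateSheet class creates the output sheet (for example, Excel or CSV) containing the evaluation metrics and the OCR results. It takes the DataFrame `df`, the precision, the recall, the total number of samples, and the output file name as input.

Overall, the DecMain class offers a structured way to evaluate an OCR model's performance using ground-truth data and PaddleOCR's text detection and recognition capabilities. It computes the key evaluation metrics and stores the results in the specified output file for further analysis.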
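To give a feel for the kind of report CreateSheet produces, here is a small sketch that writes per-image results followed by a metrics summary. It is not the repository's implementation: it uses Python's standard csv module instead of an Excel writer, and the function name, column names, and row layout are all assumptions.

```python
import csv


def write_report(rows, precision, recall, total_samples, stream):
    """Write OCR rows, then a blank line, then a one-line metrics summary as CSV."""
    writer = csv.writer(stream)
    writer.writerow(["image", "text"])
    for row in rows:
        writer.writerow([row["image"], row["text"]])
    writer.writerow([])  # blank separator line between results and metrics
    writer.writerow(["precision", "recall", "total_samples"])
    writer.writerow([f"{precision:.4f}", f"{recall:.4f}", total_samples])
```

When writing to a real file, open it with `open(path, "w", newline="")` so the csv module controls the line endings.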

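Note: format of the ground-truth label file

To run an OCR evaluation with the DecMain class and the code above, the ground-truth label file must be formatted correctly. Each line pairs an image file name with a JSON-style list of annotations, structured as follows: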

image_name.jpg [{"transcription": "215mm 18", "points": [[199, 6], [357, 6], [357, 33], [199, 33]], "difficult": False, "key_cls": "digits"}, {"transcription": "XZE SA", "points": [[15, 6], [140, 6], [140, 36], [15, 36]], "difficult": False, "key_cls": "text"}]

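Each line of the file holds the OCR ground truth for one image: the image's file name, followed by a JSON-style list with one object per text region in that image. The sample above writes the boolean as False, Python style; strict JSON spells it false, so use whichever form your label parser accepts.

Each object should have the following keys:

"transcription": the ground-truth text of the region.
"points": a list of four points giving the bounding-box coordinates of the text region in the image.
"difficult": a boolean indicating whether the text region is hard to recognize.
"key_cls": the category label of the annotation, such as "digits" or "text".

Follow this format when creating ground-truth label files so that the OCR model's performance is evaluated accurately.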
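A label line in this format can be parsed with the standard library alone. The sketch below is an assumption about how such a reader might look, not the repository's ReadGroundTruthFile: it splits on the first run of whitespace (so it would not handle file names containing spaces), tries strict JSON first, and falls back to `ast.literal_eval` for Python-style literals such as `False`.

```python
import ast
import json

REQUIRED_KEYS = {"transcription", "points", "difficult", "key_cls"}


def parse_label_line(line):
    """Split one label line into (image_name, list of region dictionaries)."""
    name, payload = line.strip().split(None, 1)
    try:
        regions = json.loads(payload)  # strict JSON: true / false / null
    except json.JSONDecodeError:
        regions = ast.literal_eval(payload)  # Python literals: True / False / None
    for region in regions:
        missing = REQUIRED_KEYS - region.keys()
        if missing:
            raise ValueError(f"{name}: missing keys {sorted(missing)}")
        if len(region["points"]) != 4:
            raise ValueError(f"{name}: expected 4 points per region")
    return name, regions
```

Validating the keys and the point count at load time surfaces malformed annotations before they silently distort the metrics.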

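Text recognition

The code below defines a class named RecMain, which runs text recognition (OCR) over a folder of images with a pretrained OCR model and generates an evaluation Excel sheet.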

class RecMain:
    def __init__(self, image_folder, rec_file, output_file):
        self.image_folder = image_folder
        self.rec_file = rec_file
        self.output_file = output_file

    def run_rec(self):
        # Collect the paths of the images to evaluate
        image_paths = GetImagePathsFromFolder(self.image_folder, self.rec_file).get_image_paths_from_folder()

        # Load the pretrained recognition model
        ocr_model = LoadRecModel().load_model()

        # Run recognition on every image
        results = ProcessImages(ocr=ocr_model, image_paths=image_paths).process_images()

        # Load the ground truth as a dictionary
        ground_truth_data = ConvertTextToDict(self.rec_file).convert_txt_to_dict()

        # Compare the predictions with the ground truth and compute the metrics
        (model_predictions, ground_truth_texts, image_names, precision, recall,
         overall_model_precision, overall_model_recall, cer_data_list) = EvaluateRecModel(
            results, ground_truth_data).evaluate_model()

        # Create Excel sheet
        CreateMetricExcel(image_names, model_predictions, ground_truth_texts,
                          precision, recall, cer_data_list, overall_model_precision, overall_model_recall,
                          self.output_file).create_excel_sheet()

Let's break down the code and explain each part:

class RecMain:
    def __init__(self, image_folder, rec_file, output_file):
        self.image_folder = image_folder
        self.rec_file = rec_file
        self.output_file = output_file

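The RecMain class has an __init__ method that initializes the object with the following parameters:

image_folder: path to the folder containing the input images for text recognition.
rec_file: path to the ground-truth label file containing the actual text content of the images.
output_file: file name of the output Excel sheet where the evaluation results are saved.

def run_rec(self):
    image_paths = GetImagePathsFromFolder(self.image_folder, self.rec_file).get_image_paths_from_folder()

The run_rec method runs the text recognition process. It first uses the GetImagePathsFromFolder class to get the list of image paths inside the specified image_folder. This step ensures that the OCR model processes all the images in the given directory.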
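Collecting the image paths could look roughly like the sketch below. It is an assumption, not the repository's GetImagePathsFromFolder: the function name, the set of accepted file extensions, and the optional filter on labelled file names (a guess at why the class also receives rec_file) are all hypothetical.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to match your dataset.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".bmp"}


def get_image_paths(folder, labelled_names=None):
    """Return the sorted image paths in `folder`, optionally keeping only labelled ones."""
    paths = sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    if labelled_names is not None:
        # Skip images that have no ground-truth entry, since they cannot be scored.
        paths = [p for p in paths if p.name in labelled_names]
    return paths
```

Sorting the paths keeps the evaluation order stable between runs, which makes the resulting sheets easier to compare.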

ocr_model = LoadRecModel().load_model()

The LoadRecModel class loads the pretrained OCR model used for text recognition. It may use PaddleOCR or another OCR library to load the model.

results = ProcessImages(ocr=ocr_model, image_paths=image_paths).process_images()

The ProcessImages class runs the loaded OCR model over the images. It takes the OCR model (ocr_model) and the list of image paths (image_paths) as input.

ground_truth_data = ConvertTextToDict(self.rec_file).convert_txt_to_dict()

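The ConvertTextToDict class reads the ground-truth label file and converts it into a dictionary (ground_truth_data). This conversion prepares the ground-truth data for comparison with the OCR model's predictions.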

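(model_predictions, ground_truth_texts, image_names, precision, recall,
 overall_model_precision, overall_model_recall, cer_data_list) = EvaluateRecModel(
    results, ground_truth_data).evaluate_model()

The EvaluateRecModel class compares the OCR model's predictions with the ground-truth data and computes evaluation metrics such as precision, recall, and character error rate (CER). It takes the OCR model's predictions (results) and the ground-truth data (ground_truth_data) as input.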
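Character error rate is commonly defined as the edit (Levenshtein) distance between the predicted text and the ground-truth text, divided by the length of the ground-truth text. The sketch below implements that standard definition; whether EvaluateRecModel computes CER in exactly this way is an assumption, and the handling of an empty reference is a convention chosen here.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep when equal
            ))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate of `hypothesis` measured against `reference`."""
    if not reference:
        # Convention for this sketch: an empty reference is only matched by empty output.
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, reading "215mm 18" as "215mm 1B" is one substitution over eight reference characters, giving a CER of 0.125. Because insertions count as errors, CER can exceed 1.0 when the prediction is much longer than the reference.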

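# Create Excel sheet
CreateMetricExcel(image_names, model_predictions, ground_truth_texts,
                  precision, recall, cer_data_list, overall_model_precision, overall_model_recall,
                  self.output_file).create_excel_sheet()

The CreateMetricExcel class creates the output Excel sheet containing the evaluation metrics and the OCR results. It takes several inputs, including the image names, the model predictions, the ground-truth texts, the evaluation metrics, and the output file name (self.output_file).

In short, the RecMain class organizes the whole text recognition process, from loading the OCR model to generating an evaluation Excel sheet with detailed metrics. It provides a structured, reusable way to evaluate an OCR model's performance on a given set of images.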

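Note: format of the ground-truth text file

When running an OCR evaluation with the RecMain class and the code above, it is essential to format the ground-truth (GT) text file correctly. The GT text file should use the following format: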

image_name.jpg text

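Each line of the file holds the GT text for one image.

Each line contains the image's file name, followed by a tab character (\t), and then the GT text for that image.

Make sure the GT text file has an entry for every image in the image folder, and that each GT text matches the actual text in its image. This format is required to evaluate the OCR model's performance accurately.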
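Reading this tab-separated format into a dictionary takes only a few lines. The sketch below is an assumption about what a reader like ConvertTextToDict might do, not the repository's code; the function name and the choice to skip blank lines are hypothetical.

```python
def gt_lines_to_dict(lines):
    """Map image file name -> ground-truth text from tab-separated lines."""
    ground_truth = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # ignore blank lines
        # Split on the first tab only, so the GT text may itself contain tabs or spaces.
        name, _, text = line.partition("\t")
        ground_truth[name] = text
    return ground_truth
```

With a real file, this can be called as `gt_lines_to_dict(open(path, encoding="utf-8"))`, since a file object iterates over its lines.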

You can find the source code here:

https://github.com/vinodbaste/paddleOCR_rec_dec?source=post_page

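Conclusion

We explored how to use the deep-learning-based PaddleOCR toolkit for text detection and recognition, and walked through the implementation of both steps. With PaddleOCR's strong pretrained models and easy-to-use API, running OCR on images becomes much easier.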