【Python: Extracting Text from MP4 Video】

news2026/2/16 0:28:58


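■ pip and setuptools
■ OpenCV and Tesseract
■ Tesseract OCR V5.0 installation guide (Windows)
- ■ 1. Running the program fails because Tesseract OCR is not installed
- ■ 2. Download Tesseract-OCR
- ■ 3. Install Tesseract-OCR
- ■ 4. Add the install directory to the system PATH environment variable
- ■ 5. Add a TESSDATA_PREFIX variable
- ■ 6. Open a terminal and run tesseract -v to see the version information
- ■ 7. In pytesseract.py inside the pytesseract package, find tesseract_cmd = 'tesseract' and change it to tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
- ■ 8. Run the program again
■ Code to run
■ Code to run 2 (generated by OpenAI)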

■ pip and setuptools

First, upgrade pip and setuptools to the latest versions:

python.exe -m pip install --upgrade pip setuptools
or
pip install --upgrade pip setuptools

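■ OpenCV and Tesseract

Install OpenCV and pytesseract (the Python wrapper for Tesseract) with the following commands. The package to install is pytesseract, which is the module the code below imports; the Tesseract OCR engine itself is installed separately, as described in the next section. The second script also uses pandas.

pip install opencv-python
pip install pytesseract
pip install pandas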
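
To confirm that the Python packages are in place before running the scripts, a small check like the following can help. This is only a sketch: the function name missing_packages and the module-to-package mapping are illustrative and not part of any library.

```python
import importlib.util

# Import names used by the scripts in this post, mapped to their pip package names.
REQUIRED_MODULES = {
    "cv2": "opencv-python",
    "pytesseract": "pytesseract",
    "pandas": "pandas",
}


def missing_packages(modules=REQUIRED_MODULES):
    """Return the pip package names whose import module cannot be found."""
    return [
        pip_name
        for module, pip_name in modules.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

Note that this only checks the Python side. It says nothing about whether the Tesseract OCR engine is installed.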

■ Tesseract OCR V5.0 installation guide (Windows)

■ 1. Running the program fails because Tesseract OCR is not installed

■ 2. Download Tesseract-OCR

Official site: https://github.com/tesseract-ocr/tesseract
Official documentation: https://github.com/tesseract-ocr/tessdoc
Language data: https://github.com/tesseract-ocr/tessdata
Download: https://digi.bib.uni-mannheim.de/tesseract/


■ 3. Install Tesseract-OCR

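■ 4. Add the install directory to the system PATH environment variable

Add the Tesseract install directory (C:\Program Files\Tesseract-OCR in my case) to Path under the system variables.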

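■ 5. Add a TESSDATA_PREFIX variable

Add a new variable named TESSDATA_PREFIX and set its value to the tessdata folder under the install path, C:\Program Files\Tesseract-OCR\tessdata in my case. This tells Tesseract where the language data files are stored.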
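
To see which languages that folder actually provides, the .traineddata files can be listed from Python. The sketch below assumes the variable is set as described above; installed_languages is a hypothetical helper name, not a pytesseract function.

```python
import os
from pathlib import Path


def installed_languages(tessdata_dir=None):
    """Return the language codes that have a .traineddata file in tessdata_dir.

    Falls back to the TESSDATA_PREFIX environment variable when no directory is given.
    """
    tessdata_dir = tessdata_dir or os.environ.get("TESSDATA_PREFIX")
    if not tessdata_dir or not Path(tessdata_dir).is_dir():
        return []
    # "eng.traineddata" -> "eng", "chi_sim.traineddata" -> "chi_sim"
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


if __name__ == "__main__":
    print(installed_languages())
```

If the video contains Chinese text (an assumption, since the post does not say), chi_sim would need to appear in this list, and lang='chi_sim' would need to be passed to pytesseract.image_to_string. The file can be downloaded from the language data repository linked in step 2.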

■ 6. Open a terminal and run tesseract -v to see the version information
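
The same check can be done from Python, which is handy when the script runs in a different environment than the terminal. This is a minimal sketch using only the standard library; tesseract_version is an illustrative name.

```python
import shutil
import subprocess


def tesseract_version(executable="tesseract"):
    """Return the first line of `<executable> --version`, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    result = subprocess.run([path, "--version"], capture_output=True, text=True)
    # Some older releases print the version to stderr instead of stdout.
    lines = (result.stdout or result.stderr).splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    version = tesseract_version()
    print(version if version else "tesseract was not found on PATH")
```

A None result suggests that step 4 did not take effect in the current session; opening a new terminal after changing the environment variables usually helps.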

■ 7. In pytesseract.py inside the pytesseract package, find tesseract_cmd = 'tesseract' and change it to tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

On my machine, pytesseract.py is located in:
D:\software\Python\Python312\Lib\site-packages\pytesseract
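
An edit made under site-packages is lost whenever pytesseract is upgraded or reinstalled. As an alternative, the path can be set from your own script through pytesseract.pytesseract.tesseract_cmd, which is what the second script in this post does. The sketch below wraps that in two hypothetical helpers (first_existing and configure_pytesseract); the candidate list contains only the install path used in this post.

```python
from pathlib import Path

# Candidate locations; the only entry is the install path used in this post.
DEFAULT_CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
]


def first_existing(candidates=DEFAULT_CANDIDATES):
    """Return the first candidate path that exists on disk, or None."""
    for candidate in candidates:
        if Path(candidate).is_file():
            return candidate
    return None


def configure_pytesseract(candidates=DEFAULT_CANDIDATES):
    """Set tesseract_cmd for this process only; no library file is edited."""
    import pytesseract  # imported here so first_existing works without it

    cmd = first_existing(candidates)
    if cmd is not None:
        pytesseract.pytesseract.tesseract_cmd = cmd
    return cmd
```

Calling configure_pytesseract() once at the top of a script, before any OCR call, would have the same effect as the manual edit in this step.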

■ 8. Run the program again

The program now runs, but the extracted text does not meet expectations.

■ Code to run

import cv2
import pytesseract
import re

# Read the video file and return all of its frames as a list
def extract_frames(video_path):
    cap = cv2.VideoCapture(video_path)
    
    frames = []
    
    while cap.isOpened():
        ret, frame = cap.read()
        
        if not ret:
            break
        
        frames.append(frame)
    
    cap.release()
    
    return frames

# Run OCR on a single frame
def recognize_text(image):
    text = pytesseract.image_to_string(image)
    
    return text

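# Clean the OCR output: strip all whitespace and every character that is not
# a letter, digit, or underscore (note that this also removes the spaces between words)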
def process_text(text):
    processed_text = re.sub(r'\s+|[^\w\s]', '', text)
    
    return processed_text

# video_path = r'E:\PythonProject\tcipy\1111111111.mp4'  # raw string, so \t and \1 are not treated as escapes
video_path = '1111111111.mp4'

frames = extract_frames(video_path)

for frame in frames:
    text = recognize_text(frame)
    processed_text = process_text(text)
    print(processed_text)

print('OK')
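
The script above keeps every frame in memory and runs OCR on all of them, which is slow and prints the same text many times when consecutive frames look alike. One possible refinement, sketched below, is to read frames lazily, keep only every step-th frame, and drop repeated results. The names sample_frames and dedupe_consecutive and the default step=30 are assumptions for illustration, not part of OpenCV or pytesseract.

```python
def sample_frames(video_path, step=30):
    """Yield (frame_index, frame) for every `step`-th frame, one frame at a time."""
    import cv2  # imported here so dedupe_consecutive can be used without OpenCV

    cap = cv2.VideoCapture(video_path)
    try:
        index = 0
        while True:
            ret, frame = cap.read()
            if not ret:  # no more frames
                break
            if index % step == 0:
                yield index, frame
            index += 1
    finally:
        cap.release()


def dedupe_consecutive(texts):
    """Drop empty strings and strings equal to the previously kept one."""
    kept = []
    for text in texts:
        if text and (not kept or text != kept[-1]):
            kept.append(text)
    return kept


# Possible usage together with recognize_text and process_text from the script above:
# texts = (process_text(recognize_text(f)) for _, f in sample_frames('1111111111.mp4'))
# for line in dedupe_consecutive(texts):
#     print(line)
```

A larger step makes the run faster but can skip text that is on screen only briefly, so the value would need tuning for each video.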

■ Code to run 2 (generated by OpenAI)

import sys

import cv2
import pytesseract
from pytesseract import Output
import pandas as pd

# Open the MP4 file
cap = cv2.VideoCapture('video.mp4')

# Check that the file was opened successfully
if not cap.isOpened():
    print("Error: Could not open video file.")
    sys.exit(1)

# Tell pytesseract where the Tesseract executable is
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

# Initialize the table columns
data = {'Frame Number': [], 'Text': []}

# Read the video frame by frame and recognize the text in each frame
frame_number = 0
while True:
    # Read one frame from the file
    ret, frame = cap.read()

    # Stop when no frame is returned, which normally means the end of the video
    if not ret:
        break

    # Run OCR; the result is a dict of parallel lists, one entry per detected element
    d = pytesseract.image_to_data(frame, output_type=Output.DICT)

    # Keep every non-empty recognized word together with its frame number
    for i in range(len(d['text'])):
        text = d['text'][i].strip()
        if text:
            data['Frame Number'].append(frame_number)
            data['Text'].append(text)

    frame_number += 1

# Release the video file
cap.release()

# Build the DataFrame
df = pd.DataFrame(data)

# Write the DataFrame to a CSV file
df.to_csv('text_from_video.csv', index=False)
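
The dict returned by image_to_data also carries a conf list with a confidence value for each entry, which the script above does not use. Filtering on it may reduce noise in the CSV. The sketch below is an assumption-laden example: confident_text is a hypothetical helper and the threshold of 60 is arbitrary.

```python
def confident_text(d, min_conf=60.0):
    """Join the words of one image_to_data result whose confidence is at least min_conf."""
    words = []
    for text, conf in zip(d["text"], d["conf"]):
        text = text.strip()
        # conf may be an int, a float, or a numeric string depending on the
        # installed versions; -1 marks rows that are not words.
        if text and float(conf) >= min_conf:
            words.append(text)
    return " ".join(words)
```

In the loop above, confident_text(d) could replace the inner for loop, giving one row of text per frame instead of one row per word.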