图片文字识别-管理敏感词

什么是OCR

Tess4j案例

图片文字识别-管理敏感词

什么是OCR

OCR （Optical Character Recognition，光学字符识别）是指电子设备（例如扫描仪或数码相机）检查纸上打印的字符，通过检测暗、亮的模式确定其形状，然后用字符识别方法将形状翻译成计算机文字的过程

方案	说明
百度OCR	收费
Tesseract-OCR	Google维护的开源OCR引擎，支持Java，Python等语言调用
Tess4J	封装了Tesseract-OCR ，支持Java调用

Tesseract-OCR特点:

Tesseract支持UTF-8编码格式，并且可以“开箱即用”地识别100多种语言。
Tesseract支持多种输出格式:纯文本，hOCR (HTML),PDF等
官方建议，为了获得更好的OCR结果，最好提供给高质量的图像。
Tesseract进行识别其他语言的训练。具体的训练方式，请参考官方提供的文档: https://github.com/tesseract-ocr/tessdoc

Tess4j案例

创建项目导入tess4j对应的依赖

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>4.1.1</version>
</dependency>

导入中文字体库，把资料中的tessdata文件夹拷贝到自己的工作空间下

编写测试类进行测试

import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

import java.io.File;

public class Application {

    public static void main(String[] args) throws TesseractException {
        //创建Tesseract对象
        ITesseract tesseract = new Tesseract();
        //设置字体库路径
        tesseract.setDatapath("D:\\");
        //设置识别的语言-简体中文
        tesseract.setLanguage("chi_sim");
        //执行ocr识别（识别图片）
        String result = tesseract.doOCR(new File("D:\\123.png"));
        //替换回车和tal键  使结果为一行
        System.out.println("识别的结果为："+result);
    }
}

图片文字识别-管理敏感词

一、首先创建一个父工程tess4j

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>2.3.9.RELEASE</version>
    </parent>

    <groupId>org.example</groupId>
    <artifactId>tess4j</artifactId>
    <packaging>pom</packaging>
    <version>1.0-SNAPSHOT</version>
    <modules>
        <module>tess4j-test</module>
        <module>utils</module>
    </modules>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
    </properties>

</project>

其次创建两个子工程分别为：tess4j-test、utils

二、创建utils模块

1.配置utils模块的pom文件

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>tess4j</artifactId>
        <groupId>org.example</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>utils</artifactId>

    <dependencies>
        <dependency>
            <groupId>net.sourceforge.tess4j</groupId>
            <artifactId>tess4j</artifactId>
            <version>4.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-test</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <scope>provided</scope>
        </dependency>
    </dependencies>

</project>

2.创建图片文字识别工具类Tess4jClient

package com.test.utils;

import lombok.Getter;
import lombok.Setter;
import net.sourceforge.tess4j.ITesseract;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import org.springframework.boot.context.properties.ConfigurationProperties;
import org.springframework.stereotype.Component;

import java.io.File;

@Getter
@Setter
@Component
@ConfigurationProperties(prefix = "tess4j")
public class Tess4jClient {

    private String dataPath;//字体库路径
    private String language;//字体库类型：中文  英文  日文等

    public String doOCR(File image) throws TesseractException {
        //创建Tesseract对象
        ITesseract tesseract = new Tesseract();
        //设置字体库路径
        tesseract.setDatapath(dataPath);
        //中文识别
        tesseract.setLanguage(language);
        //执行ocr识别
        String result = tesseract.doOCR(image);
        //替换回车和tal键  使结果为一行
        result = result.replaceAll("\\r|\\n", "-").replaceAll(" ", "");
        return result;
    }
}

3.在resources下创建META-INF文件夹，在该文件夹下创建spring.factories文件并在配置中添加该类,完整如下：

org.springframework.boot.autoconfigure.EnableAutoConfiguration=\
  com.test.utils.Tess4jClient

4.创建自管理敏感词审核工具类SensitiveWordUtil

package com.test.utils;

import java.util.*;

public class SensitiveWordUtil {

    public static Map<String, Object> dictionaryMap = new HashMap<>();

    /**
     * 生成关键词字典库
     * @param words
     * @return
     */
    public static void initMap(Collection<String> words) {
        if (words == null) {
            System.out.println("敏感词列表不能为空");
            return ;
        }
        // map初始长度words.size()，整个字典库的入口字数(小于words.size()，因为不同的词可能会有相同的首字)
        Map<String, Object> map = new HashMap<>(words.size());
        // 遍历过程中当前层次的数据
        Map<String, Object> curMap = null;
        Iterator<String> iterator = words.iterator();
        while (iterator.hasNext()) {
            String word = iterator.next();
            curMap = map;
            int len = word.length();
            for (int i =0; i < len; i++) {
                // 遍历每个词的字
                String key = String.valueOf(word.charAt(i));
                // 当前字在当前层是否存在, 不存在则新建, 当前层数据指向下一个节点, 继续判断是否存在数据
                Map<String, Object> wordMap = (Map<String, Object>) curMap.get(key);
                if (wordMap == null) {
                    // 每个节点存在两个数据: 下一个节点和isEnd(是否结束标志)
                    wordMap = new HashMap<>(2);
                    wordMap.put("isEnd", "0");
                    curMap.put(key, wordMap);
                }
                curMap = wordMap;
                // 如果当前字是词的最后一个字，则将isEnd标志置1
                if (i == len -1) {
                    curMap.put("isEnd", "1");
                }
            }
        }
        dictionaryMap = map;
    }

    /**
     * 搜索文本中某个文字是否匹配关键词
     * @param text
     * @param beginIndex
     * @return
     */
    private static int checkWord(String text, int beginIndex) {
        if (dictionaryMap == null) {
            throw new RuntimeException("字典不能为空");
        }
        boolean isEnd = false;
        int wordLength = 0;
        Map<String, Object> curMap = dictionaryMap;
        int len = text.length();
        // 从文本的第beginIndex开始匹配
        for (int i = beginIndex; i < len; i++) {
            String key = String.valueOf(text.charAt(i));
            // 获取当前key的下一个节点
            curMap = (Map<String, Object>) curMap.get(key);
            if (curMap == null) {
                break;
            } else {
                wordLength ++;
                if ("1".equals(curMap.get("isEnd"))) {
                    isEnd = true;
                }
            }
        }
        if (!isEnd) {
            wordLength = 0;
        }
        return wordLength;
    }

    /**
     * 获取匹配的关键词和命中次数
     * @param text
     * @return
     */
    public static Map<String, Integer> matchWords(String text) {
        Map<String, Integer> wordMap = new HashMap<>();
        int len = text.length();
        for (int i = 0; i < len; i++) {
            int wordLength = checkWord(text, i);
            if (wordLength > 0) {
                String word = text.substring(i, i + wordLength);
                // 添加关键词匹配次数
                if (wordMap.containsKey(word)) {
                    wordMap.put(word, wordMap.get(word) + 1);
                } else {
                    wordMap.put(word, 1);
                }
                i += wordLength - 1;
            }
        }
        return wordMap;
    }
}

三、创建tess4j-test模块

1.配置tess4j-test模块的pom文件（导入utils模块）

2.在tess4j-test中的配置中添加两个属性

tess4j:
  data-path: D:\workspace\tessdata //字体库路径
  language: chi_sim // 字体库类型（简体中文）

3.测试

package com.example;

import com.test.utils.SensitiveWordUtil;
import com.test.utils.Tess4jClient;
import net.sourceforge.tess4j.TesseractException;
import org.junit.jupiter.api.Test;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;

import java.io.File;
import java.util.*;

@SpringBootTest
class Tess4jTestApplicationTests {

    @Autowired
    private Tess4jClient tess4jClient;

    /**
     * 识别图片文本内容，并审核是否包含敏感词
     */
    @Test
    void contextLoads() throws TesseractException {
        //识别图片中的文字
        String result = tess4jClient.doOCR(new File("D:\\123.png"));
        //审核是否包含自管理的敏感词
        Map sensitiveScan = handleSensitiveScan(result);
        boolean flag = (boolean) sensitiveScan.get("flag");
        if (!flag) {
            System.out.println("图中包含敏感词:" + sensitiveScan.get("map"));
        }
    }

    /**
     * 自管理的敏感词审核
     * @param content
     * @return
     */
    private Map<String,Object> handleSensitiveScan(String content) {
        boolean flag = true;
        //模拟查询数据库获取所有的敏感词
        List<String> sensitiveList = new ArrayList<>();
        sensitiveList.add("私人侦探");
        sensitiveList.add("私人调查");
        sensitiveList.add("企业打侵");
        //初始化敏感词库
        SensitiveWordUtil.initMap(sensitiveList);
        //查看文章中是否包含敏感词
        Map<String, Integer> map = SensitiveWordUtil.matchWords(content);
        if(map.size() >0){
            flag = false;
        }
        Map<String,Object> resultMap = new HashMap();
        resultMap.put("flag",flag);
        resultMap.put("map",map);
        return resultMap;
    }
}

图片如下：