01
环境准备
1、windows 环境下载 exe
http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-4.00.00dev.exe
双击 exe,一路 next 完成 Tesseract-OCR 安装
2、配置环境变量
PATH 增加 D:\ProgramFiles\Tesseract-OCR
新建环境变量 TESSDATA_PREFIX 值为
D:\ProgramFiles\Tesseract-OCR\tessdata
这是将语言字库文件夹添加到环境变量 TESSDATA_PREFIX 中
CMD 命令行窗口输入如下命令:
查看版本号
C:\Users\18611>tesseract -v
tesseract 4.00.00alpha
leptonica-1.74.1
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.5.0) : libpng 1.6.20: libtiff 4.0.6 : zlib 1.2.8 :
libwebp 0.4.3 : libopenjp2 2.1.0
查看支持的语言包
C:\Users\18611>tesseract --list-langs
List of available languages (2):
eng
osd
C:\Users\18611>
02
命令识别图片
识别如下图片验证码
使用 tesseract 命令识别图片中的内容
C:\Users\18611>cd Desktop
C:\Users\18611\Desktop>tesseract test2.png output
Tesseract Open Source OCR Engine v4.00.00alpha with Leptonica
C:\Users\18611\Desktop>
【语法】:tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile…]
imagename 为目标图片文件名,需加格式后缀;
outputbase 是转换结果文件名;
lang 是语言名称(在 Tesseract-OCR 中 tessdata 文件夹可看到以 eng 开头的语言文件 eng.traineddata),如不标-l eng 则默认为 eng。
03
java自动识别图片
将 tesseract.exe 命令保存为 bat 文件,bat 内容为:
//图片路径 D:\Tesseract-OCR\test.png 生成 txt 文件存放路径及文件名 result
代码实现如下:
package com.mtx.util;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
/** * @ClassName ReadCpacha
* @Description TODO
* @Author 彩虹 rainbow QQ3130978832
* @Date-Time 2022/6/9 13:55
* @ProjectName MtxPublic
* @Copyright 北京码同学网络科技有限公司
**/
public class ReadCpacha{
public static String readPic(){
String cmd= "cmd /c start D:\\Tesseract-OCR\\tesseract.bat";
try {
Runtime.getRuntime().exec(cmd);
} catch(Exception e) {
e.printStackTrace();
}
try {
//线程阻塞 3 秒等待 tesseract.exe 执行完成
Thread.sleep(3000);
}catch (InterruptedException e) {
e.printStackTrace();
}
//执行 tesseract.exe 识别图片后生成 result.txt 文件中保存识别后验证码
//读取 result.txt 文件获取验证码
// ReadTxt
BufferedReader bufferedReader = new BufferedReader(inputStreamReader);
StringBuffer sb= new StringBuffer();
String text = null;
while((text = bufferedReader.readLine()) != null){
//逐行读取到的字符串存到 StringBuffer 对象
sb.append(text);
}
return sb.toString();
}catch (Exception e) {
e.printStackTrace();
}
}
return null;
}
public static void main(String[] args) {
String str = readPic();//调用封装方法测试
System.out.println(str);
}
}
C:\Users\18611\IdeaProjects\MtxPublic>tesseract --help-psm
Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
C:\Users\18611\IdeaProjects\MtxPublic>