tesseract-ocr: improving CAPTCHA recognition accuracy --- how to train a recognition library

阿新 • Published: 2019-01-30

For background on OCR CAPTCHA recognition, see the other article on this blog.

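This article is a follow-up on using tesseract-ocr: what can you do when the default recognition library gives a low recognition rate?

No need to worry. The tools that ship with tesseract-ocr let you train on your own sample images, with manual correction of the results, to improve the recognition rate. Let's walk through the steps.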

1 Download and install tesseract version 3.02.

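2 If your training material is a set of images that are not in TIFF format, the first thing to do is merge these images into a single file. (In my experience, the more samples you have, and the more completely they cover every letter and digit, the better the trained result recognizes.)

First convert the jpg, gif, or bmp files to TIFF; the Paint program that comes with Windows can do this. Then use VietOCR.NET-3.3 to merge the TIFF files into one multi-page TIFF.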
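To check the coverage point in step 2 before training, a rough sketch like the one below can report which characters never appear in your samples. The alphabet (uppercase letters plus digits) and the sample labels are hypothetical; substitute whatever your CAPTCHAs actually contain.

```python
import string
from collections import Counter

# Assumption: the CAPTCHAs use uppercase letters and digits only.
ALPHABET = string.ascii_uppercase + string.digits


def character_coverage(labels, alphabet=ALPHABET):
    """Return (counts, missing) for the characters seen in the sample labels."""
    counts = Counter(ch for label in labels for ch in label if ch in alphabet)
    missing = sorted(set(alphabet) - set(counts))
    return counts, missing


if __name__ == "__main__":
    # Hypothetical ground-truth texts of the training images.
    sample_labels = ["A7K2", "9QZB", "A1A1"]
    counts, missing = character_coverage(sample_labels)
    print("most common:", counts.most_common(3))
    print("never seen:", "".join(missing))
```

Any character listed as never seen has no training sample at all, so it is worth collecting more images before moving on.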

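3 Make Box Files. Open a command prompt in the directory that contains orderNo.tif (the merged TIFF; the commands in this article use the file name lang.jhy.exp8.TIF instead) and enter: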

C:\Program Files\Tesseract-OCR>tesseract.exe lang.jhy.exp8.TIF lang.jhy.exp8 batch.nochop makebox

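4 Download the jTessBoxEditor tool and use it to open orderNo.tif. Remember that the orderNo.box file generated in step 3 must be in the same directory as orderNo.tif. Correct the characters one by one, then save. (Note the next-page control: the file has several pages, so correct them page by page.)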

5 Run Tesseract for Training. Enter the command:

C:\Program Files\Tesseract-OCR>tesseract.exe lang.jhy.exp8.TIF lang.jhy.exp8 nobatch box.train

A supplementary note on the naming format used above (lang.jhy.exp8.TIF):

Make Box Files

For the next step below, Tesseract needs a 'box' file to go with each training image. The box file is a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around the image. Tesseract 3.0 has a mode in which it will output a text file of the required format, but if the character set is different to its current training, it will naturally have the text incorrect. So the key process here is to manually edit the file to put the correct characters in it.
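To make the box file format concrete, here is a minimal sketch that parses one line and applies a manual correction. It assumes the six-field layout written by Tesseract 3.x in makebox mode, `<char> <left> <bottom> <right> <top> <page>`, with coordinates measured from the bottom-left corner of the image. The `Box` type and the sample line are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line):
    """Parse one '<char> <left> <bottom> <right> <top> <page>' line."""
    # Split from the right so the five numeric fields are always isolated.
    char, *numbers = line.rstrip("\n").rsplit(" ", 5)
    left, bottom, right, top, page = (int(n) for n in numbers)
    return Box(char, left, bottom, right, top, page)


def format_box_line(box):
    """Turn a Box back into the text form stored in the .box file."""
    return "{} {} {} {} {} {}".format(*box)


if __name__ == "__main__":
    # Hypothetical line in which the digit 0 was misread as the letter O.
    box = parse_box_line("O 12 8 30 40 0")
    fixed = box._replace(char="0")  # the manual correction step
    print(format_box_line(fixed))
```

This is the same edit that jTessBoxEditor performs through its interface: only the character changes, the coordinates stay as they were.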

Run Tesseract on each of your training images using this command line:

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
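The `[lang].[fontname].exp[num].tif` convention can be sketched in code as follows. The helper names are hypothetical; only the pattern itself comes from the text above.

```python
import re

# Matches names such as lang.jhy.exp8.TIF (extension compared case-insensitively).
NAME_RE = re.compile(r"^([^.]+)\.([^.]+)\.exp(\d+)\.tif$", re.IGNORECASE)


def training_base(lang, fontname, num):
    """Build the '[lang].[fontname].exp[num]' base name."""
    return "{}.{}.exp{}".format(lang, fontname, num)


def parse_training_name(filename):
    """Split a training image name into (lang, fontname, num)."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError("not a [lang].[fontname].exp[num].tif name: " + filename)
    lang, fontname, num = match.groups()
    return lang, fontname, int(num)


if __name__ == "__main__":
    print(parse_training_name("lang.jhy.exp8.TIF"))
    print(training_base("lang", "jhy", 8) + ".tif")
```

In the example used throughout this article, `lang` is the language field, `jhy` is the font name, and `8` is the sample number.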

6 Compute the Character Set. Enter the command:

C:\Program Files\Tesseract-OCR>unicharset_extractor.exe lang.jhy.exp8.box

Extracting unicharset from lang.jhy.exp8.box

Wrote unicharset file ./unicharset.

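7 Create the file "font_properties". From version 3.01 on, you need to create a file named "font_properties" in the working directory and put the following text in it (jhy here is the middle field of lang.jhy.exp8):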

jhy 1 0 0 1 0

Then run mftraining:

C:\Program Files\Tesseract-OCR>mftraining.exe -F font_properties -U unicharset lang.jhy.exp8.tr

Warning: No shape table file present: shapetable

Reading lang.jhy.exp8.tr ...

Flat shape table summary: Number of shapes = 18 max unichars = 1 number with multiple unichars = 0

Done!
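Each line of font_properties has the form `<fontname> <italic> <bold> <fixed> <serif> <fraktur>`, where each flag is 0 or 1. A small sketch for generating such a line, with a hypothetical helper name:

```python
def font_properties_line(fontname, italic=False, bold=False, fixed=False,
                         serif=False, fraktur=False):
    """Return '<fontname> <italic> <bold> <fixed> <serif> <fraktur>' with 0/1 flags."""
    flags = (italic, bold, fixed, serif, fraktur)
    return " ".join([fontname] + [str(int(flag)) for flag in flags])


if __name__ == "__main__":
    # Reproduces the line from step 7: italic and serif set, the rest clear.
    print(font_properties_line("jhy", italic=True, serif=True))
```

The font name must match the middle field of the training file name, otherwise mftraining cannot look up the properties for the .tr file.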

8 Clustering. Enter the command:

C:\Program Files\Tesseract-OCR>cntraining.exe lang.jhy.exp8.tr

Reading lang.jhy.exp8.tr ...

Clustering ...

Writing normproto ...
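Steps 5 to 8 always run the same four commands in the same order, so they can be scripted. The sketch below only assembles the command lines shown above. It assumes the Tesseract tools are on the PATH and that the box file has already been corrected; the function names are hypothetical.

```python
import subprocess


def training_commands(base, font_properties="font_properties"):
    """Return the step 5-8 command lines for one '[lang].[fontname].exp[num]' base."""
    return [
        ["tesseract", base + ".tif", base, "nobatch", "box.train"],
        ["unicharset_extractor", base + ".box"],
        ["mftraining", "-F", font_properties, "-U", "unicharset", base + ".tr"],
        ["cntraining", base + ".tr"],
    ]


def run_training(base, workdir="."):
    """Run each command in order, stopping at the first failure."""
    for command in training_commands(base):
        print("running:", " ".join(command))
        subprocess.run(command, cwd=workdir, check=True)


if __name__ == "__main__":
    # Only prints the commands; call run_training() to execute them.
    for command in training_commands("lang.jhy.exp8"):
        print(" ".join(command))
```

Passing `check=True` makes the script stop as soon as one tool fails, which keeps a later step from running on stale files.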

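9 At this point several files should have been generated in the directory. Add the prefix "selfverify." to the unicharset, inttemp, normproto, and pffmtable files. (In 3.02 mftraining also writes a shapetable file; give it the same prefix if it is present.) Then combine them into a single traineddata file. In the standard Tesseract 3 training procedure this is done by running combine_tessdata with the prefix, trailing dot included, as its argument.

In the output, check that the data on lines 1, 3, 4, 5, and 13 is not -1. If so, the new dictionary has been generated.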
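The renaming in step 9 is easy to get wrong by hand, so here is a minimal sketch of it. The helper names are hypothetical. The `combine_tessdata` command line is also an assumption: it is the standard Tesseract 3 tool for this step, but the command is not shown in this article.

```python
from pathlib import Path

# The four files named in step 9. shapetable (written by mftraining in 3.02)
# can be added to this tuple if it is present.
TRAINING_FILES = ("unicharset", "inttemp", "normproto", "pffmtable")


def add_prefix(workdir, prefix="selfverify.", names=TRAINING_FILES):
    """Rename each training output to '<prefix><name>' and return the new paths."""
    workdir = Path(workdir)
    missing = [name for name in names if not (workdir / name).exists()]
    if missing:
        # Fail before renaming anything so the directory is never half-renamed.
        raise FileNotFoundError("missing training files: " + ", ".join(missing))
    renamed = []
    for name in names:
        target = workdir / (prefix + name)
        (workdir / name).rename(target)
        renamed.append(target)
    return renamed


def combine_command(prefix="selfverify."):
    """Assumed final command: combine_tessdata takes the prefix, dot included."""
    return ["combine_tessdata", prefix]


if __name__ == "__main__":
    import tempfile
    with tempfile.TemporaryDirectory() as demo_dir:
        for name in TRAINING_FILES:
            (Path(demo_dir) / name).write_bytes(b"")  # empty stand-in files
        print([path.name for path in add_prefix(demo_dir)])
        print(" ".join(combine_command()))
```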

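Now copy the "selfverify.traineddata" file from this directory into the "tessdata" directory under the tesseract program directory.

From then on you can use this dictionary for recognition, for example: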

tesseract.exe test.jpg out -l selfverify

With the newly trained language, the recognition rate improved considerably.
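To put a number on the improvement, one option is to measure the exact-match rate on a set of labelled images before and after training. The sketch below is an illustration, not a measurement from this article: `recognize` is a hypothetical callable that maps an image path to the recognized text, and the sample data is made up.

```python
def recognition_rate(samples, recognize):
    """Fraction of (image_path, expected_text) pairs that `recognize` gets exactly right."""
    samples = list(samples)
    if not samples:
        raise ValueError("need at least one labelled sample")
    correct = sum(1 for path, expected in samples
                  if recognize(path).strip() == expected)
    return correct / len(samples)


if __name__ == "__main__":
    # Stand-in recognizer so the sketch runs without Tesseract installed.
    fake_results = {"a.jpg": "7K2Q", "b.jpg": "ABCD\n", "c.jpg": "0000"}
    labelled = [("a.jpg", "7K2Q"), ("b.jpg", "ABCD"), ("c.jpg", "O0O0")]
    print(recognition_rate(labelled, fake_results.get))
```

In practice `recognize` would run the tesseract command shown above with `-l selfverify` and read the text file it writes. Running the same samples with the default language gives the baseline to compare against.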
