Using Tesseract-OCR on Ubuntu (building, installing, using, and training a new language)

阿新 • Published: 2019-01-05

This document explains how to train Tesseract 3 for a new language. It is a translation of the training page on the official tesseract-ocr wiki.

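1. Introduction

Tesseract 3.0x supports training. This article describes the training process, gives guidelines that apply to a variety of languages, and explains what results to expect. For training Tesseract 2.0x, see TrainingTesseract.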

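2. Background and limitations

Tesseract was originally designed to recognize English text only. Much work has gone into making the engine and the training system handle other languages and UTF-8 characters. Tesseract 3.0 can process any UTF-8 characters, but the range of languages it handles successfully is still limited, so keep this section in mind before expecting it to work well on your language.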

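Tesseract 3.01 added support for languages written top to bottom, and Tesseract 3.02 added Hebrew (written right to left). Tesseract can now also use an auxiliary engine called cube to handle scripts such as Arabic.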

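Tesseract is slower to train and to recognize languages with large character sets (such as Chinese), but it does work. (Translator's note: the very large character set increases training and matching time.)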

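Tesseract needs to learn the shapes of each character across fonts, with the fonts kept explicitly separate. The number of fonts used to be limited to 32 and has been raised to 64; the limit is set by the constant MAX_NUM_CONFIGS in intproto.h. Note that running time depends heavily on the number of fonts, and training with more than 32 is very slow.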

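If the language you are training uses different punctuation and digits, it will be at a disadvantage with some hard-coded algorithms that assume ASCII punctuation and digits. This limitation was fixed in 3.0x (x >= 2).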

Run all of the commands below from the directory that contains your input files.

3. Additional libraries required

From 3.03 onward, building the training tools requires some additional libraries:

sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev

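4. Building the training tools

From 3.03 onward, if you compile Tesseract from source, you need to build and install the training tools with separate make commands: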

make training
sudo make training-install

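5. Data files required

To train for another language, you create a number of data files in the tessdata subdirectory and then combine them into a single file with combine_tessdata. The naming convention is languagecode.file_name, and language codes should preferably follow ISO 639-3. The training files for English (3.00) are: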

tessdata/eng.config
tessdata/eng.unicharset
tessdata/eng.unicharambigs
tessdata/eng.inttemp
tessdata/eng.pffmtable
tessdata/eng.normproto
tessdata/eng.punc-dawg
tessdata/eng.word-dawg
tessdata/eng.number-dawg
tessdata/eng.freq-dawg

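The final combined file is:

tessdata/eng.traineddata

and the file

tessdata/eng.user-words

may still be provided separately. The combined traineddata file is simply a concatenation of the input files, with a table of contents that records the offsets of the known file types. See ccutil/tessdatamanager.h in the source code for the list of file names currently accepted. Note that the files inside traineddata differ from those used before 3.00, and they may change considerably in future revisions.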

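5.1 Requirements for text input files

Text input files (lang.config, lang.unicharambigs, font_properties, box files, word lists for dictionaries, ...) must meet the following criteria:

ASCII or UTF-8 encoding without a BOM
Unix line endings ('\n')
The last character must be a line ending ('\n'). Otherwise you get an error message containing "last_char == '\n':Error:Assert failed..."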
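The three criteria above can be checked before a file is handed to the training tools. The following is a minimal sketch; the function name and the wording of the messages are this example's own and are not part of Tesseract.

```python
def check_training_text_file(data: bytes) -> list[str]:
    """Return a list of problems found in the raw bytes of a text input file."""
    problems = []
    # A UTF-8 byte order mark at the start of the file is not allowed.
    if data.startswith(b"\xef\xbb\xbf"):
        problems.append("file starts with a UTF-8 BOM")
    # ASCII is a subset of UTF-8, so one decode attempt covers both encodings.
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("file is not valid UTF-8")
    # Any carriage return means Windows or old Mac line endings.
    if b"\r" in data:
        problems.append("file contains carriage returns (non-Unix line endings)")
    # The last byte must be a newline.
    if not data.endswith(b"\n"):
        problems.append("file does not end with a newline")
    return problems
```

To use it, read the file in binary mode, for example `check_training_text_file(open("font_properties", "rb").read())`; an empty list means none of the three problems was found.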

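5.2 What can you ignore?

The unicharset, inttemp, normproto, and pffmtable files must always be created. If you only need to recognize a single font or a few similar ones, a simple training run may be enough. The other files are not strictly needed, but they can improve accuracy, depending on your requirements. The old DangAmbigs file has been replaced by unicharambigs.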

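6. Training procedure

Much of the process is automated, but some steps are still manual. More automated tools may come later, at the cost of a more complicated setup. The tools mentioned below are all built in the training subdirectory.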

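6.1 Generating training images

First decide on the full character set to be used, then prepare a text that contains those characters. Keep the following points in mind when creating the training text:

Make sure there is a minimum number of samples of each character. 10 is good, but 5 is OK for rare characters.
There should be more samples of the frequent characters, at least 20.
Don't group punctuation and digits together; spread them through the text. For example, "The quick brown fox jumps over the lazy dog. 0123456789 !@#$%^&(),.{}<>/?" is poor. Much better is "The (quick) brown {fox} jumps! over the $3,456.78 #90 dog & duck/goose, as 12.5% of E-mail from aspammer@website.com is spam?" This gives the textline-finding code a much better chance of finding a sensible baseline for the special characters.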
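The sample-count guideline can be checked mechanically against a draft training text. This is a small sketch; the function name is hypothetical, and the default threshold of 5 is the figure quoted above for rare characters.

```python
from collections import Counter


def undersampled_characters(text: str, minimum: int = 5) -> dict[str, int]:
    """Map each non-whitespace character seen fewer than `minimum` times to its count."""
    counts = Counter(ch for ch in text if not ch.isspace())
    return {ch: n for ch, n in counts.items() if n < minimum}
```

Calling it with `minimum=10` or `minimum=20` applies the stricter guidelines. It only reports characters that occur at least once, so characters missing from the text altogether have to be checked against the intended character set separately.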

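6.2 Automated method: creating tif/box files

Prepare a UTF-8 text file (training_text.txt) containing the characters you need. Obtain the fonts you want to recognize (TrueType or other formats). Then run the following command for each font to create a matching tif/box pair: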

training/text2image --text=training_text.txt --outputbase=[lang].[fontname].exp0 --font='Font Name' --fonts_dir=/path/to/your/fonts

Note that the --font argument may contain spaces and must therefore be quoted. For example:

training/text2image --text=training_text.txt --outputbase=eng.TimesNewRomanBold.exp0 --font='Times New Roman Bold' --fonts_dir=/usr/share/fonts

text2image has many other command-line arguments; see training/text2image.cpp for details.
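Running the command once per font is easy to script. The sketch below only builds the argument list, using the four options shown above; the helper name is hypothetical, and stripping spaces from the font name to form the output base is this example's own convention.

```python
def text2image_command(lang, font_name, text_file, fonts_dir, exp=0):
    """Build the text2image argument list for one font."""
    # Output base follows the [lang].[fontname].exp[num] pattern. Removing the
    # spaces from the font name is an assumption made by this sketch.
    fontname = font_name.replace(" ", "")
    return [
        "training/text2image",
        f"--text={text_file}",
        f"--outputbase={lang}.{fontname}.exp{exp}",
        f"--font={font_name}",
        f"--fonts_dir={fonts_dir}",
    ]
```

Because each option is a separate list element, the list can be passed to `subprocess.run(..., check=True)` as it is, and a font name containing spaces needs no shell quoting.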

If text2image works for your application, great! You can skip straight to section 6.4, Running Tesseract for training.

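6.3 Manual method: creating tif/box files

6.3.1 Obtaining tif images

Spacing must be prominent when the text is printed, so increase the inter-character and inter-line spacing in your text editor. Insufficient spacing can cause "FAILURE! box overlaps no blobs or blobs in multiple rows" errors when the *.tr files are generated, which in turn leads to a second problem: some character "x" ends up with no samples, producing "Error: X classes in inttemp while unicharset contains Y unichars", and the training data is then unusable. This will be addressed in the future, but it still happens in 3.00.
Training data should be grouped by font. Ideally, all samples of a single font go in a single tiff file; this may be a multi-page tiff (if you have libtiff or leptonica installed), so the training data for one font can span many pages and many thousands of characters, which makes training for large-character-set languages possible.
Do not mix fonts in one image file. Doing so causes features to be dropped at clustering, which leads to recognition errors.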

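The next step is to print and scan the pages to create images of your training pages. Up to 64 training files are supported. It is best to include a mix of fonts and styles, including italic and bold, but in separate files.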

Note that training from real images is somewhat difficult because of the spacing requirements. This will be improved in a future release.

You also need to save your training text as a UTF-8 text file, which is used in a later step.

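A note on large amounts of training data: the limit of 64 images is a limit on the number of fonts. Each font should be put in a single multi-page tiff, and the box file can specify the page number after the coordinates of each character. An arbitrarily large amount of training data can therefore be created for any given font, which allows training for large-character-set languages. As an alternative to a multi-page tiff, you can create several single-page tiffs for one font, and then you must cat together the tr files for each font into several single-font tr files. Either way, each tr file given to mftraining must contain a single font.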

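6.3.2 Making box files

Next, Tesseract needs a 'box' file for each training image. The box file is a text file that lists the characters in the training image in order, one per line, with the coordinates of the bounding box around each one. Tesseract 3.0 has a mode that outputs a box file in the required format, which you then edit by hand so that each character and its position are correct.

Run the following command on each training image: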

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

For example:

tesseract eng.timesitalic.exp0.tif eng.timesitalic.exp0 batch.nochop makebox

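Now comes the hard part. Edit the file [lang].[fontname].exp[num].box and put the correct UTF-8 character at the start of each line, in place of the incorrect character Tesseract produced. For example, here are lines 141-154 of the box file output for the sample image eurotext.tif: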

s 734 494 751 519 0
p 753 486 776 518 0
r 779 494 796 518 0
i 799 494 810 527 0
n 814 494 837 518 0
g 839 485 862 518 0
t 865 492 878 521 0
u 101 453 122 484 0
b 126 453 146 486 0
e 149 452 168 477 0
r 172 453 187 476 0
d 211 451 232 484 0
e 236 451 255 475 0
n 259 452 281 475 0

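Because Tesseract was run in English mode, it does not recognize the umlaut correctly. The character has to be typed in with a suitable editor; any editor that handles UTF-8 will do. An HTML editor is one option (Mozilla on Linux can edit UTF-8 text directly, while Firefox and IE cannot). MS Word handles different text encodings, and so does Notepad++. Linux and Windows both have a character map for copying characters that cannot be typed. In the lines above, for instance, the u should become ü.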

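In theory, each line in the box file represents one character. A character that is split horizontally, such as the lower double quote „, may be given two boxes, which you have to merge. For example, lines 116-129: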

D 101 504 131 535 0
e 135 502 154 528 0
r 158 503 173 526 0
, 197 498 206 510 0
, 206 497 214 509 0
s 220 501 236 526 0
c 239 501 258 525 0
h 262 502 284 534 0
n 288 501 310 525 0
e 313 500 332 524 0
l 336 501 347 534 0
l 352 500 363 532 0
e 367 499 386 524 0
” 389 520 407 532 0

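Note that columns 2-5 are the left, bottom, right, and top edges of the box, and the last column is the page number within a multi-page tiff. The coordinate system has its origin (0,0) at the bottom-left corner of the image. After merging: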

D 101 504 131 535 0
e 135 502 154 528 0
r 158 503 173 526 0
„ 197 497 214 510 0
s 220 501 236 526 0
c 239 501 258 525 0
h 262 502 284 534 0
n 288 501 310 525 0
e 313 500 332 524 0
l 336 501 347 534 0
l 352 500 363 532 0
e 367 499 386 524 0
” 389 520 407 532 0
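Merging two boxes amounts to taking the smallest and largest coordinate on each axis. The sketch below reproduces the merge of the two comma boxes shown above; the `Box` type and function names are hypothetical, and the parser assumes the character field contains no spaces.

```python
from typing import NamedTuple


class Box(NamedTuple):
    char: str
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    page: int


def parse_box_line(line: str) -> Box:
    """Parse one box-file line of the form '<char> <n> <n> <n> <n> <page>'."""
    char, x_min, y_min, x_max, y_max, page = line.split()
    return Box(char, int(x_min), int(y_min), int(x_max), int(y_max), int(page))


def merge_boxes(a: Box, b: Box, char: str) -> Box:
    """Return the smallest box covering both a and b, labelled with `char`."""
    if a.page != b.page:
        raise ValueError("boxes are on different pages")
    return Box(
        char,
        min(a.x_min, b.x_min),
        min(a.y_min, b.y_min),
        max(a.x_max, b.x_max),
        max(a.y_max, b.y_max),
        a.page,
    )


def format_box(box: Box) -> str:
    """Format a box back into a box-file line."""
    return f"{box.char} {box.x_min} {box.y_min} {box.x_max} {box.y_max} {box.page}"
```

Merging `, 197 498 206 510 0` with `, 206 497 214 509 0` and relabelling the result as „ gives the line `„ 197 497 214 510 0`, which matches the hand-edited result.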

A number of visual box-file editors are available; see the AddOns wiki page.

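6.3.3 Bootstrapping a new character set

If you are training a completely new character set, you can make box files for one font, run the rest of the training process below to produce a traineddata file, and then use Tesseract with that new language to make the box files for the other fonts: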

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] -l yournewlanguage batch.nochop makebox

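This should make the characters in your box files more accurate and reduce the editing needed. You can use this method to add new fonts, but note that there is no training mode that adds new training data to an existing traineddata file. Every run of mftraining and cntraining builds new data files from scratch from the tr files you supply; they cannot add directly to an existing intproto/pffmtable/normproto.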

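6.3.4 Provided tif/box pairs

Some tif/box file pairs are available on the downloads page. (Note that the tif files are G4 compressed, so you need libtiff to read them.) You can use them to build better training data as follows:

1. Filter the box files, keeping only the lines for the characters you want.
2. Run tesseract for training (described below).
3. Cat the .tr files from multiple languages for each font to get the character set you want, and add the .tr files from your own fonts or characters.
4. Cat the filtered box files in the same way as the .tr files, for handing off to unicharset_extractor.
5. Run the rest of the training process.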

Caveat: this is not as simple as it sounds. cntraining and mftraining can only take up to 64 .tr files, so you must cat all the files from multiple languages for the same font together to make 64 language-combined, but font-individual files. The characters found in the tr files must match the sequence of characters found in the box files when given to unicharset_extractor, so you have to cat the box files together in the same order as the tr files. The command lines for cn/mftraining and unicharset_extractor must be given the .tr and .box files (respectively) in the same order, just in case you have different filtering for the different fonts. There may be a program available to do all this and pick out the characters in the style of a character map, which might make the whole thing easier.
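Filtering a box file down to the characters you want is a simple line filter. A minimal sketch, with a hypothetical function name, that keeps a line when its first field is in the wanted set:

```python
def filter_box_lines(lines, wanted):
    """Keep only the box lines whose character (first field) is in `wanted`."""
    kept = []
    for line in lines:
        fields = line.split()
        # Skip blank lines; compare the character field against the wanted set.
        if fields and fields[0] in wanted:
            kept.append(line)
    return kept
```

The relative order of the remaining lines is preserved, which matters because the box files and the .tr files have to stay in the same sequence.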


6.4 Running Tesseract for training

For each training image and box file pair, run:

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train

or

tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train.stderr

The first form sends all errors to tesseract.log; the second sends them to stderr.

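Note that the box file must have the same base name as the tif file and be in the same directory, or Tesseract will not find it. The output of this step is fontfile.tr, which contains the features of each character on the training page. [lang].[fontname].exp[num].txt will also be written with a single newline and no text.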
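Since a missing or misnamed box file only shows up when Tesseract runs, it can help to check the pairs first. A minimal sketch with a hypothetical helper name, assuming the images use the .tif extension:

```python
from pathlib import Path


def missing_box_files(directory):
    """Return the names of .tif files in `directory` with no .box file beside them."""
    directory = Path(directory)
    return sorted(
        tif.name
        for tif in directory.glob("*.tif")
        # with_suffix swaps only the final extension: x.exp0.tif -> x.exp0.box
        if not tif.with_suffix(".box").exists()
    )
```

An empty result means every image in the directory has a box file with the same base name.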

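It is important to check the apply_box output for errors. If FATALITIES are reported, there is no point continuing the training process until you have fixed the box file. The box.train.stderr config file makes the output easier to locate. A FATALITY usually means that this step failed to find any training samples for one of the characters listed in your box file: either the coordinates are wrong or there is something wrong with the image of that character. If a character has no workable sample, it cannot be recognized, the generated inttemp file will not match the unicharset file, and Tesseract will abort.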

Another error that can occur, and is also fatal, is "Box file format error on line n". If preceded by "Bad utf-8 char..." then the utf-8 codes are incorrect and need to be fixed. The error "utf-8 string too long..." indicates that you have exceeded the 24 byte limit on a character description. If you need a description longer than 24 bytes, please file an issue.

There is no need to edit the contents of the [lang].[fontname].exp[num].tr file. Its format is as follows:

Every character in the box file has a corresponding set of entries in
the .tr file (in order) like this
UnknownFont <utf8 code(s)> 2
mf <number of features>
x y length dir 0 0
... (there are a set of these determined by <number of features>
above)
cn 1
ypos length x2ndmoment y2ndmoment

The mf features are polygon segments of the outline normalized to the
1st and 2nd moments.
x = x position [-0.5, 0.5]
y = y position [-0.25, 0.75]
length is the length of the polygon segment [0, 1.0]
dir is the direction of the segment [0, 1.0]

The cn feature is to correct for the moment normalization to
distinguish position and size (e.g. c vs C and , vs ')

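6.5 Computing the character set

Tesseract needs to know the set of characters it can output. To generate the unicharset data file, run the unicharset_extractor program on the box files: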

unicharset_extractor lang.fontname.exp0.box lang.fontname.exp1.box …

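Tesseract also needs access to character properties: isalpha, isdigit, isupper, islower, ispunctuation. This data is encoded in the unicharset file, one character per line. The UTF-8 character is followed by a hexadecimal number that is a binary mask of its properties, one bit per property; a bit set to 1 means the property is true. The bit order, from least significant to most significant, is: isalpha, islower, isupper, isdigit, ispunctuation.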

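Examples:

';' is a punctuation character, so its properties are 10000 (0x10).
'b' is a lower-case alphabetic character, so its properties are 00011 (0x3).
'W' is an upper-case alphabetic character, so its properties are 00101 (0x5).
'7' is a digit, so its properties are 01000 (0x8).
'=' is neither punctuation nor alphabetic, so its properties are 00000 (0x0).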

; 10 Common 46
b 3 Latin 59
W 5 Latin 40
7 8 Common 66
= 0 Common 93

Alphabetic characters in Chinese and Japanese are likewise represented by the lowest bit, i.e. 00001 (0x1).

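If your system supports the wide-character functions, these values are set automatically by unicharset_extractor and there is no need to edit the unicharset file. On older systems (such as Windows 95), you may have to edit it by hand.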
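For checking a unicharset by hand, the bit layout can be reproduced in a few lines. This is only an approximation: it uses Python's own Unicode predicates, which are an assumption of this sketch and not necessarily what unicharset_extractor uses internally, and both function names are hypothetical.

```python
import unicodedata

PROPERTY_NAMES = ["isalpha", "islower", "isupper", "isdigit", "ispunctuation"]


def property_mask(ch: str) -> int:
    """Compute the five-bit property mask for a single character."""
    mask = 0
    if ch.isalpha():
        mask |= 0x01  # bit 0: isalpha
    if ch.islower():
        mask |= 0x02  # bit 1: islower
    if ch.isupper():
        mask |= 0x04  # bit 2: isupper
    if ch.isdigit():
        mask |= 0x08  # bit 3: isdigit
    # Unicode general categories starting with "P" are punctuation.
    if unicodedata.category(ch).startswith("P"):
        mask |= 0x10  # bit 4: ispunctuation
    return mask


def describe_mask(mask: int) -> list[str]:
    """List the property names whose bits are set in `mask`."""
    return [name for bit, name in enumerate(PROPERTY_NAMES) if mask & (1 << bit)]
```

`format(property_mask("b"), "x")` yields the hexadecimal form used in the file. On the five example characters above, the function reproduces the listed values.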

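Note that the unicharset file must be regenerated whenever inttemp, normproto, and pffmtable are generated (in other words, they must all be recreated whenever the box files change).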

The last two columns give the script type (Latin, Common, Greek, Cyrillic, Han, null) and the id code of the character in the given language.

6.5.1 set_unicharset_properties (new in 3.03)

In 3.03, a new tool and data file can be used to add extra properties to the unicharset, mostly character sizes:

training/set_unicharset_properties -U input_unicharset -O output_unicharset --script_dir=training/langdata

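6.6 font_properties (new in 3.01)

From 3.01 on, the training process requires a font_properties file. Its purpose is to provide font style information that appears in the output when the font is recognized. font_properties is a text file passed to mftraining with the -F filename option.

Each line of the font_properties file has this format: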

<fontname> <italic> <bold> <fixed> <serif> <fraktur>

Here <fontname> is a string naming the font (no spaces allowed), and <italic>, <bold>, <fixed>, <serif>, and <fraktur> are each 0 or 1 to indicate whether the font has that property.

When mftraining runs, the font name of each .tr file must match a fontname entry in the font_properties file, or mftraining will abort.

Example:
font_properties file:

timesitalic 1 0 0 1 0

shapeclustering -F font_properties -U unicharset eng.timesitalic.exp0.tr
mftraining -F font_properties -U unicharset -O eng.unicharset eng.timesitalic.exp0.tr

Note that in 3.03 a default font_properties file is provided at training/langdata/font_properties. It covers about 3000 fonts, though not necessarily accurately.
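A quick pre-flight check can catch a font that has no font_properties entry before mftraining is run. The sketch below is hypothetical helper code; it assumes the .tr files follow the [lang].[fontname].exp[num].tr naming used throughout this document and takes the font name from the second dot-separated field of the file name.

```python
def parse_font_properties(text):
    """Map each font name to its (italic, bold, fixed, serif, fraktur) flags."""
    fonts = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Expect a name followed by exactly five 0/1 flags.
        if len(fields) != 6 or any(f not in ("0", "1") for f in fields[1:]):
            raise ValueError(f"malformed font_properties line: {line!r}")
        fonts[fields[0]] = tuple(f == "1" for f in fields[1:])
    return fonts


def fonts_without_entry(tr_filenames, fonts):
    """Return font names used by the .tr files that have no font_properties entry."""
    missing = []
    for name in tr_filenames:
        fontname = name.split(".")[1]  # [lang].[fontname].exp[num].tr
        if fontname not in fonts and fontname not in missing:
            missing.append(fontname)
    return missing
```

With the one-line example file above, `eng.timesitalic.exp0.tr` passes the check, while a file for any other font is reported.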

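6.7 Clustering

Once the character features of all the training pages have been extracted, they need to be clustered to create the prototypes. The character shapes are clustered with the shapeclustering (3.02 and later), mftraining, and cntraining programs: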

shapeclustering -F font_properties -U unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr …

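shapeclustering builds a master shape table by shape clustering and writes it to a file named shapetable.

Note: if you do not run shapeclustering, mftraining still produces a shapetable. You must include this shapetable in your traineddata file, whether or not shapeclustering was run.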

mftraining -F font_properties -U unicharset -O lang.unicharset lang.fontname.exp0.tr lang.fontname.exp1.tr …

The -U file is the unicharset generated by unicharset_extractor above, and lang.unicharset is the output unicharset that will be given to combine_tessdata. mftraining writes two other data files: inttemp (the shape prototypes) and pffmtable (the number of expected features for each character). (A third file called Microfeat is also written, but it is not used.)

cntraining lang.fontname.exp0.tr lang.fontname.exp1.tr …

This produces the normproto data file (the character normalization sensitivity prototypes).
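The three clustering commands take the same list of .tr files, so they can be assembled in one place. A minimal sketch with a hypothetical helper name that uses only the options shown in this section:

```python
def clustering_commands(lang, tr_files, font_properties="font_properties", unicharset="unicharset"):
    """Build the shapeclustering, mftraining and cntraining argument lists."""
    tr_files = list(tr_files)
    common = ["-F", font_properties, "-U", unicharset]
    return [
        ["shapeclustering", *common, *tr_files],
        ["mftraining", *common, "-O", f"{lang}.unicharset", *tr_files],
        ["cntraining", *tr_files],
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` in order, so the run stops at the first tool that fails.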

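6.8 Dictionary data (optional)

Tesseract uses up to 8 dictionary files for each language. All of them are optional, and they help Tesseract decide the likelihood of different possible character combinations.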

Seven of the files are encoded as a Directed Acyclic Word Graph (DAWG), and the other is a plain UTF-8 text file:

Name	Type	Description
word-dawg	dawg	A dawg made from dictionary words from the language.
freq-dawg	dawg	A dawg made from the most frequent words which would have gone into word-dawg.
punc-dawg	dawg	A dawg made from punctuation patterns found around words. The "word" part is replaced by a single space.
number-dawg	dawg	A dawg made from tokens which originally contained digits. Each digit is replaced by a space character.
fixed-length-dawgs	dawg	Several dawgs of different fixed lengths, useful for languages like Chinese.
bigram-dawg	dawg	A dawg of word bigrams where the words are separated by a space and each digit is replaced by a ?.
unambig-dawg	dawg	TODO: Describe.
user-words	text	A list of extra words to add to the dictionary. Usually left empty to be added by users if they require it; see tesseract(1).

To build the DAWG dictionary files, you first need a word list for your language. A suitable dictionary can often be found in a spell checker (such as ispell, aspell, or hunspell); be careful about the license. Format the word list as a UTF-8 text file with one word per line. Split it into the sets you need, for example the frequent words and the remaining words, and then use wordlist2dawg to build the DAWG files:

wordlist2dawg frequent_words_list lang.freq-dawg lang.unicharset
wordlist2dawg words_list lang.word-dawg lang.unicharset

Note that a dictionary included in the combined traineddata file must not be empty.
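Splitting a corpus into frequent and remaining words can be done with a frequency count. The sketch below is one possible approach, not a prescribed one; the function names and the `top_n` cut-off are this example's own.

```python
from collections import Counter


def split_word_lists(words, top_n):
    """Split `words` into (most frequent top_n words, remaining words sorted)."""
    counts = Counter(words)
    # most_common() orders by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common()]
    return ranked[:top_n], sorted(ranked[top_n:])


def to_wordlist_text(words):
    """Format words one per line, with a newline after the last word."""
    return "".join(f"{word}\n" for word in words)
```

Writing the output of `to_wordlist_text` with `encoding="utf-8"` and `newline="\n"` gives files in the form the text input requirements ask for, ready to pass to wordlist2dawg.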

If you need example dictionary files, unpack an existing language file (such as eng.traineddata) with combine_tessdata and extract the word lists with dawg2wordlist.

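6.9 The last file (unicharambigs)

The unicharambigs file describes possible ambiguities between characters or sets of characters, and is usually written by hand. To understand the format, look at the following example: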

v1
2	' '	1	"	1
1	m	2	r n	0
3	i i i	1	m	0

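The first line is a version identifier. The remaining lines are tab-separated fields in the following order: the number of characters in the match source, the source characters (separated by spaces), the number of characters in the match target, the target characters (separated by spaces), and a type indicator.

The type indicator can have the following values: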

Value	Type
0	A non-mandatory substitution. This informs tesseract to consider the ambiguity as a hint to the segmentation search that it should continue working if replacement of 'source' with 'target' creates a dictionary word from a non-dictionary word. Dictionary words that can be turned to another dictionary word via the ambiguity will not be used to train the adaptive classifier.
1	A mandatory substitution. This informs tesseract to always replace the matched 'source' with the 'target' strings.

Example line	Explanation
2 ' ' 1 " 1	A double quote (") should be substituted whenever 2 consecutive single quotes (') are seen.
1 m 2 r n 0	The characters 'rn' may sometimes be recognized incorrectly as 'm'.
3 i i i 1 m 0	The character 'm' may sometimes be recognized incorrectly as the sequence 'iii'.

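Every character must be included in the unicharset; that is, all the characters used must be part of the language being trained.

3.03 supports a new, simpler format: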

v2
'' " 1
m rn 0
iii m 0

where 1 means mandatory and 0 means optional.
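The relationship between the two formats can be seen by converting one line. This sketch assumes the layouts are exactly as in the examples above (tab-separated fields with explicit counts in v1, space-separated strings in v2); the function name is hypothetical.

```python
def ambig_v1_to_v2(line: str) -> str:
    """Convert one v1 unicharambigs line to the simpler v2 layout."""
    n_src, src, n_tgt, tgt, kind = line.rstrip("\n").split("\t")
    src_chars = src.split(" ")
    tgt_chars = tgt.split(" ")
    # The counts in v1 must agree with the number of characters listed.
    if len(src_chars) != int(n_src) or len(tgt_chars) != int(n_tgt):
        raise ValueError(f"character count mismatch in line: {line!r}")
    return f"{''.join(src_chars)} {''.join(tgt_chars)} {kind}"
```

For instance, the v1 line with fields `1`, `m`, `2`, `r n`, `0` becomes `m rn 0`.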

The unicharambigs file is also optional.

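6.10 Putting it all together

That is all there is to it! All that remains is to collect the files (shapetable, normproto, inttemp, pffmtable) and rename them with a lang. prefix, where lang is the 3-letter code for your language, which you can look up at http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes. Then run combine_tessdata on them: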

combine_tessdata lang.

Note: don't forget the dot at the end!
The resulting lang.traineddata goes in your tessdata directory. Tesseract can then recognize text in your language:

tesseract image.tif output -l lang
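The renaming step that precedes combine_tessdata can be sketched as follows. The helper name is hypothetical; it handles only the four files named in this section and skips any that are absent.

```python
from pathlib import Path

TRAINING_OUTPUTS = ("shapetable", "normproto", "inttemp", "pffmtable")


def add_language_prefix(directory, lang):
    """Rename the clustering outputs in `directory` to lang.<name>; return the new names."""
    directory = Path(directory)
    renamed = []
    for name in TRAINING_OUTPUTS:
        source = directory / name
        if source.exists():
            target = directory / f"{lang}.{name}"
            source.rename(target)
            renamed.append(target.name)
    return renamed
```

After this, running `combine_tessdata lang.` in the same directory picks up every file that carries the prefix.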

For more options, see the combine_tessdata manual page or its source code.