Calling the tesseract API from Python in LSTM mode

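An earlier post covered how to call the tesseract API from Python; this one is about using tesseract's LSTM mode. tesseract 4.0 added an LSTM engine, and on the command line you enable it simply by adding the "--oem 1" argument. The pythonocr module, however, does not provide an init function that accepts an oem argument. Looking at the tesseract source, around line 257 of capi.cpp we find: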

TESS_API int TESS_CALL TessBaseAPIInit1(TessBaseAPI* handle, const char* datapath, const char* language, TessOcrEngineMode oem,
                                        char** configs, int configs_size)
{
    return handle->Init(datapath, language, oem, configs, configs_size, nullptr, nullptr, false);
}

TESS_API int TESS_CALL TessBaseAPIInit2(TessBaseAPI* handle, const char* datapath, const char* language, TessOcrEngineMode oem)
{
    return handle->Init(datapath, language, oem);
}

TESS_API int TESS_CALL TessBaseAPIInit3(TessBaseAPI* handle, const char* datapath, const char* language)
{
    return handle->Init(datapath, language);
}

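Of these, TessBaseAPIInit2() is the one we need. It is already exported from the tesseract.so file; we only have to declare it before we can use it. Open tesseract_raw.py in the pythonocr installation directory and go to line 148, where the declarations for Init1 and Init3 are. Adding a declaration for Init2 is all that is needed; after the change it looks like this: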

    g_libtesseract.TessBaseAPIInit1.argtypes = [
        ctypes.c_void_p,  # TessBaseAPI*
        ctypes.c_char_p,  # datapath
        ctypes.c_char_p,  # language
        ctypes.c_int,  # TessOcrEngineMode
        ctypes.POINTER(ctypes.c_char_p),  # configs
        ctypes.c_int,  # configs_size
    ]
    g_libtesseract.TessBaseAPIInit1.restype = ctypes.c_int

    # Newly added declaration for Init2
    g_libtesseract.TessBaseAPIInit2.argtypes = [
        ctypes.c_void_p,  # TessBaseAPI*
        ctypes.c_char_p,  # datapath
        ctypes.c_char_p,  # language
        ctypes.c_int,  # TessOcrEngineMode
    ]
    g_libtesseract.TessBaseAPIInit2.restype = ctypes.c_int

    g_libtesseract.TessBaseAPIInit3.argtypes = [
        ctypes.c_void_p,  # TessBaseAPI*
        ctypes.c_char_p,  # datapath
        ctypes.c_char_p,  # language
    ]
    g_libtesseract.TessBaseAPIInit3.restype = ctypes.c_int
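
If the installed tesseract library is an older build that does not export TessBaseAPIInit2, ctypes raises AttributeError as soon as the attribute is touched. A minimal sketch of a guarded declaration follows. It assumes a helper called declare_init2, which is a hypothetical name and not part of pythonocr, that receives the already loaded ctypes library object:

```python
import ctypes


def declare_init2(lib):
    """Declare TessBaseAPIInit2 on a loaded ctypes library object.

    Returns True if the symbol exists and was declared, False otherwise.
    declare_init2 is a hypothetical helper, not part of pythonocr.
    """
    # ctypes raises AttributeError for a missing symbol; getattr with a
    # default turns that into None instead.
    func = getattr(lib, "TessBaseAPIInit2", None)
    if func is None:
        return False
    func.argtypes = [
        ctypes.c_void_p,  # TessBaseAPI*
        ctypes.c_char_p,  # datapath
        ctypes.c_char_p,  # language
        ctypes.c_int,     # TessOcrEngineMode
    ]
    func.restype = ctypes.c_int
    return True
```

With a guard like this, the module could fall back to Init3 when the function returns False instead of failing at import time.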

Next, go to line 351, which is where pythonocr's init function is implemented, and change it to the following:

def init(lang=None, oem=0):
    # oem is a TessOcrEngineMode value: 0 = legacy engine only, 1 = LSTM only.
    # The default of 0 selects the legacy engine.
    assert g_libtesseract
    handle = g_libtesseract.TessBaseAPICreate()
    try:
        if lang:
            lang = lang.encode("utf-8")
        prefix = None
        if TESSDATA_PREFIX:
            prefix = TESSDATA_PREFIX.encode("utf-8")
        # Init2 takes the engine mode as its fourth argument
        g_libtesseract.TessBaseAPIInit2(
            ctypes.c_void_p(handle),
            ctypes.c_char_p(prefix),
            ctypes.c_char_p(lang),
            oem
        )
        g_libtesseract.TessBaseAPISetVariable(
            ctypes.c_void_p(handle),
            b"tessedit_zero_rejection",
            b"F"
        )
    except:
        # Free the handle before re-raising so it does not leak
        g_libtesseract.TessBaseAPIDelete(ctypes.c_void_p(handle))
        raise
    return handle
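
The integer passed as oem corresponds to the TessOcrEngineMode enum in tesseract 4. The sketch below gives those values readable names on the Python side. OcrEngineMode is a hypothetical class name, not something pythonocr defines, and the values are taken from my reading of the tesseract 4 headers, so check them against your installed version:

```python
from enum import IntEnum


class OcrEngineMode(IntEnum):
    """Python-side names for tesseract 4's TessOcrEngineMode values.

    Hypothetical helper class; pythonocr itself does not define it.
    """
    TESSERACT_ONLY = 0           # legacy engine only
    LSTM_ONLY = 1                # LSTM engine only, same as --oem 1
    TESSERACT_LSTM_COMBINED = 2  # legacy and LSTM engines together
    DEFAULT = 3                  # let tesseract choose based on the traineddata


# Convert to a plain int before handing the value to ctypes.
lstm_mode = int(OcrEngineMode.LSTM_ONLY)
```

A call such as tesseract_raw.init(lang='eng', oem=int(OcrEngineMode.LSTM_ONLY)) then reads more clearly than a bare 1.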

In the calling code, simply change the previous

handle = tesseract_raw.init(lang='eng')

to:

handle = tesseract_raw.init(lang='eng', oem=1)

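That is all. Download the latest tessdata package with LSTM support, and the recognition results improve greatly compared with before! How to use multiple languages when calling the API, as with -l eng+chi on the command line, is something I am still working out. If anyone knows, please let me know. Thanks!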
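
On the multi-language question, one thing that may be worth trying: the tesseract API documentation describes the language argument of Init as accepting several language names joined with "+", the same form that -l takes. If that holds for your build, the combined string could be passed straight through as lang. A minimal sketch, untested against pythonocr; join_languages is a hypothetical helper, and chi_sim is assumed to be the name of the installed Chinese traineddata:

```python
def join_languages(langs):
    """Join language codes into the 'a+b' form used by tesseract's -l option.

    Hypothetical helper; not part of pythonocr or tesseract.
    """
    # Drop empty entries and surrounding whitespace before joining.
    cleaned = [code.strip() for code in langs if code and code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


lang = join_languages(["eng", "chi_sim"])
print(lang)
```

The resulting string would then be passed as tesseract_raw.init(lang=lang, oem=1).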