python tesseract-ocr: basic CAPTCHA recognition (Windows)
I. Environment
Windows 7 x64
Python 3+
II. Installation
1. Install tesseract-ocr
http://digi.bib.uni-mannheim.de/tesseract/
2. Install pytesseract
pip install pytesseract
3. Install Pillow
pip install pillow
III. Usage
# -*- coding: utf-8 -*-
import pytesseract
from PIL import Image

# Tell pytesseract where the Tesseract executable is installed.
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'
# Point Tesseract at the folder that holds the *.traineddata files.
tessdata_dir_config = '--tessdata-dir "C:/Program Files (x86)/Tesseract-OCR/tessdata"'


def main():
    image = Image.open('code.png')
    # lang selects eng.traineddata; config passes the tessdata directory.
    code = pytesseract.image_to_string(image, lang='eng', config=tessdata_dir_config)
    print(code)


if __name__ == '__main__':
    main()
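The hard-coded executable path only works when Tesseract was installed to that exact folder. A minimal sketch of a lookup using only the standard library follows. The function name find_tesseract and the second candidate path are assumptions introduced for illustration; they are not part of pytesseract.

```python
import os
import shutil

# Hypothetical install locations; adjust them to wherever the installer put Tesseract.
DEFAULT_CANDIDATES = [
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
]


def find_tesseract(candidates=DEFAULT_CANDIDATES):
    """Return a path to the Tesseract executable, or None if none is found."""
    # First try the PATH, in case the installer registered the executable there.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise fall back to the explicit candidate locations, in order.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None
```

If a path is returned, it can be assigned to pytesseract.pytesseract.tesseract_cmd before calling image_to_string; if None comes back, Tesseract is probably not installed where expected.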
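IV. Lessons learned and pitfalls
1. Support on Windows is not great. With nothing more than import pytesseract, calls keep failing with a "file not found" error.
Cause: pytesseract cannot find the tesseract-ocr application from the installation step, so its location has to be set in the code:
pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'
2. image_to_string needs two extra arguments. My rough understanding:
lang='eng' makes Tesseract look for the eng.traineddata file in the tessdata folder at the path configured in tessdata_dir_config,
config= passes that path.
Different recognition libraries can be selected according to the *.traineddata files in the tessdata directory (I am not sure this is correct; it is only my rough understanding).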
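One way to check that understanding is to list which language codes are actually present in the tessdata folder. The sketch below uses only the standard library and assumes each language is stored as <code>.traineddata; the helper names available_languages and tessdata_config are hypothetical, not pytesseract functions.

```python
from pathlib import Path


def available_languages(tessdata_dir):
    """Return the sorted language codes found as <code>.traineddata files."""
    return sorted(p.stem for p in Path(tessdata_dir).glob("*.traineddata"))


def tessdata_config(tessdata_dir):
    """Build the string handed to image_to_string(config=...)."""
    # The double quotes keep a path containing spaces together as one argument.
    return '--tessdata-dir "{}"'.format(Path(tessdata_dir).as_posix())


# Intended use (requires pytesseract and an installed Tesseract):
# langs = available_languages("C:/Program Files (x86)/Tesseract-OCR/tessdata")
# code = pytesseract.image_to_string(image, lang=langs[0],
#                                    config=tessdata_config("C:/Program Files (x86)/Tesseract-OCR/tessdata"))
```

A value passed as lang should then be one of the codes returned by available_languages.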
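Error messages (the first appears when the Tesseract executable cannot be found, the second when lang and config are not passed):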
Traceback (most recent call last):
  File "D:\***\VerifyCodeTest\src\main.py", line 17, in <module>
    main()
  File "D:\***\VerifyCodeTest\src\main.py", line 11, in main
    code = pytesseract.image_to_string(image, lang = 'eng', config=tessdata_dir_config)
  File "C:\Users\*\AppData\Local\Programs\Python\Python36\lib\site-packages\pytesseract\pytesseract.py", line 193, in image_to_string
    return run_and_get_output(image, 'txt', lang, config, nice)
  File "C:\Users\*\AppData\Local\Programs\Python\Python36\lib\site-packages\pytesseract\pytesseract.py", line 140, in run_and_get_output
    run_tesseract(**kwargs)
  File "C:\Users\*\AppData\Local\Programs\Python\Python36\lib\site-packages\pytesseract\pytesseract.py", line 111, in run_tesseract
    proc = subprocess.Popen(command, stderr=subprocess.PIPE)
  File "C:\Users\*\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 707, in __init__
    restore_signals, start_new_session)
  File "C:\Users\*\AppData\Local\Programs\Python\Python36\lib\subprocess.py", line 990, in _execute_child
    startupinfo)
FileNotFoundError: [WinError 2] The system cannot find the file specified
Traceback (most recent call last):
  File "D:\***\VerifyCodeTest\src\main.py", line 17, in <module>
    main()
  File "D:\***\VerifyCodeTest\src\main.py", line 11, in main
    code = pytesseract.image_to_string(image)#, lang = 'eng', config=tessdata_dir_config)
  File "C:\Users\*\AppData\Local\Programs\Python\Python36\lib\site-packages\pytesseract\pytesseract.py", line 193, in image_to_string
    return run_and_get_output(image, 'txt', lang, config, nice)
  File "C:\Users\*\AppData\Local\Programs\Python\Python36\lib\site-packages\pytesseract\pytesseract.py", line 140, in run_and_get_output
    run_tesseract(**kwargs)
  File "C:\Users\*\AppData\Local\Programs\Python\Python36\lib\site-packages\pytesseract\pytesseract.py", line 116, in run_tesseract
    raise TesseractError(status_code, get_errors(error_string))
pytesseract.pytesseract.TesseractError: (1, 'Error opening data file \\Program Files (x86)\\Tesseract-OCR\\eng.traineddata Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory. Failed loading language \'eng\' Tesseract couldn\'t load any languages! Could not initialize tesseract.')
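Both failures come down to a file that is not where Tesseract expects it, so they can be caught before pytesseract is called at all. The following is a minimal sketch of such a check, using only the standard library; the function name preflight and its message wording are assumptions made for illustration.

```python
import os


def preflight(tesseract_cmd, tessdata_dir, lang="eng"):
    """Return a list of problems; an empty list means both files were found."""
    problems = []
    # Corresponds to the FileNotFoundError: the executable itself is missing.
    if not os.path.isfile(tesseract_cmd):
        problems.append("Tesseract executable not found: " + tesseract_cmd)
    # Corresponds to the TesseractError: <lang>.traineddata cannot be opened.
    data_file = os.path.join(tessdata_dir, lang + ".traineddata")
    if not os.path.isfile(data_file):
        problems.append("Language data not found: " + data_file)
    return problems


# Intended use, with the paths from the script above:
# for problem in preflight("C:/Program Files (x86)/Tesseract-OCR/tesseract.exe",
#                          "C:/Program Files (x86)/Tesseract-OCR/tessdata"):
#     print(problem)
```

An empty result does not guarantee recognition will succeed; it only rules out the two path problems shown in the tracebacks.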
Reference: https://blog.csdn.net/a349458532/article/details/51490291