Building a CNN Model to Crack a Website's CAPTCHA! Python Really Is Great!
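Project Overview
In the earlier article "CNN大戰驗證碼" (CNN vs. CAPTCHA), we used TensorFlow to build a simple CNN model that cracks the CAPTCHA of a certain website. The CAPTCHA looks like this:
The website's CAPTCHA
In this article, we use Keras to build a slightly more complex CNN model to crack the same CAPTCHA.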
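Dataset
The preprocessing of the CAPTCHA images is not described in detail here; interested readers can refer to the article "CNN大戰驗證碼".
In this project the dataset contains 1668 samples. Each sample is an image of a single character, 16 pixels wide and 20 pixels high. The features of a sample are the pixels of the character image, where 0 stands for white and 1 for black, so each sample has 320 features taking the value 0 or 1. The feature variables are named v1 to v320, and the class label of a sample is the character itself. Part of the dataset is shown below:
data.csv (partial)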
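To make the feature layout concrete, here is a minimal sketch of how one row of 320 values maps back to a character image of 20 rows by 16 columns. The row used below is synthetic (a vertical stroke) and merely stands in for a real row of data.csv. The helper name `row_to_image` is hypothetical, and the row-major ordering of v1 to v320 is an assumption that matches the reshape to (20, 16) used in the training code of this article.

```python
import numpy as np

def row_to_image(features, height=20, width=16):
    """Reshape a flat sequence of 0/1 pixel features (v1..v320) into a 2-D image."""
    arr = np.asarray(features, dtype=np.uint8)
    if arr.size != height * width:
        raise ValueError("expected %d features, got %d" % (height * width, arr.size))
    return arr.reshape(height, width)

# Synthetic example: a 2-pixel-wide vertical stroke in columns 7 and 8.
synthetic = np.zeros((20, 16), dtype=np.uint8)
synthetic[2:18, 7:9] = 1
flat_row = synthetic.ravel().tolist()   # what one CSV row of v1..v320 would hold

image = row_to_image(flat_row)
for line in image:
    # Print 1-pixels as '#' and 0-pixels as '.' for a quick visual check.
    print("".join("#" if pixel else "." for pixel in line))
```

Printing a few real rows this way is a quick check that the labels and the pixel data line up before any training is done.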
CNN Model
Keras makes it quick and convenient to build a CNN model. The CNN model built in this article is as follows:
Diagram of the CNN model
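The dataset is split into a training set and a test set at a ratio of 7:3 (test_size=0.3, giving 1167 training samples and 501 test samples). The code for training the model is as follows: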
# -*- coding: utf-8 -*-
# Written against the Keras 2.x API of the time
# (np_utils, keras.layers.core and the 'val_acc' metric name).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

from keras.utils import np_utils, plot_model
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.callbacks import EarlyStopping
from keras.layers import Conv2D, MaxPooling2D

# Read the data
df = pd.read_csv('F://verifycode_data/data.csv')

# Map each character label to an integer class
vals = range(31)
keys = ['1','2','3','4','5','6','7','8','9','A','B','C','D','E','F','G','H','J','K','L','N','P','Q','R','S','T','U','V','X','Y','Z']
label_dict = dict(zip(keys, vals))

x_data = df[['v'+str(i+1) for i in range(320)]]
y_data = pd.DataFrame({'label':df['label']})
y_data['class'] = y_data['label'].apply(lambda x: label_dict[x])

# Split the data into a training set and a test set
X_train, X_test, Y_train, Y_test = train_test_split(x_data, y_data['class'], test_size=0.3, random_state=42)
# -1 lets NumPy infer the sample count (1167 training and 501 test samples here)
x_train = np.array(X_train).reshape((-1, 20, 16, 1))
x_test = np.array(X_test).reshape((-1, 20, 16, 1))

# One-hot encode the labels
n_classes = 31
y_train = np_utils.to_categorical(Y_train, n_classes)
y_val = np_utils.to_categorical(Y_test, n_classes)

input_shape = x_train[0].shape

# CNN model
model = Sequential()

# Convolution and pooling layers
model.add(Conv2D(32, kernel_size=(3, 3), input_shape=input_shape, padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(32, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))

# Dropout layer
model.add(Dropout(0.25))

model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(0.25))

model.add(Conv2D(128, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(128, kernel_size=(3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2), padding='same'))
model.add(Dropout(0.25))

model.add(Flatten())

# Fully connected layers
model.add(Dense(256, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(128, activation='relu'))
model.add(Dense(n_classes, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Plot the model
plot_model(model, to_file=r'./model.png', show_shapes=True)

# Train the model
callbacks = [EarlyStopping(monitor='val_acc', patience=5, verbose=1)]
batch_size = 64
n_epochs = 100
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=n_epochs,
                    verbose=1, validation_data=(x_test, y_val), callbacks=callbacks)

mp = 'F://verifycode_data/verifycode_Keras.h5'
model.save(mp)

# Plot the accuracy curve on the validation (test) set
val_acc = history.history['val_acc']
plt.plot(range(len(val_acc)), val_acc, label='CNN model')
plt.title('Validation accuracy on verifycode dataset')
plt.xlabel('epochs')
plt.ylabel('accuracy')
plt.legend()
plt.show()
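It can help to trace how the feature maps shrink through the three convolution blocks of the model above. The sketch below does this arithmetic by hand rather than by querying Keras: it assumes that a 3x3 convolution with padding='same' keeps the spatial size, and that a 2x2 max-pooling layer with padding='same' and the default stride halves each dimension, rounding up. The helper name `same_pool` is hypothetical.

```python
import math

def same_pool(size, pool=2):
    """Output length of a 'same'-padded pooling layer whose stride equals the pool size."""
    return math.ceil(size / pool)

height, width = 20, 16           # input character image: 20 rows, 16 columns
filters = [32, 64, 128]          # filters of the three convolution blocks

shapes = []
for n_filters in filters:
    # The 3x3 convolutions with padding='same' leave height and width unchanged;
    # only the 2x2 max-pooling at the end of each block shrinks them.
    height, width = same_pool(height), same_pool(width)
    shapes.append((height, width, n_filters))

flattened = height * width * filters[-1]
print(shapes)      # feature-map shape after each block
print(flattened)   # number of inputs to the first Dense layer
```

Under these assumptions the last block yields a 3x2 map with 128 channels, so the Flatten layer passes 768 values to the first fully connected layer. The shapes drawn by plot_model with show_shapes=True can be used to confirm this.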
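In the code above, we use the early stopping technique when training the model. Early stopping is a callback that halts training before the maximum number of epochs is reached. Specifically, training stops once the monitored metric no longer improves (that is, improves by less than some threshold). Here the monitored metric is the accuracy on the held-out set (val_acc), and patience=5 means training stops after 5 consecutive epochs without improvement.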
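The stopping rule can be illustrated without Keras. The following is a minimal sketch of patience-based early stopping on a metric that should increase, such as validation accuracy. It is an illustration of the idea, not the Keras implementation; the function name `stopping_epoch`, the `min_delta` parameter as used here, and the accuracy values are all made up for the example.

```python
def stopping_epoch(val_acc_history, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None if it never stops.

    Training stops after `patience` consecutive epochs in which the metric
    fails to exceed the best value seen so far by more than `min_delta`.
    """
    best = float("-inf")
    wait = 0
    for epoch, acc in enumerate(val_acc_history, start=1):
        if acc - min_delta > best:
            best = acc      # new best value: reset the patience counter
            wait = 0
        else:
            wait += 1       # no improvement in this epoch
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation accuracies: the best value (0.90) is first reached in epoch 3.
history = [0.50, 0.70, 0.90, 0.90, 0.89, 0.90, 0.88, 0.90]
print(stopping_epoch(history, patience=5))
```

In this made-up run the metric never exceeds 0.90 after epoch 3, so five epochs without improvement have accumulated by epoch 8 and training stops there.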
Model Training
Running the training code above produces the following output:
......(earlier output omitted)
Epoch 22/100
  64/1167 [>.............................] - ETA: 3s - loss: 0.0399 - acc: 1.0000
 128/1167 [==>...........................] - ETA: 3s - loss: 0.1195 - acc: 0.9844
 192/1167 [===>..........................] - ETA: 2s - loss: 0.1085 - acc: 0.9792
 256/1167 [=====>........................] - ETA: 2s - loss: 0.1132 - acc: 0.9727
 320/1167 [=======>......................] - ETA: 2s - loss: 0.1045 - acc: 0.9750
 384/1167 [========>.....................] - ETA: 2s - loss: 0.1006 - acc: 0.9740
 448/1167 [==========>...................] - ETA: 2s - loss: 0.1522 - acc: 0.9643
 512/1167 [============>.................] - ETA: 1s - loss: 0.1450 - acc: 0.9648
 576/1167 [=============>................] - ETA: 1s - loss: 0.1368 - acc: 0.9653
 640/1167 [===============>..............] - ETA: 1s - loss: 0.1353 - acc: 0.9641
 704/1167 [=================>............] - ETA: 1s - loss: 0.1280 - acc: 0.9659
 768/1167 [==================>...........] - ETA: 1s - loss: 0.1243 - acc: 0.9674
 832/1167 [====================>.........] - ETA: 0s - loss: 0.1577 - acc: 0.9639
 896/1167 [======================>.......] - ETA: 0s - loss: 0.1488 - acc: 0.9665
 960/1167 [=======================>......] - ETA: 0s - loss: 0.1488 - acc: 0.9656
1024/1167 [=========================>....] - ETA: 0s - loss: 0.1427 - acc: 0.9668
1088/1167 [==========================>...] - ETA: 0s - loss: 0.1435 - acc: 0.9669
1152/1167 [============================>.] - ETA: 0s - loss: 0.1383 - acc: 0.9688
1167/1167 [==============================] - 4s 3ms/step - loss: 0.1380 - acc: 0.9683 - val_loss: 0.0835 - val_acc: 0.9760
Epoch 00022: early stopping
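As the log shows, training stopped after 22 epochs. After the final epoch, the accuracy was 96.83% on the training set (acc) and 97.60% on the test set (val_acc). The accuracy curve on the test set is shown below:
Accuracy curve on the test set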
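The accuracy above is measured per character, while a CAPTCHA in this project contains four characters and counts as solved only if all four are right. As a rough back-of-the-envelope estimate, the sketch below takes the val_acc of 0.9760 from the final log line and assumes that the four character predictions are independent and that character segmentation is perfect. Both are simplifying assumptions for illustration, not results from the experiment, and the function name `captcha_accuracy` is hypothetical.

```python
def captcha_accuracy(char_accuracy, n_chars=4):
    """Probability that all n_chars characters are right, assuming independence."""
    return char_accuracy ** n_chars

estimate = captcha_accuracy(0.9760)
print("%.1f%%" % (estimate * 100))
```

Under these assumptions the whole-CAPTCHA accuracy would be expected to land slightly above 90%. Segmentation errors on real images can push the actual figure in either direction.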
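Model Prediction
Once the model is trained, we use it to predict new CAPTCHAs. The 100 new CAPTCHAs are shown below:
New CAPTCHAs (partial)
The Python code that uses the trained CNN model to predict these new CAPTCHAs is as follows: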
# -*- coding: utf-8 -*-
import os
import cv2
import numpy as np

def split_picture(imagepath):
    # Read the image in grayscale mode
    gray = cv2.imread(imagepath, 0)

    # Turn the border of the image white
    height, width = gray.shape
    for i in range(width):
        gray[0, i] = 255
        gray[height-1, i] = 255
    for j in range(height):
        gray[j, 0] = 255
        gray[j, width-1] = 255

    # Median filter
    blur = cv2.medianBlur(gray, 3)  # 3*3 kernel

    # Binarize
    ret, thresh1 = cv2.threshold(blur, 200, 255, cv2.THRESH_BINARY)

    # Extract the individual characters
    chars_list = []
    # findContours returns 3 values in OpenCV 3 and 2 values in OpenCV 4;
    # taking the last two works with both versions.
    contours, hierarchy = cv2.findContours(thresh1, 2, 2)[-2:]
    for cnt in contours:
        # Bounding rectangle of the contour
        x, y, w, h = cv2.boundingRect(cnt)
        if x != 0 and y != 0 and w*h >= 100:
            chars_list.append((x,y,w,h))

    sorted_chars_list = sorted(chars_list, key=lambda x:x[0])
    for i,item in enumerate(sorted_chars_list):
        x, y, w, h = item
        cv2.imwrite('F://test_verifycode/chars/%d.jpg'%(i+1), thresh1[y:y+h, x:x+w])

def remove_edge_picture(imagepath):
    # Delete an image if at least three of its corners are dark (treated as noise)
    image = cv2.imread(imagepath, 0)
    height, width = image.shape
    corner_list = [image[0,0] < 127,
                   image[height-1, 0] < 127,
                   image[0, width-1] < 127,
                   image[height-1, width-1] < 127
                   ]
    if sum(corner_list) >= 3:
        os.remove(imagepath)

def resplit_with_parts(imagepath, parts):
    image = cv2.imread(imagepath, 0)
    os.remove(imagepath)
    height, width = image.shape

    file_name = imagepath.split('/')[-1].split(r'.')[0]
    # Split the image again into `parts` pieces
    step = width//parts     # step size
    start = 0               # starting position
    for i in range(parts):
        cv2.imwrite('F://test_verifycode/chars/%s.jpg'%(file_name+'-'+str(i)),
                    image[:, start:start+step])
        start += step

def resplit(imagepath):
    image = cv2.imread(imagepath, 0)
    height, width = image.shape

    if width >= 64:
        resplit_with_parts(imagepath, 4)
    elif width >= 48:
        resplit_with_parts(imagepath, 3)
    elif width >= 26:
        resplit_with_parts(imagepath, 2)

# Convert the image to 16*20 size (the file name is kept)
def convert(dir, file):
    imagepath = dir+'/'+file
    # Read the image
    image = cv2.imread(imagepath, 0)
    # Binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
    # Save the image
    cv2.imwrite('%s/%s' % (dir, file), img)

# Read the image data and convert it to 0-1 values
def Read_Data(dir, file):
    imagepath = dir+'/'+file
    # Read the image
    image = cv2.imread(imagepath, 0)
    # Binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # Flatten to a list of 0/1 values
    bin_values = [1 if pixel==255 else 0 for pixel in thresh.ravel()]
    return bin_values

def predict(VerifyCodePath):
    dir = 'F://test_verifycode/chars'
    files = os.listdir(dir)

    # Remove the existing files
    if files:
        for file in files:
            os.remove(dir + '/' + file)

    split_picture(VerifyCodePath)

    files = os.listdir(dir)
    if not files:
        print('The target folder is empty!')
    else:
        # Remove noise images
        for file in files:
            remove_edge_picture(dir + '/' + file)

        # Re-split images that contain touching characters
        for file in os.listdir(dir):
            resplit(dir + '/' + file)

        # Resize all images to 16*20
        for file in os.listdir(dir):
            convert(dir, file)

    # Feature vectors of the characters in the image
    files = sorted(os.listdir(dir), key=lambda x: x[0])
    table = np.array([Read_Data(dir, file) for file in files]).reshape(-1,20,16,1)

    # Path of the saved model
    mp = 'F://verifycode_data/verifycode_Keras.h5'
    # Load the model
    from keras.models import load_model
    cnn = load_model(mp)
    # Predict
    y_pred = cnn.predict(table)
    predictions = np.argmax(y_pred, axis=1)

    # Label dictionary
    keys = range(31)
    vals = ['1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'J', 'K', 'L', 'N',
            'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'X', 'Y', 'Z']
    label_dict = dict(zip(keys, vals))

    return ''.join([label_dict[pred] for pred in predictions])

def main():
    dir = 'F://VerifyCode/'
    correct = 0
    for i, file in enumerate(os.listdir(dir)):
        true_label = file.split('.')[0]
        VerifyCodePath = dir+file
        pred = predict(VerifyCodePath)

        if true_label == pred:
            correct += 1
        print(i+1, (true_label, pred), true_label == pred, correct)

    total = len(os.listdir(dir))
    print('Total images: %d, correctly recognized: %d, accuracy: %.2f%%.'
          %(total, correct, correct*100/total))

main()
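The re-splitting step in the code above handles characters that stick together: a segment that is too wide is cut into equal-width strips, with the number of strips chosen from the width thresholds 26, 48 and 64. The sketch below reproduces that logic on an in-memory NumPy array instead of image files so it can be run without OpenCV. The function names `parts_for_width` and `split_equal` are hypothetical, and the input is a synthetic blob rather than a real CAPTCHA segment.

```python
import numpy as np

def parts_for_width(width):
    """Number of characters assumed to be stuck together, chosen by segment width."""
    if width >= 64:
        return 4
    if width >= 48:
        return 3
    if width >= 26:
        return 2
    return 1

def split_equal(image, parts):
    """Cut a 2-D image into `parts` vertical strips of equal width."""
    step = image.shape[1] // parts
    return [image[:, i * step:(i + 1) * step] for i in range(parts)]

# Synthetic 20x50 blob standing in for three touching characters.
blob = np.ones((20, 50), dtype=np.uint8)
pieces = split_equal(blob, parts_for_width(blob.shape[1]))
print([piece.shape for piece in pieces])
```

As in the original code, integer division means that any leftover columns on the right edge are dropped (2 of the 50 columns in this example). Equal-width cutting is a simple heuristic and can cut through a character when the touching characters differ in width.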
The prediction results of the CNN model are as follows:
Using TensorFlow backend.
2018-10-25 15:13:50.390130: I C:\tf_jenkins\workspace\rel-win\M\windows\PY\36\tensorflow\core\platform\cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
1 ('ZK6N', 'ZK6N') True 1
2 ('4JPX', '4JPX') True 2
3 ('5GP5', '5GP5') True 3
4 ('5RQ8', '5RQ8') True 4
5 ('5TQP', '5TQP') True 5
6 ('7S62', '7S62') True 6
7 ('8R2Z', '8R2Z') True 7
8 ('8RFV', '8RFV') True 8
9 ('9BBT', '9BBT') True 9
10 ('9LNE', '9LNE') True 10
11 ('67UH', '67UH') True 11
12 ('74UK', '74UK') True 12
13 ('A5T2', 'A5T2') True 13
14 ('AHYV', 'AHYV') True 14
15 ('ASEY', 'ASEY') True 15
16 ('B371', 'B371') True 16
17 ('CCQL', 'CCQL') True 17
18 ('CFD5', 'GFD5') False 17
19 ('CJLJ', 'CJLJ') True 18
20 ('D4QV', 'D4QV') True 19
21 ('DFQ8', 'DFQ8') True 20
22 ('DP18', 'DP18') True 21
23 ('E3HC', 'E3HC') True 22
24 ('E8VB', 'E8VB') True 23
25 ('DE1U', 'DE1U') True 24
26 ('FK1R', 'FK1R') True 25
27 ('FK91', 'FK91') True 26
28 ('FSKP', 'FSKP') True 27
29 ('FVZP', 'FVZP') True 28
30 ('GC6H', 'GC6H') True 29
31 ('GH62', 'GH62') True 30
32 ('H9FQ', 'H9FQ') True 31
33 ('H67Q', 'H67Q') True 32
34 ('HEKC', 'HEKC') True 33
35 ('HV2B', 'HV2B') True 34
36 ('J65Z', 'J65Z') True 35
37 ('JZCX', 'JZCX') True 36
38 ('KH5D', 'KH5D') True 37
39 ('KXD2', 'KXD2') True 38
40 ('1GDH', '1GDH') True 39
41 ('LCL3', 'LCL3') True 40
42 ('LNZR', 'LNZR') True 41
43 ('LZU5', 'LZU5') True 42
44 ('N5AK', 'N5AK') True 43
45 ('N5Q3', 'N5Q3') True 44
46 ('N96Z', 'N96Z') True 45
47 ('NCDG', 'NCDG') True 46
48 ('NELS', 'NELS') True 47
49 ('P96U', 'P96U') True 48
50 ('PD42', 'PD42') True 49
51 ('PECG', 'PEQG') False 49
52 ('PPZF', 'PPZF') True 50
53 ('PUUL', 'PUUL') True 51
54 ('Q2DN', 'D2DN') False 51
55 ('QCQ9', 'QCQ9') True 52
56 ('QDB1', 'QDBJ') False 52
57 ('QZUD', 'QZUD') True 53
58 ('R3T5', 'R3T5') True 54
59 ('S1YT', 'S1YT') True 55
60 ('SP7L', 'SP7L') True 56
61 ('SR2K', 'SR2K') True 57
62 ('SUP5', 'SVP5') False 57
63 ('T2SP', 'T2SP') True 58
64 ('U6V9', 'U6V9') True 59
65 ('UC9P', 'UC9P') True 60
66 ('UFYD', 'UFYD') True 61
67 ('V9NJ', 'V9NH') False 61
68 ('V35X', 'V35X') True 62
69 ('V98F', 'V98F') True 63
70 ('VD28', 'VD28') True 64
71 ('YGHE', 'YGHE') True 65
72 ('YNKD', 'YNKD') True 66
73 ('YVXV', 'YVXV') True 67
74 ('ZFBS', 'ZFBS') True 68
75 ('ET6X', 'ET6X') True 69
76 ('TKVC', 'TKVC') True 70
77 ('2UCU', '2UCU') True 71
78 ('HNBK', 'HNBK') True 72
79 ('X8FD', 'X8FD') True 73
80 ('ZGNX', 'ZGNX') True 74
81 ('LQCU', 'LQCU') True 75
82 ('JNZY', 'JNZVY') False 75
83 ('RX34', 'RX34') True 76
84 ('811E', '811E') True 77
85 ('ETDX', 'ETDX') True 78
86 ('4CPR', '4CPR') True 79
87 ('FE91', 'FE91') True 80
88 ('B7XH', 'B7XH') True 81
89 ('1RUA', '1RUA') True 82
90 ('UBCX', 'UBCX') True 83
91 ('KVT5', 'KVT5') True 84
92 ('HZ3A', 'HZ3A') True 85
93 ('3XLR', '3XLR') True 86
94 ('VC7T', 'VC7T') True 87
95 ('7PG1', '7PQ1') False 87
96 ('4F21', '4F21') True 88
97 ('3HLJ', '3HLJ') True 89
98 ('1KT7', '1KT7') True 90
99 ('1RHE', '1RHE') True 91
100 ('1TTA', '1TTA') True 92
Total images: 100, correctly recognized: 92, accuracy: 92.00%.
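As the output shows, the trained CNN model recognizes new CAPTCHAs with an accuracy above 90%: 92 of the 100 images were predicted correctly.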
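To see where the remaining errors come from, the misrecognized CAPTCHAs can be broken down character by character. The sketch below uses the eight (true label, prediction) pairs marked False in the output above; the tallying approach itself is just one possible way to inspect them, not part of the original project.

```python
from collections import Counter

# The eight failed predictions, copied from the prediction output: (true label, prediction).
failures = [
    ('CFD5', 'GFD5'), ('PECG', 'PEQG'), ('Q2DN', 'D2DN'), ('QDB1', 'QDBJ'),
    ('SUP5', 'SVP5'), ('V9NJ', 'V9NH'), ('JNZY', 'JNZVY'), ('7PG1', '7PQ1'),
]

confusions = Counter()       # (true char, predicted char) -> count
length_mismatches = []       # predictions with the wrong number of characters

for true_label, pred in failures:
    if len(true_label) != len(pred):
        # A different length cannot be compared position by position.
        length_mismatches.append((true_label, pred))
        continue
    for true_char, pred_char in zip(true_label, pred):
        if true_char != pred_char:
            confusions[(true_char, pred_char)] += 1

for (true_char, pred_char), count in sorted(confusions.items()):
    print('%s -> %s: %d' % (true_char, pred_char, count))
print('length mismatches:', length_mismatches)
```

Seven of the eight failures are single-character confusions between visually similar shapes, such as C read as G or Q, and U read as V. The remaining one, JNZY predicted as JNZVY, has an extra character, which likely points to the segmentation step rather than to the classifier.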
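Summary
In the article "CNN大戰驗證碼", the author built the CNN model with TensorFlow; the code was fairly long and training took more than two hours. Building the model with Keras gives concise code, and with early stopping the training time is shortened while the accuracy of the model is maintained. This shows where the strengths of Keras lie.
The project is open source. Its GitHub address is: https://github.com/percent4/CNN_4_Verifycode.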