Recognizing CAPTCHA Digits with the kNN Algorithm

阿新 • Published: 2019-01-15

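1. First, write a crawler that downloads images, so that a large number of CAPTCHA samples can be collected. The script below is written in Python.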

# coding=utf-8
# Python 3
import re
import socket
import time
import urllib.error
import urllib.request

# USER_AGENT = 'Mozilla/5.0 (X11;Ubuntu;Linux x86_64;rv:40.0)Gecko/20100101 Firefox/40.0'
# HEADERS = 'User-agent:' + USER_AGENT
# print(HEADERS)

MAX_RETRIES = 5


def getHtml(url):
    # Fetch the page and decode it so the regex in getImg() can work on text.
    page = urllib.request.urlopen(url)
    html = page.read().decode('utf-8', errors='ignore')
    return html


def Schedule(a, b, c):
    '''
    a: number of blocks downloaded so far
    b: size of each block
    c: size of the remote file
    '''
    if c <= 0:
        # Size unknown (the server sent no Content-Length); nothing to report.
        return
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    print('%.2f%%' % per)


def auto_down(url, filename, Schedule):
    # Try the download up to MAX_RETRIES times, then give up on this image.
    for x in range(MAX_RETRIES):
        try:
            urllib.request.urlretrieve(url, filename, Schedule)
            return True
        except urllib.error.ContentTooShortError:
            print('Network conditions are not good. Reloading...%s' % x)
        except socket.error:
            print('Socket error. Reloading...%s' % x)
    print('Download failed. Connecting to next image.')
    return False


def getImg(html):
    reg = r'src="(.+?\.jpg)" pic_ext'
    imgre = re.compile(reg)
    imglist = imgre.findall(html)
    success_num = 0
    failed_num = 0
    for x, imgurl in enumerate(imglist, 1):
        print('=============Image No.%s=============' % x)
        rst = auto_down(imgurl, 'C:/Joe_Workspace/reptile_workspace/jpg/%s.jpg' % x, Schedule)
        if rst:
            success_num += 1
            print('Image No.%s download finish.' % x)
        else:
            failed_num += 1
        # time.sleep(2)
    total = success_num + failed_num
    if failed_num == 0:
        print('[result] All %s images have been downloaded successfully.' % success_num)
    else:
        print('[result] %s/%s images have been downloaded successfully.' % (success_num, total))
    return imglist


html = getHtml("http://tieba.baidu.com/p/2460150866")

print(getImg(html))

Result: [screenshot of the download log; image not preserved]
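To make the extraction step concrete, here is a minimal sketch of how the pattern used in getImg picks image URLs out of markup. The HTML snippet and the example.com URLs are invented for illustration, and the helper name extract_image_urls is not part of the original script; the real page's markup may differ.

```python
import re

# Same pattern as in getImg(): capture the src value of a .jpg image
# whose tag continues with a pic_ext attribute.
IMG_PATTERN = re.compile(r'src="(.+?\.jpg)" pic_ext')


def extract_image_urls(html):
    """Return every .jpg URL matched by IMG_PATTERN, in page order."""
    return IMG_PATTERN.findall(html)


# Hypothetical markup, one tag per line. "." does not match a newline,
# so a match can never run from one line into the next.
sample_html = "\n".join([
    '<img class="BDE_Image" src="http://example.com/a.jpg" pic_ext="jpeg">',
    '<img src="http://example.com/logo.png">',
    '<img class="BDE_Image" src="http://example.com/b.jpg" pic_ext="jpeg">',
])

urls = extract_image_urls(sample_html)
print(urls)
```

The lazy `.+?` stops at the first `.jpg" pic_ext` it can reach, so if several tags share a line, a match that starts at a non-matching `src` can swallow the tags that follow it. An HTML parser would likely be more robust than a regex for pages with less regular markup.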

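2. CAPTCHAs downloaded from the web vary widely in color and shape. To make the effect of machine learning easier to see, we first generate simple CAPTCHAs with a program. The generator below is written in Java.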

import java.awt.Color; 
import java.awt.Font; 
import java.awt.Graphics2D; 
import java.awt.image.BufferedImage; 
import java.io.File; 
import java.io.FileNotFoundException; 
import java.io.FileOutputStream; 
import java.io.IOException; 
import java.io.OutputStream; 
import java.util.Random; 

import javax.imageio.ImageIO; 

//import org.junit.Test; 

/**
 * Generates 32x32 single-digit images, one per digit (0-9) and font.
 *
 * @author Administrator
 */
public class VerificationCode {
    private int w = 32;
    private int h = 32;
    private String text;
    private Random r = new Random();
    public static String[] fontNames_all = {"Arial","BatangChe","Bell MT","Arial Narrow","Arial Rounded MT Bold","Bookman Old Style","Bookshelf Symbil 7","Calbri Light","Calibri","Arial Black","Batang","Bodoni MT Black"};
    Color bgColor = new Color(255, 255, 255);
    // Holds the single font used for the image currently being generated.
    public static String[] fontNames = new String[1];
    // Characters to draw from; main() sets this to one digit at a time.
    public static String codes = "";

    public static void main(String[] args) {
        // One image per (digit, font) pair, saved as <digit>_<fontIndex>.jpg.
        for (int num = 0; num < 10; num++) {
            for (int i = 0; i < fontNames_all.length; i++) {
                fontNames[0] = fontNames_all[i];
                codes = "" + num;
                test_fun(num, i);
            }
        }
    }
    /**
     * Renders one image for the current digit and font and writes it to disk.
     */
    //@Test
    public static void test_fun(int num, int i) {
        VerificationCode vc = new VerificationCode();
        BufferedImage image = vc.getImage();
        // try-with-resources closes the stream even if writing fails.
        try (OutputStream out = new FileOutputStream(new File(
                "C:\\Joe_Workspace\\image\\" + num + "_" + i + ".jpg"))) {
            VerificationCode.output(image, out);
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.println(vc.getText());
    }

    /**
     * Creates a blank image and draws a single character on it.
     */
    public BufferedImage getImage() { 
        BufferedImage image = createImage();  
        Graphics2D g2 = (Graphics2D) image.getGraphics();  
        StringBuilder sb = new StringBuilder(); 

        for (int i = 0; i < 1; ++i) {
            String s = randomChar() + ""; 
            sb.append(s); 
            float x = i * 1.0F * w / 4 +9; 
            g2.setFont(randomFont()); 
            g2.setColor(randomColor()); 
            g2.drawString(s, x, h - 7); 
        } 

        this.text = sb.toString();
        //drawLine(image); 
        return image; 

    } 

    /**
     * @return the text drawn on the most recently generated image
     */
    public String getText() { 
        return text; 
    } 

    /**
     * Writes the image to the stream in JPEG format.
     *
     * @param image the image to encode
     * @param out   the destination stream
     */
    public static void output(BufferedImage image, OutputStream out) { 
        try { 
            ImageIO.write(image, "jpeg", out); 
        } catch (IOException e) { 
            e.printStackTrace(); 
        } 
    } 

    private void drawLine(BufferedImage image) { 
        Graphics2D g2 = (Graphics2D) image.getGraphics(); 
        for (int i = 0; i < 3; ++i) {
            int x1 = r.nextInt(w); 
            int y1 = r.nextInt(h); 
            int x2 = r.nextInt(w); 
            int y2 = r.nextInt(h); 
            g2.setColor(Color.BLUE); 
            g2.drawLine(x1, y1, x2, y2); 
        } 
    } 

    // Always draws in pure black so the digit is easy to binarize later.
    private Color randomColor() {
        return Color.BLACK;
    }

    private Font randomFont() { 
        int index = r.nextInt(fontNames.length); 
        String fontName = fontNames[index]; 
        int style = r.nextInt(4); 
        int size = r.nextInt(5) + 24; 
        return new Font(fontName, style, size); 
    } 

    private char randomChar() { 
        int index = r.nextInt(codes.length()); 
        return codes.charAt(index); 
    } 

    private BufferedImage createImage() { 
        BufferedImage image = new BufferedImage(w, h, 
                BufferedImage.TYPE_INT_RGB); 
        Graphics2D g2 = (Graphics2D) image.getGraphics(); 
        g2.setColor(this.bgColor); 
        g2.fillRect(0, 0, w, h); 

        return image; 
    } 

}
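The generator's main loop writes one image per digit and font, named `<digit>_<fontIndex>.jpg`, and the digit before the underscore is what the later steps read back as the label. As a rough sketch (in Python, to match the other steps; expected_filenames is a name introduced here and is not part of the Java program), the resulting file set can be enumerated like this:

```python
def expected_filenames(num_fonts=12, digits=range(10)):
    """List the names the generator writes: one per (digit, font index) pair."""
    return ["%d_%d.jpg" % (num, i) for num in digits for i in range(num_fonts)]


names = expected_filenames()
print(len(names))             # 10 digits x 12 fonts
print(names[0], names[-1])
```

With the 12 entries in fontNames_all this gives 120 images, which is a small training set; adding fonts to the array would enlarge it without other changes.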

3. Next, convert each generated CAPTCHA image into a matrix of characters and save it as a txt file. The script below is written in Python.

from PIL import Image
import os
from os import listdir


def img2txt_func(img_path_1, txt_path_1):
    im = Image.open(img_path_1).convert('RGB')
    width, height = im.size

    # Write one text line per pixel row: "1" for black, "0" for anything else.
    with open(txt_path_1, 'w') as fh:
        for y in range(height):
            for x in range(width):
                cl = im.getpixel((x, y))
                clall = cl[0] + cl[1] + cl[2]
                if clall == 0:  # black
                    fh.write("1")
                else:
                    fh.write("0")
            fh.write("\n")


img_path = "c:/Joe_Workspace/images"
txt_path = "c:/Joe_Workspace/traindata"
imgs = listdir(img_path)
for img in imgs:
    # print(img_path + "/" + os.path.basename(img))
    # print(txt_path + "/" + os.path.splitext(img)[0] + ".txt")
    img2txt_func(img_path + "/" + os.path.basename(img), txt_path + "/" + os.path.splitext(img)[0] + ".txt")
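The script above counts a pixel as ink only when it is exactly black (r + g + b equal to 0). Because JPEG compression is lossy, stroke pixels can come back slightly off black, so a threshold may be more forgiving. The following is a minimal sketch of that idea on plain lists of RGB tuples, so it runs without Pillow; the function name binarize_rows and the cut-off of 384 are assumptions for illustration, not values from the original.

```python
def binarize_rows(pixels, threshold=384):
    """Turn rows of (r, g, b) tuples into strings of "1" (dark) and "0" (light).

    threshold is an assumed cut-off on r + g + b (range 0..765); 384 is
    roughly mid-grey and would need tuning on real images.
    """
    lines = []
    for row in pixels:
        lines.append("".join("1" if sum(rgb) < threshold else "0" for rgb in row))
    return lines


# A made-up 2x3 image: near-black ink, white background, light-grey noise.
sample = [
    [(3, 2, 4), (255, 255, 255), (250, 251, 249)],
    [(255, 255, 255), (10, 12, 9), (200, 200, 200)],
]
print("\n".join(binarize_rows(sample)))
```

In the real script the same comparison would replace the `clall == 0` test. Saving the generated images as PNG instead of JPEG would be another way to keep the strokes exactly black.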

4. Finally, write a kNN program that learns to recognize the digits. The script below is written in Python.

import operator
import os
from os import listdir

import numpy as np


def knn(k, testdata, traindata, labels):
    traindatasize = traindata.shape[0]
    # Difference between the test vector and every training vector.
    dif = np.tile(testdata, (traindatasize, 1)) - traindata
    # Euclidean distance: square, sum along each row, take the square root.
    sqdif = dif ** 2
    sumsqdif = sqdif.sum(axis=1)
    distance = np.sqrt(sumsqdif)
    # Indices of the training samples, nearest first.
    sortdistance = distance.argsort()
    # Majority vote among the k nearest neighbours.
    count = {}
    for i in range(k):
        vote = labels[sortdistance[i]]
        count[vote] = count.get(vote, 0) + 1
    sortcount = sorted(count.items(), key=operator.itemgetter(1), reverse=True)
    return sortcount[0][0]


def data2array(fname):
    # Flatten a 32x32 text matrix of "0"/"1" characters into a 1024-element list.
    arr = []
    with open(fname) as fh:
        for i in range(32):
            thisline = fh.readline()
            for j in range(32):
                arr.append(int(thisline[j]))
    return arr


def seplabel(fname):
    # "3_5.txt" -> 3: the digit before the underscore is the label.
    filestr = fname.split(".")[0]
    label = int(filestr.split("_")[0])
    return label


def traindata(train_data_path):
    # Load every training file into one row of a (num, 1024) array.
    labels = []
    trainfile = listdir(train_data_path)
    num = len(trainfile)
    trainarr = np.zeros((num, 1024))
    for i in range(num):
        thisname = trainfile[i]
        thislabel = seplabel(thisname)
        labels.append(thislabel)
        trainarr[i, :] = data2array(train_data_path + thisname)
    return trainarr, labels


test_data_path = "c:/Joe_Workspace/testdata/"
train_data_path = "c:/Joe_Workspace/traindata/"
trainarr, labels = traindata(train_data_path)
testfiles = listdir(test_data_path)
pass_count = 0
for thistestfile in testfiles:
    testarr = data2array(test_data_path + os.path.basename(thistestfile))
    rst = knn(5, testarr, trainarr, labels)
    print(os.path.basename(thistestfile) + ":" + str(seplabel(thistestfile)) + "?=" + str(rst))
    if str(seplabel(thistestfile)) == str(rst):
        pass_count += 1
print("[Pass Rate:] %.2f%%" % ((pass_count * 100) / len(testfiles)))
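To see the voting logic in isolation, here is a small sketch of kNN using only the standard library, run on toy 4-element vectors instead of 1024-pixel images. The data, the labels, and the name knn_predict are invented for illustration; the distance and majority-vote steps follow the same idea as the knn function above.

```python
import math
from collections import Counter


def knn_predict(k, test_vec, train_vecs, labels):
    """Classify test_vec by majority vote among its k nearest training vectors."""
    distances = []
    for vec, label in zip(train_vecs, labels):
        # Euclidean distance between the test vector and one training vector.
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(test_vec, vec)))
        distances.append((d, label))
    distances.sort(key=lambda pair: pair[0])
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy 4-pixel "images", invented for illustration.
train_vecs = [
    [1, 1, 0, 0], [1, 1, 1, 0], [1, 0, 0, 0],   # label 7
    [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1],   # label 3
]
labels = [7, 7, 7, 3, 3, 3]

print(knn_predict(3, [1, 0, 1, 0], train_vecs, labels))
print(knn_predict(3, [0, 1, 0, 1], train_vecs, labels))
```

Each test vector is at distance 1 from two samples of one class, so that class holds at least two of the three votes however the remaining tie is broken. On the real data, k (5 in the script above) trades noise tolerance against blurring the boundary between similar-looking digits.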
