Crawling a Captcha-Protected Login Site with the Scrapy Framework

Crawling the 91pron site with Scrapy

**Disclaimer: this project exists only for learning the Scrapy crawler framework and the MongoDB database; it must not be used for commercial or other personal purposes. Any misuse is the individual's own responsibility.**

First, install the various packages the Scrapy framework needs, and then we can get started!
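A minimal install, assuming pip is available (pymysql is used later by the database pipeline):

pip install scrapy pymysql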

Open the folder where the project will live, and create the Scrapy project from cmd:

scrapy startproject yelloweb

I won't walk through what each file in the Scrapy project does; if anything is unfamiliar, look it up first.
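For reference, this is the layout scrapy startproject generates (as of Scrapy 1.4, the version in the run log below):

yelloweb/
    scrapy.cfg          # deploy configuration
    yelloweb/
        __init__.py
        items.py        # item definitions
        middlewares.py  # middleware classes
        pipelines.py    # item pipelines
        settings.py     # project settings
        spiders/        # spiders live here
            __init__.py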
Open items.py under the yelloweb folder:

import scrapy


class YellowebItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()  # video title
    link = scrapy.Field()   # video link
    img = scrapy.Field()    # cover image link

Next, open the spiders folder under yelloweb and create yellowebSpider.py.
Open it and paste in the code first:

import scrapy


class yellowebSpider(scrapy.Spider):
    name = "webdata"  # 爬蟲的識別名,它必須是唯一的
allowed_domains = ["91.91p17.space"] start_urls = [ # 爬蟲開始爬的一個URL列表 "http://91.91p17.space/index.php" ] def parse(self, response): pass

Let's begin!
The first problem to solve is also the hardest one: how to log in.
Start by requesting the site's login page:

    # Request comes from scrapy.http: from scrapy.http import Request
    def start_requests(self):
        return [Request("http://91.91p17.space/login.php",
                        callback=self.login, meta={"cookiejar": 1})]

    def login(self, response):
        pass  # the login logic goes here in a moment

We use Request to jump to the page. callback=self.login means the response will be handled by the login function. meta={"cookiejar": 1} tells Scrapy's cookies middleware to keep the cookies from this exchange in a named session jar; every later request that passes along the same cookiejar value shares that session, which is what keeps us logged in.
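As an aside, the cookiejar value is just a key: distinct values give independent cookie sessions within a single spider. A minimal sketch of that documented behavior (the accounts list is a hypothetical placeholder):

    def start_requests(self):
        accounts = ["user_a", "user_b"]  # hypothetical
        for i, account in enumerate(accounts):
            # each distinct i gets its own cookie jar, i.e. its own login session
            yield Request("http://91.91p17.space/login.php",
                          meta={"cookiejar": i}, callback=self.login)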

Next we handle the login itself.
Code first:

    # class-level headers used to make requests look like a real browser
    # (the file also needs: import urllib.request,
    #  from urllib.parse import urljoin, from scrapy.http import FormRequest)
    headers = {
        "Host": "91.91p17.space",
        "Connection": "keep-alive",
        "Cache-Control": "max-age=0",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36",
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "Referer": "http://91.91p17.space/login.php",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-CN,zh;q=0.8"
    }

    def login(self, response):
        print("Starting the simulated login!")
        captcha_image = response.xpath('//*[@id="safecode"]/@src').extract()
        if len(captcha_image) > 0:
            print(urljoin("http://91.91p17.space", captcha_image[0]))
            # pick a file name and a local save path (raw string for the backslashes)
            localpath = r"D:\SoftWare\Soft\WorkSpace\Python\scrapy\code\captcha.png"

            # urlretrieve cannot attach headers, so install a global opener
            # that presents a browser User-Agent (explained below)
            opener = urllib.request.build_opener()
            opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
            urllib.request.install_opener(opener)
            urllib.request.urlretrieve(urljoin("http://91.91p17.space", captcha_image[0]), localpath)

            print("This login has a captcha; open the local captcha image and type it in:")
            captcha_value = input()
            data = {
                "username": "your username here",
                "password": "your password here",
                "fingerprint": "1838373130",
                "fingerprint2": "1a694ef42547498d2142328d89e38c22",
                "captcha_input": captcha_value,
                "action_login": "Log In",
                "x": "54",
                "y": "21"
            }
        else:
            print("No captcha on this login; the code must be wrong again!")
            return  # without the form data there is nothing to submit
        # print(data)
        print("The captcha was right!!!!")  # optimistic: printed before the site replies
        return [FormRequest.from_response(response,
                                          # carry the cookie session along
                                          meta={'cookiejar': response.meta['cookiejar']},
                                          # browser-like headers
                                          headers=self.headers,
                                          formdata=data,
                                          callback=self.next
                                          )]

    def next(self, response):
        pass  # handling of the site goes here

The code is a bit long, so let me explain slowly.
We need the headers because we have to pose as a browser; otherwise the site's anti-crawling measures block us.
The header values come from the Network tab under F12 in Chrome. (One caution: the first line DevTools shows, `GET /index.php HTTP/1.1`, is the request line, not a header, so it must not be copied into the dict.)

[screenshot: the request headers shown in Chrome DevTools' Network panel]
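Incidentally, the run log further down shows a USER_AGENT overridden in settings.py as well; a project-wide default there covers any request that doesn't pass explicit headers (the exact string is whatever you choose):

# settings.py -- a spider-wide fallback User-Agent
USER_AGENT = 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)'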

Now the login itself, which bothered me for days; captchas are a real nuisance!
Using F12 in Chrome, find the captcha element and copy its XPath:

 captcha_image = response.xpath('//*[@id="safecode"]/@src').extract()

Since this link is a relative address, we process it a little to get its absolute address:

urljoin("http://91.91p17.space", captcha_image[0])

Since I haven't studied machine learning, the captcha has to be typed in by hand!

    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36')]
    urllib.request.install_opener(opener)

These 3 lines of code are the crucial ones, because even downloading the captcha image turned out to have anti-crawling protection. What a pain.
Fortunately there is a way around it: urlretrieve cannot attach headers when fetching a remote resource, so the code above installs a global opener that disguises the download as a browser request.
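An equivalent one-off approach that avoids installing a global opener, sketched with only the standard library:

    import urllib.request

    captcha_url = urljoin("http://91.91p17.space", captcha_image[0])
    req = urllib.request.Request(captcha_url, headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1941.0 Safari/537.36'})
    with urllib.request.urlopen(req) as resp, open(localpath, 'wb') as f:
        f.write(resp.read())  # save the captcha image locally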
Bingo!!
The captcha downloads just fine!
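If typing the captcha by hand gets old, OCR is one possible upgrade. A sketch using pytesseract (whether it can actually read this site's captchas is untested, purely an assumption):

    from PIL import Image   # pip install pillow
    import pytesseract      # pip install pytesseract (needs the tesseract binary)

    # try to read the digits straight from the downloaded image
    captcha_value = pytesseract.image_to_string(Image.open(localpath)).strip()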
The data dictionary holds exactly what we need to submit to the site, and FormRequest.from_response() posts it for us.

And with that we stroll right into the site. Ha! Ha!

Now that we're in, we can't settle for "taking" just a little!
Looking around: oh, there's a "more videos" link, so let's jump straight to that page:

    def next(self, response):
        href = response.xpath('//*[@id="tab-featured"]/div/a/@href').extract()
        url = urljoin("http://91.91p17.space", href[0])
        # print("\n\n\n\n\n\n" + url + "\n\n\n\n\n\n")
        yield scrapy.http.Request(url, meta={'cookiejar': response.meta['cookiejar']},
                                  # send browser-like headers
                                  headers=response.headers, callback=self.parse)

    def parse(self, response):
        pass  # parse the page

Same idea as before: convert the relative address into an absolute one.
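A small aside: Scrapy responses have a built-in urljoin that uses the page's own URL as the base, which saves hard-coding the host (the parse code below uses it once):

    # equivalent to urljoin("http://91.91p17.space", href[0])
    # when the response came from that host
    url = response.urljoin(href[0])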
Then the crawling begins!

    def parse(self, response):
        # requires: from scrapy.selector import Selector
        sel = Selector(response)
        print("Reached the 'more videos' page")

        web_list = sel.css('.listchannel')
        for web in web_list:
            item = YellowebItem()
            try:
                item['link'] = web.xpath('a/@href').extract()[0]
                url = response.urljoin(item['link'])
                yield scrapy.Request(url, meta={'cookiejar': response.meta['cookiejar']},
                                     callback=self.parse_content, dont_filter=True)
            except Exception:
                print("That one failed...")

        # move to the next page (note: outside the per-item loop,
        # so it is requested once per listing page)
        href = response.xpath('//*[@id="paging"]/div/form/a[6]/@href').extract()
        if href:
            nextPage = urljoin("http://91.91p17.space/video.php", href[0])
            print(nextPage)
            yield scrapy.http.Request(nextPage, meta={'cookiejar': response.meta['cookiejar']},
                                      # send browser-like headers
                                      headers=response.headers, callback=self.parse)


    def parse_content(self, response):
        try:
            name = response.xpath('//*[@id="head"]/h3/a[1]/text()').extract()[0]  # (extracted but unused)

            item = YellowebItem()
            item['link'] = response.xpath('//*[@id="vid"]//@src').extract()[0]
            item['title'] = response.xpath('//*[@id="viewvideo-title"]/text()').extract()[0].strip()
            item['img'] = response.xpath('//*[@id="vid"]/@poster').extract()[0]
            yield item
        except Exception:
            print("That one failed... couldn't scrape it...")

The items are extracted with a CSS selector plus XPath selectors; a short note on chaining them follows below.
And with that, the whole 91 site comes down.
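For the record, css() and xpath() chain freely on the same selector, and extract_first() is a convenient guard against empty results. A minimal sketch:

    # each .listchannel block is itself a selector, so relative XPath works on it
    for web in response.css('.listchannel'):
        link = web.xpath('a/@href').extract_first()  # None instead of IndexError
        if link:
            print(response.urljoin(link))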
Here is the run output:

D:\SoftWare\Soft\WorkSpace\Python\scrapy\yelloweb>scrapy crawl webdata
2017-10-25 21:03:44 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: yelloweb)
2017-10-25 21:03:44 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['yelloweb.spiders'], 'NEWSPIDER_MODULE': 'yelloweb.spiders', 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'BOT_NAME': 'yelloweb'}
2017-10-25 21:03:44 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2017-10-25 21:03:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-10-25 21:03:45 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-10-25 21:03:45 [scrapy.middleware] INFO: Enabled item pipelines:
['yelloweb.pipelines.YellowebPipeline']
2017-10-25 21:03:45 [scrapy.core.engine] INFO: Spider opened
2017-10-25 21:03:45 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-10-25 21:03:45 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-10-25 21:03:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/login.php> (referer: None)
Starting the simulated login!
http://91.91p17.space/captcha.php
This login has a captcha; open the local captcha image and type it in:

Open the path where you saved the captcha to see it:

This login has a captcha; open the local captcha image and type it in:
4541
The captcha was right!!!!
2017-10-25 21:05:11 [scrapy.extensions.logstats] INFO: Crawled 1 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2017-10-25 21:05:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET http://91.91p17.space/index.php> from <POST http://91.91p17.space/login.php>
2017-10-25 21:05:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/index.php> (referer: http://91.91p17.space/login.php)

2017-10-25 21:05:45 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 1 pages/min), scraped 0 items (at 0 items/min)
2017-10-25 21:06:45 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

Then it turns into this:
it looks stuck, but after a wait it keeps running.
Maybe the network is just slow?? If any expert knows, please explain!!!

Once it finishes, the results come out:


2017-10-25 21:08:14 [scrapy.core.scraper] DEBUG: Scraped from <200 http://91.91p17.space/view_video.php?viewkey=e231628214a5c5ea54ba&page=1&viewtype=basic&category=rf>
{'img': 'http://img2.t6k.co/thumb/240427.jpg',
 'link': 'http://192.240.120.100//mp43/240427.mp4?st=iQXkdUjR5J_1H2KjVY8WgQ&e=1509009304',
 'title': 'woman on top Guanyin sit lotus, [help to apply for highlight, '
          'thanks 91PORN platform, management audit fortunately]'}
2017-10-25 21:08:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/view_video.php?viewkey=247433dbac92ae91f6ff&page=1&viewtype=basic&category=rf> (referer: http://91.91p17.space/video.php?category=rf)
2017-10-25 21:08:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/view_video.php?viewkey=5ff48ed3ecc37745251b&page=1&viewtype=basic&category=rf> (referer: http://91.91p17.space/video.php?category=rf)
2017-10-25 21:08:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/view_video.php?viewkey=d5d24ee2936c086eb342&page=1&viewtype=basic&category=rf> (referer: http://91.91p17.space/video.php?category=rf)
2017-10-25 21:08:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/view_video.php?viewkey=358683d42298681fabe0&page=1&viewtype=basic&category=rf> (referer: http://91.91p17.space/video.php?category=rf)
2017-10-25 21:08:14 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://91.91p17.space/view_video.php?viewkey=835bf1fac457e9a8e9f6&page=1&viewtype=basic&category=rf> (referer: http://91.91p17.space/video.php?category=rf)

Next we store the results in the database.
Open pipelines.py; the code is as follows:

import pymysql as db

class YellowebPipeline(object):
    def __init__(self):
        self.con = db.connect(user="root", passwd="root", host="localhost", db="python", charset="utf8")
        self.cur = self.con.cursor()
        # start from a clean table on every run ("if exists" avoids an
        # error the first time, when the table isn't there yet)
        self.cur.execute('drop table if exists 91pron_content')
        self.cur.execute("create table 91pron_content(id int auto_increment primary key, title varchar(200), img varchar(244), link varchar(244))")

    def process_item(self, item, spider):
        self.cur.execute("insert into 91pron_content(id,title,img,link) values(NULL,%s,%s,%s)", (item['title'], item['img'], item['link']))
        self.con.commit()
        return item
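The disclaimer at the top mentions MongoDB; if you would rather store items there, here is a minimal pipeline sketch with pymongo (the connection details and database/collection names are my own placeholders):

import pymongo

class YellowebMongoPipeline(object):
    def __init__(self):
        # connection details are assumptions; adjust them to your setup
        client = pymongo.MongoClient("localhost", 27017)
        self.collection = client["python"]["91pron_content"]

    def process_item(self, item, spider):
        # Scrapy items behave like dicts, so they convert directly
        self.collection.insert_one(dict(item))
        return item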

At the same time, set this in settings.py:


DOWNLOADER_MIDDLEWARES = {
    # a value of None disables this template-generated middleware
    'yelloweb.middlewares.MyCustomDownloaderMiddleware': None,
}
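The run log above also lists YellowebPipeline among the enabled item pipelines, so ITEM_PIPELINES must be switched on in settings.py as well; a sketch (300 is just the conventional priority value):

ITEM_PIPELINES = {
    'yelloweb.pipelines.YellowebPipeline': 300,
}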

And that's a simple little crawler, done!!