2017.08.10 Python Crawlers in Practice: Crawler Attack and Defense

阿新 • Published: 2017-08-10


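1. Building an ordinary crawler: generally speaking, a crawler that makes fewer than 100 requests has no need to worry about anti-crawler measures.

(1) Taking Meijutt (美劇天堂) as the example, with the source page http://www.meijutt.com/new100.html, set up the project: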

scrapy startproject meiju100

F:\Python\PythonWebScraping\PythonScrapyProject>cd meiju100

F:\Python\PythonWebScraping\PythonScrapyProject\meiju100>scrapy genspider meiju100Spider meijutt.com


Project file structure:

[Screenshot in the original: project file structure]

(2) Edit items.py:

[Screenshot in the original: items.py]

(3) Edit meiju100Spider.py:

First inspect the page source: the tags beginning with <div class="lasted-num fn-left"> contain the data we need:

[Screenshot in the original: page source]

# -*- coding: utf-8 -*-
import scrapy
from meiju100.items import Meiju100Item


class Meiju100spiderSpider(scrapy.Spider):
    name = 'meiju100Spider'
    allowed_domains = ['meijutt.com']
    # start_urls must be a list (or tuple) of URLs, not a bare string
    start_urls = ['http://www.meijutt.com/new100.html']

    def parse(self, response):
        # One selector per ranking entry; the other fields are reached through its parent node
        subSelector = response.xpath('//li/div[@class="lasted-num fn-left"]')
        items = []
        for sub in subSelector:
            item = Meiju100Item()
            item['storyName'] = sub.xpath('../h5/a/text()').extract()[0]
            item['storyState'] = sub.xpath('../span[@class="state1 new100state1"]/text()').extract()[0]
            # Kept as a list because some entries have no TV station
            item['tvStation'] = sub.xpath('../span[@class="mjtv"]/text()').extract()
            # Run-time error noted here: IndexError: list index out of range;
            # <div class="lasted-time new100time fn-right"> is not under the parent node used above.
            # An XPath starting with '//' is evaluated from the document root, so [0] is the
            # first match on the page for every item.
            item['updateTime'] = sub.xpath('//div[@class="lasted-time new100time fn-right"]/text()').extract()[0]
            items.append(item)

        return items


(4) Write pipelines.py to save the scraped data to a file:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import time


class Meiju100Pipeline(object):
    def process_item(self, item, spider):
        # One output file per day, named <YYYYMMDD>meiju.txt
        today = time.strftime('%Y%m%d', time.localtime())
        fileName = today + 'meiju.txt'
        # Append one tab-separated line per item, written as UTF-8 text (Python 3)
        with open(fileName, 'a', encoding='utf-8') as fp:
            fp.write("%s \t" % item['storyName'])
            fp.write("%s \t" % item['storyState'])
            # tvStation is a list and may be empty
            if len(item['tvStation']) == 0:
                fp.write("unknown \t")
            else:
                fp.write("%s \t" % item['tvStation'][0])
            fp.write("%s \n" % item['updateTime'])

        return item

(5) Edit settings.py:

[Screenshot in the original: settings.py]
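The settings.py screenshot is not preserved. For the pipeline above to run, it has to be registered in ITEM_PIPELINES, so the change was presumably along these lines. The dotted path is assumed from the project and class names used in this post, and 300 is an arbitrary priority in the 0 to 1000 range.

```python
# settings.py (fragment; a sketch, not the author's exact file)
# Maps the pipeline class path to its order; lower numbers run first.
ITEM_PIPELINES = {
    'meiju100.pipelines.Meiju100Pipeline': 300,
}
```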

(6) From any directory inside the meiju100 project, run the command: scrapy crawl meiju100Spider

Result:

[Screenshot in the original: crawl output]

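2. Defeating request-interval blocking: Scrapy's DOWNLOAD_DELAY setting controls the time between two consecutive requests. If anti-crawler measures are ignored, the smaller this value the better.

Setting DOWNLOAD_DELAY to 0.1, for example, sends a page request to the site every 0.1 seconds.

A site that blocks on request frequency will readily flag traffic at that rate, so set a longer delay by appending this setting to the end of settings.py:

[Screenshot in the original: the DOWNLOAD_DELAY line in settings.py]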
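As a rough sketch of the appended setting: the value below is an assumption chosen for illustration, since the number the author used is not preserved. DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY are standard Scrapy settings.

```python
# settings.py (fragment)
# Illustrative value only; the number used in the original is not preserved.
DOWNLOAD_DELAY = 5  # seconds between consecutive requests to the same site

# Scrapy's default behaviour: the actual wait is a random 0.5x to 1.5x of DOWNLOAD_DELAY,
# which makes the request interval less regular.
RANDOMIZE_DOWNLOAD_DELAY = True
```

A longer delay slows the crawl in proportion, so the value is a trade-off between speed and the risk of being blocked.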

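3. Defeating cookie-based blocking: as is well known, websites use cookies to identify users. While scraping, a Scrapy crawler sends every request with the same cookies, which makes it about as conspicuous as setting DOWNLOAD_DELAY to 0.1.

Defeating this kind of anti-crawler measure is therefore just as simple: disable cookies by appending one setting to settings.py:

[Screenshot in the original: the cookies setting in settings.py]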
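The setting that disables cookies in Scrapy is COOKIES_ENABLED, so the appended line would presumably look like this (a sketch; the screenshot itself is not preserved):

```python
# settings.py (fragment)
# Disable the cookies middleware so requests carry no session cookies
# that the site could use to tie them to one visitor.
COOKIES_ENABLED = False
```

This only suits pages that do not require a login, since a logged-in session depends on cookies.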
