Scrapy: Scraping Zhihu

The main goals are:

·       Start from the questions under a "how do you evaluate X" style topic, then crawl the related questions of each question, and keep looping

·       For each question, scrape the title, the number of followers, the number of answers, and other data

1    Create the project

$ scrapy startproject zhihu

New Scrapy project 'zhihu', using template directory '/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Scrapy-1.3.2-py2.7.egg/scrapy/templates/project', created in:

    /Users/huilinwang/zhihu

 

You can start your first spider with:

    cd zhihu

    scrapy genspider example example.com
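
For reference, the generated project has roughly the standard Scrapy layout (a sketch; file names follow the default template):

    zhihu/
        scrapy.cfg            # deploy configuration
        zhihu/
            __init__.py
            items.py          # item definitions (section 4)
            pipelines.py      # item pipelines (sections 3 and 6)
            settings.py       # project settings (section 3.3)
            spiders/
                __init__.py
                zhihuspider.py   # the spider created below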

2    Write the spider

In the /zhihu/zhihu/spiders directory, create a file named zhihuspider.py; its full content is listed later in this document (section 5).

2.1    Function def start_requests(self):

     This method must return an iterable with the first Requests to crawl for this spider. It is called only once.

    It is called when no start URLs are specified (via the start_urls attribute); if start URLs are given, make_requests_from_url() is used for them instead.
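
For contrast, a minimal sketch of that default start_urls behaviour; the spider name and site here are illustrative only and not part of this project:

    import scrapy

    class ExampleSpider(scrapy.Spider):
        name = "example"
        # With start_urls defined, the default start_requests() yields one
        # Request per URL and routes each response to self.parse().
        start_urls = ["http://www.example.com"]

        def parse(self, response):
            # Default callback: extract whatever you need from the page.
            yield {"title": response.css("title::text").extract_first()}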

    To POST a login request instead, you can use code like the following:

class MySpider(scrapy.Spider):
    name = 'myspider'

    def start_requests(self):
        return [scrapy.FormRequest("http://www.example.com/login",
                                   formdata={'user': 'john', 'pass': 'secret'},
                                   callback=self.logged_in)]

    def logged_in(self, response):
        # here you would extract links to follow and return Requests for
        # each of them, with another callback
        pass

In this project, start_requests looks like this:

    def start_requests(self):
        yield scrapy.Request(
            url=self.zhihu_url,
            headers=self.headers_dict,
            callback=self.request_captcha
        )

Here zhihu_url and headers_dict are class attributes shared by the spider's callbacks.

request_captcha is the callback function; it is invoked once the downloader returns the response for this request.

2.1.1 The scrapy.Request class

It lives in scrapy/http/request/__init__.py, with the following initializer:

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None):

This module implements the Request class, which represents an HTTP request. See the official documentation (docs/topics/request-response.rst).

    Scrapy uses Request and Response objects to crawl web sites. Request objects are generated in the spider and passed through the system until they reach the downloader, which executes the request and returns a Response object.

    Both Request and Response have subclasses that add functionality not required in the base classes.

    The Request object looks like this:

class scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])

    A Request object represents an HTTP request; it is generated in the spider, executed by the downloader, and produces a Response.

2.1.2    Request parameters

    url (string) – the URL of this request. This attribute is read-only; to change the URL use replace().

    callback (callable) – the function that will be called with the response of this request once it has been downloaded. If no callback is specified, parse() is used by default. Note that if an exception is raised during processing, errback is called instead.

    method (string) – the HTTP method of this request. Defaults to 'GET'.

    meta (dict) – the initial values for the request.meta attribute. If given, the dict is shallow-copied.

    body (str or unicode) – the request body. If a unicode is passed, it is encoded to utf-8. If the body is not given, an empty string is stored. Regardless of the type of this argument, the final value stored is a str.

    headers (dict) – the headers of this request. Values can be single-valued or multi-valued headers. If None is passed as a value, that HTTP header is not sent at all.

    cookies (dict or list) – the request cookies. They can be sent in two forms.

    Using a dict:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'})

    Using a list of dicts:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies=[{'name': 'currency',
                                             'value': 'USD',
                                             'domain': 'example.com',
                                             'path': '/currency'}])

    The saved cookies only matter when making subsequent requests.

    Some sites return cookies in their responses, and these are stored and sent again in later requests. This is the usual behaviour of a web browser. If, for some reason, you want to avoid merging with existing cookies, set dont_merge_cookies to True in request.meta.

    For example, a request that does not merge cookies:

    request_with_cookies = Request(url="http://www.example.com",
                                   cookies={'currency': 'USD', 'country': 'UY'},
                                   meta={'dont_merge_cookies': True})

    encoding (string) – the encoding of this request (defaults to 'utf-8'). It is used to convert strings to the given encoding.

    priority (int) – the priority of this request (defaults to 0). The scheduler uses the priority to decide the order in which requests are processed; higher-priority requests run earlier, and negative values indicate relatively low priority.

    dont_filter (boolean) – indicates that this request should not be filtered by the scheduler. Useful when an identical request has to be issued multiple times and the duplicates filter must be bypassed. Use with care, or you will end up in a crawling loop. Defaults to False.

    errback (callable) – a function called when an exception is raised while processing the request, including 404 and other HTTP errors.
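
Putting several of these parameters together, a hedged sketch of a fully specified Request (the URL, callback names and meta key are illustrative only):

    import scrapy

    def handle_error(failure):
        # errback: called on exceptions, 404s and other HTTP errors
        print("request failed: %s" % failure)

    def parse_detail(response):
        # meta values attached to the Request come back on the Response
        print("%s (hint=%s)" % (response.url, response.meta.get("hint")))

    request = scrapy.Request(
        url="http://www.example.com/some-page",
        callback=parse_detail,       # invoked with the downloaded Response
        errback=handle_error,
        method="GET",
        headers={"User-Agent": "Mozilla/5.0"},
        meta={"hint": 1},            # arbitrary data carried along with the request
        priority=10,                 # scheduled before priority-0 requests
        dont_filter=True,            # bypass the duplicates filter
    )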

2.2    Function def request_captcha(self, response):

         This is the callback of start_requests. Once the downloader has processed the request it returns a response, and this function handles that response.

    def request_captcha(self, response):
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        captcha_url = "http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000)
        yield scrapy.Request(
            url=captcha_url,
            headers=self.headers_dict,
            callback=self.download_captcha,
            meta={"_xsrf": _xsrf}  # pass the token on; download_captcha reads response.meta["_xsrf"]
        )

Here time.time() returns the current timestamp (floating-point seconds since the 1970 epoch); multiplying by 1000 simply produces a fresh value for the r= query parameter.

In the code,

    response.css('input[name="_xsrf"]::attr(value)').extract()[0]

extracts the value bound to _xsrf, which comes from this hidden field in the HTML source:

    <input type="hidden" name="_xsrf" value="fb57ee37dc9bd70821e6ed878bdfe24f"/>

The function finally hands off to download_captcha.
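
As a quick, self-contained illustration of this selector (the HTML snippet is made up), scrapy.Selector can be used outside a spider:

    from scrapy import Selector

    html = '<form><input type="hidden" name="_xsrf" value="abc123"/></form>'
    sel = Selector(text=html)
    # ::attr(value) selects the value attribute of the matched <input>
    print(sel.css('input[name="_xsrf"]::attr(value)').extract_first())  # -> abc123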

2.3    Function def download_captcha(self, response):

The function is:

    def download_captcha(self, response):
        with open("captcha.gif", "wb") as fp:
            fp.write(response.body)
        os.system('open captcha.gif')  # macOS: show the captcha image to the user
        print "Please enter the captcha:\n"
        captcha = raw_input()
        yield scrapy.FormRequest(
            url=self.login_url,       # login_url, email and password are class attributes (see section 5)
            headers=self.headers_dict,
            formdata={
                "email": self.email,
                "password": self.password,
                "_xsrf": response.meta["_xsrf"],
                "remember_me": "true",
                "captcha": captcha
            },
            callback=self.request_zhihu
        )

This function is the callback of request_captcha.

Its main new element is scrapy.FormRequest.

2.3.1 FormRequest

    The FormRequest class extends the Request base class. It uses lxml.html forms to pre-populate form fields with data from a Response object.

    The class adds one new argument to the constructor; the remaining arguments are the same as for Request.

classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

    Returns a new FormRequest object with its form field values pre-populated from the form found in the given response.

Its parameters include:

·      formdata (dict) – fields to override in the form data. If a field was already present in the response <form> element, its value is overridden by the one passed in this parameter.

    In other words, this is the data that overrides what is in the form. It is mainly used to simulate an HTML form POST and send a set of key/value pairs.
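
A hedged sketch of from_response (this project builds its FormRequest by hand instead, as shown above; the site and field names below are illustrative):

    import scrapy

    class LoginSpider(scrapy.Spider):
        name = 'login-example'
        start_urls = ['http://www.example.com/users/login.php']

        def parse(self, response):
            # from_response() copies the page's <form> fields (e.g. hidden
            # CSRF tokens) and merges in the values given via formdata.
            return scrapy.FormRequest.from_response(
                response,
                formdata={'username': 'john', 'password': 'secret'},
                callback=self.after_login
            )

        def after_login(self, response):
            if "authentication failed" in response.body:
                self.logger.error("Login failed")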

Then request_zhihu is called as the callback.

2.4    Function def request_zhihu(self, response):

The code is:

    def request_zhihu(self, response):
        yield scrapy.Request(url=self.topic + '/19760570',
                             headers=self.headers_dict,
                             callback=self.get_topic_question,
                             dont_filter=True)

Crawling starts from https://www.zhihu.com/topic/19760570/hot.

dont_filter=True is set because this URL has to be requested repeatedly as the crawl loops.

The callback is get_topic_question.

2.5    Function def get_topic_question(self, response):

The code is:

    def get_topic_question(self, response):
        # with open("topic.html", "wb") as fp:
        #     fp.write(response.body)
        question_urls = response.css(".question_link[target=_blank]::attr(href)").extract()
        length = len(question_urls)
        k = -1
        j = 0
        temp = []
        # keep every third link (indices 2, 5, 8, ...)
        for j in range(length/3):
            temp.append(question_urls[k+3])
            j += 1
            k += 3
        for url in temp:
            yield scrapy.Request(url=self.zhihu_url + url,
                                 headers=self.headers_dict,
                                 callback=self.parse_question_data)

This collects the relative links (the href attributes), which are the relative URLs of the questions under the Zhihu topic, into question_urls. It then keeps every third URL fragment and, for each one, issues a new Request whose URL is the fragment joined onto the site root. The callback is parse_question_data.
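
The index bookkeeping with j and k is equivalent to a simple slice; a minimal sketch with made-up fragments:

    question_urls = ["/q/1", "/t/1", "/a/1", "/q/2", "/t/2", "/a/2"]

    # every third element starting at index 2 -- same result as the j/k loop above
    temp = question_urls[2::3]
    print(temp)  # ['/a/1', '/a/2']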

2.6    Function def parse_question_data(self, response):

This is the spider's last callback:

    def parse_question_data(self, response):
        item = ZhihuItem()
        item["qid"] = re.search('\d+', response.url).group()
        item["title"] = response.css(".zm-item-title::text").extract()[0].strip()
        item["answers_num"] = response.css("h3::attr(data-num)").extract()[0]
        question_nums = response.css(".zm-side-section-inner .zg-gray-normal strong::text").extract()
        item["followers_num"] = question_nums[0]
        item["visitsCount"] = question_nums[1]
        item["topic_views"] = question_nums[2]
        topic_tags = response.css(".zm-item-tag::text").extract()
        if len(topic_tags) >= 3:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = topic_tags[2].strip()
            print item
        elif len(topic_tags) == 2:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = '-'
        elif len(topic_tags) == 1:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = '-'
            item["topic_tag2"] = '-'
        # print type(item["title"])
        question_links = response.css(".question_link::attr(href)").extract()
        yield item
        for url in question_links:
            yield scrapy.Request(url=self.zhihu_url + url,
                                 headers=self.headers_dict,
                                 callback=self.parse_question_data)

The crawl keeps looping this way until it finishes.

3    Edit pipelines.py

The pipeline imports the items into a database.

3.1    Function def open_spider(self, spider):

This method is called when the spider is opened.

3.2    Function def process_item(self, item, spider):

   This method is called for every item pipeline component. It must return an Item (or any subclass) object, or raise a DropItem exception; dropped items are not processed by any further pipeline components.
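
A minimal sketch of that contract (the validation rule here is illustrative, not part of this project):

    from scrapy.exceptions import DropItem

    class ValidateQuestionPipeline(object):
        def process_item(self, item, spider):
            # Either return the (possibly modified) item so later pipelines see it...
            if item.get("title"):
                return item
            # ...or raise DropItem to stop processing it entirely.
            raise DropItem("missing title in %s" % item)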

3.3    Edit settings.py

Set ROBOTSTXT_OBEY = False.
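
The relevant lines in settings.py would look roughly like this; note that the pipeline from section 6 also has to be enabled via ITEM_PIPELINES (the exact module path depends on your project layout and is assumed here):

    # settings.py
    ROBOTSTXT_OBEY = False  # do not skip pages disallowed by robots.txt (see section 8)

    # assumed: enable the MySQL pipeline defined in pipelines.py
    ITEM_PIPELINES = {
        'zhihu.pipelines.ZhihuPipeline': 300,
    }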

 

4    Items code (items.py)

import scrapy

class ZhihuItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    qid = scrapy.Field()
    title = scrapy.Field()
    followers_num = scrapy.Field()
    answers_num = scrapy.Field()
    visitsCount = scrapy.Field()
    topic_views = scrapy.Field()
    topic_tag0 = scrapy.Field()
    topic_tag1 = scrapy.Field()
    topic_tag2 = scrapy.Field()
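
Items behave like dictionaries with a fixed set of declared keys; a short usage sketch:

    item = ZhihuItem()
    item["qid"] = "19760570"
    item["title"] = "Example question"
    print(item["qid"])          # fields are read back like dict keys
    # item["unknown"] = "x"     # would raise KeyError: "unknown" is not a declared Field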

 

5    Spider code (zhihuspider.py)

#coding=utf-8
import scrapy
import os
import time
import re
import json

from ..items import ZhihuItem

class zhihutopicSpider(scrapy.Spider):
    name = "zhihu"  # assumed spider name, matching the "scrapy crawl zhihu" command in section 7
    zhihu_url = "https://www.zhihu.com"
    login_url = "https://www.zhihu.com/login/email"  # assumed login endpoint, referenced in download_captcha
    topic = "https://www.zhihu.com/topic"            # topic base URL, referenced in request_zhihu
    email = "your_email@example.com"                 # placeholder credentials, replace with your own
    password = "your_password"
    headers_dict = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.8",
        "Connection": "keep-alive",
        "Host": "www.zhihu.com",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36"
    }

    def start_requests(self):
        yield scrapy.Request(
            url=self.zhihu_url,
            headers=self.headers_dict,
            callback=self.request_captcha
        )

    def request_captcha(self, response):
        _xsrf = response.css('input[name="_xsrf"]::attr(value)').extract()[0]
        captcha_url = "http://www.zhihu.com/captcha.gif?r=" + str(time.time() * 1000)
        yield scrapy.Request(
            url=captcha_url,
            headers=self.headers_dict,
            callback=self.download_captcha,
            meta={"_xsrf": _xsrf}  # pass the token on for the login form
        )

    def download_captcha(self, response):
        with open("captcha.gif", "wb") as fp:
            fp.write(response.body)
        os.system('open captcha.gif')  # macOS: show the captcha image to the user
        print "Please enter the captcha:\n"
        captcha = raw_input()
        yield scrapy.FormRequest(
            url=self.login_url,
            headers=self.headers_dict,
            formdata={
                "email": self.email,
                "password": self.password,
                "_xsrf": response.meta["_xsrf"],
                "remember_me": "true",
                "captcha": captcha
            },
            callback=self.request_zhihu
        )

    def request_zhihu(self, response):
        yield scrapy.Request(url=self.topic + '/19760570',
                             headers=self.headers_dict,
                             callback=self.get_topic_question,
                             dont_filter=True)

    def get_topic_question(self, response):
        # with open("topic.html", "wb") as fp:
        #     fp.write(response.body)
        question_urls = response.css(".question_link[target=_blank]::attr(href)").extract()
        length = len(question_urls)
        k = -1
        j = 0
        temp = []
        for j in range(length/3):
            temp.append(question_urls[k+3])
            j += 1
            k += 3
        for url in temp:
            yield scrapy.Request(url=self.zhihu_url + url,
                                 headers=self.headers_dict,
                                 callback=self.parse_question_data)

    def parse_question_data(self, response):
        item = ZhihuItem()
        item["qid"] = re.search('\d+', response.url).group()
        item["title"] = response.css(".zm-item-title::text").extract()[0].strip()
        item["answers_num"] = response.css("h3::attr(data-num)").extract()[0]
        question_nums = response.css(".zm-side-section-inner .zg-gray-normal strong::text").extract()
        item["followers_num"] = question_nums[0]
        item["visitsCount"] = question_nums[1]
        item["topic_views"] = question_nums[2]
        topic_tags = response.css(".zm-item-tag::text").extract()
        if len(topic_tags) >= 3:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = topic_tags[2].strip()
        elif len(topic_tags) == 2:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = topic_tags[1].strip()
            item["topic_tag2"] = '-'
        elif len(topic_tags) == 1:
            item["topic_tag0"] = topic_tags[0].strip()
            item["topic_tag1"] = '-'
            item["topic_tag2"] = '-'
        # print type(item["title"])
        question_links = response.css(".question_link::attr(href)").extract()
        yield item
        for url in question_links:
            yield scrapy.Request(url=self.zhihu_url + url,
                                 headers=self.headers_dict,
                                 callback=self.parse_question_data)

6    Pipeline code (pipelines.py)

import MySQLdb

class ZhihuPipeline(object):
    print "\n\n\n\n\n\n\n\n"
    sql_questions = (
            "INSERT INTO questions("
            "qid, title, answers_num, followers_num, visitsCount, topic_views, topic_tag0, topic_tag1, topic_tag2) "
            "VALUES ('%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s', '%s')")
    count = 0

    def open_spider(self, spider):
        host = "localhost"
        user = "root"
        password = "wangqi"
        dbname = "zh"
        self.conn = MySQLdb.connect(host, user, password, dbname)
        self.cursor = self.conn.cursor()
        self.conn.set_character_set('utf8')
        self.cursor.execute('SET NAMES utf8;')
        self.cursor.execute('SET CHARACTER SET utf8;')
        self.cursor.execute('SET character_set_connection=utf8;')
        print "\n\nMYSQL DB CURSOR INIT SUCCESS!!\n\n"
        sql = (
            "CREATE TABLE IF NOT EXISTS questions ("
                "qid VARCHAR (100) NOT NULL,"
                "title varchar(100),"
                "answers_num INT(11),"
                "followers_num INT(11) NOT NULL,"
                "visitsCount INT(11),"
                "topic_views INT(11),"
                "topic_tag0 VARCHAR (600),"
                "topic_tag1 VARCHAR (600),"
                "topic_tag2 VARCHAR (600),"
                "PRIMARY KEY (qid)"
            ")")
        self.cursor.execute(sql)
        print "\n\nTABLES ARE READY!\n\n"

    def process_item(self, item, spider):
        # note: parameterized queries (cursor.execute(sql, params)) would be
        # safer than string interpolation against quoting problems in titles
        sql = self.sql_questions % (item["qid"], item["title"], item["answers_num"], item["followers_num"],
                                    item["visitsCount"], item["topic_views"], item["topic_tag0"], item["topic_tag1"], item["topic_tag2"])
        self.cursor.execute(sql)
        if self.count % 10 == 0:
            self.conn.commit()  # commit in batches of ten items
        self.count += 1
        print item["qid"] + " DATA COLLECTED!"
        return item  # item pipelines should return the item (see section 3.2)

 

7    Run the crawler

scrapy crawl zhihu
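
Optionally, Scrapy's built-in feed export can also dump the scraped items to a file alongside the MySQL pipeline, which is handy for quick inspection:

scrapy crawl zhihu -o questions.json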

 

8    About anti-crawling (robots.txt)

    robots.txt (always lowercase) is an ASCII text file placed in the root directory of a web site. It tells crawlers which parts of the site should not be fetched by search-engine spiders and which parts may be fetched. robots.txt is a gentleman's agreement rather than a standard, merely a widely followed convention; some search engines respect it and others do not. Scrapy obeys the robots protocol automatically, which is why you have to configure it in settings.py not to obey (ROBOTSTXT_OBEY = False) before you can crawl sites, such as Zhihu, that list scrapy in their robots.txt.
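
For reference, a typical robots.txt that blocks one crawler entirely while restricting others only partially looks roughly like this (illustrative content, not Zhihu's actual file):

    User-agent: Scrapy
    Disallow: /

    User-agent: *
    Disallow: /login
    Disallow: /admin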

 

 

 

 

           
