Simulating a GitHub login with Scrapy

d:

Switch to the D: drive.

scrapy startproject GitHub

Create the project.

scrapy genspider github github.com

Generate the spider.

Edit github.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy import Request, FormRequest


class GithubSpider(scrapy.Spider):
    name = 'github'
    allowed_domains = ['github.com']

    # Request headers sent with the login POST
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        'Connection': 'keep-alive',
        'Referer': 'https://github.com/',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:52.0) Gecko/20100101 Firefox/52.0',
        'Content-Type': 'application/x-www-form-urlencoded'
    }

    def start_requests(self):
        # Override start_requests so the first request carries a cookiejar id
        urls = ['https://github.com/login']
        for url in urls:
            # meta is a dict of per-request metadata. 'cookiejar' is a special
            # meta key that selects the cookie session for the request; tagging
            # requests with different ids (1, 2, 3, 4, ...) lets Scrapy keep
            # several independent sessions against the same site.
            # The response for the URL is passed to the callback.
            yield Request(url, meta={'cookiejar': 1}, callback=self.github_login)

    def github_login(self, response):
        # Read the authenticity_token value from the login page source
        authenticity_token = response.xpath(".//*[@id='login']/form/input[2]/@value").extract_first()
        return FormRequest.from_response(
            response,
            url='https://github.com/session',
            meta={'cookiejar': response.meta['cookiejar']},
            headers=self.headers,
            formdata={
                'authenticity_token': authenticity_token,
                'commit': 'Sign in',
                'login': '[email protected]',  # your GitHub username or email
                'password': 'caihong@1234',
                'utf8': '✓'
            },
            callback=self.github_after,
            # With dont_click=True the form data is submitted without
            # simulating a click on any element
            dont_click=True
        )

    def github_after(self, response):
        # Text of the first dashboard nav link, which reads "Browse activity"
        # after a successful login
        home_page = response.xpath(".//*[@id='dashboard']/div[2]/div[1]/nav/a[1]/text()").extract()

        if 'Browse activity' in home_page:
            # The marker text is present, so report success
            self.logger.info('Login succeeded!')
        else:
            self.logger.error('Login failed!')
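
The spider above picks the token by position (`input[2]`), which stops working if the hidden fields on the login form are reordered. As a sketch of a more tolerant approach, the following looks the field up by its `name` attribute instead. It uses only the standard library's `html.parser` so that it runs without Scrapy; the sample HTML and the `HiddenInputCollector` class are made up for illustration, and the real login page markup may differ. Inside the spider, the same idea would be an XPath that selects the input by name, for example `//input[@name="authenticity_token"]/@value`.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name -> value for every <input type="hidden"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        if attributes.get('type') == 'hidden' and 'name' in attributes:
            self.fields[attributes['name']] = attributes.get('value') or ''


# Made-up stand-in for the login page; the real markup may differ.
SAMPLE_LOGIN_HTML = """
<div id="login">
  <form action="/session" method="post">
    <input type="hidden" name="utf8" value="&#x2713;">
    <input type="hidden" name="authenticity_token" value="abc123==">
    <input type="text" name="login">
    <input type="password" name="password">
  </form>
</div>
"""

collector = HiddenInputCollector()
collector.feed(SAMPLE_LOGIN_HTML)

# Look the token up by field name rather than by its position in the form
token = collector.fields.get('authenticity_token')
print(token)
```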

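The spider's comments on `cookiejar` describe keeping several sessions apart by tagging requests with an id. The toy model below is one way to picture that: one cookie store per id, looked up from the request's meta. It is not Scrapy's implementation; `SessionStore`, its methods, and the cookie values are hypothetical names chosen for illustration.

```python
from collections import defaultdict


class SessionStore:
    """Toy model: one independent cookie dict per cookiejar id."""

    def __init__(self):
        self.jars = defaultdict(dict)

    def save_cookies(self, meta, cookies):
        # Cookies from a response go into the jar named by meta['cookiejar']
        self.jars[meta.get('cookiejar')].update(cookies)

    def cookies_for(self, meta):
        # A later request with the same id sees the cookies saved earlier
        return dict(self.jars[meta.get('cookiejar')])


store = SessionStore()
store.save_cookies({'cookiejar': 1}, {'user_session': 'alice-session'})
store.save_cookies({'cookiejar': 2}, {'user_session': 'bob-session'})

print(store.cookies_for({'cookiejar': 1}))  # only the first session's cookies
print(store.cookies_for({'cookiejar': 2}))  # only the second session's cookies
```

This is also why github_login copies `response.meta['cookiejar']` into the next request: in Scrapy the id is not carried over automatically, so each follow-up request has to name its jar again.
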
Create a debug.py script for debugging:

# -*- coding: utf-8 -*-
from scrapy import cmdline

cmdline.execute('scrapy crawl github'.split())
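
`cmdline.execute` takes the same argument list the `scrapy` command would receive, which is why the script splits the command string on whitespace. Below is a small sketch of building that list explicitly, which makes it easier to append options such as `-s NAME=VALUE` setting overrides. `build_crawl_argv` is a hypothetical helper, not part of Scrapy.

```python
def build_crawl_argv(spider_name, **settings):
    """Return an argv list for `scrapy crawl`, with optional -s overrides."""
    argv = ['scrapy', 'crawl', spider_name]
    for key, value in settings.items():
        argv += ['-s', f'{key}={value}']
    return argv


# Same list the debug script produces with str.split()
print(build_crawl_argv('github'))
# With a setting override appended
print(build_crawl_argv('github', LOG_LEVEL='INFO'))
```

The returned list can be passed straight to `cmdline.execute(...)` in debug.py.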

Edit the settings.py configuration file:

Change line 23 to:

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# False means the crawler does not follow the site's robots.txt rules
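
While debugging the login it can help to see which cookies travel with each request. The excerpt below is a possible addition to settings.py, assuming Scrapy's default cookies middleware is in use. Only `ROBOTSTXT_OBEY` comes from the original configuration; the two cookie settings are a suggestion.

```python
# settings.py (excerpt)

# Ignore robots.txt, as in the original configuration
ROBOTSTXT_OBEY = False

# Keep cookie handling on (this is Scrapy's default) so the session persists
COOKIES_ENABLED = True

# Log the Cookie / Set-Cookie headers of every request and response
COOKIES_DEBUG = True
```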