Scrapy Study Notes
1. What Is Scrapy
Scrapy is a crawler framework built on Twisted. A user only needs to implement a few custom modules to get a working crawler.
2. Advantages of Scrapy
Without Scrapy, writing a crawler by hand means using urllib or Requests to send requests and then building everything else yourself: an HTTP header wrapper, multithreading, a proxy wrapper, a deduplication class, a data storage class, and an exception detection mechanism.
3. Scrapy Architecture
Scrapy Engine: the engine. It handles the signals, messages, and communication between the Scheduler, Pipeline, Spiders, and Downloader.
Scheduler: the scheduler. In essence a queue: it accepts the Requests sent over by the Scrapy Engine, queues them, and hands queued requests back to the engine whenever the engine asks for more.
Downloader: the downloader. It accepts Requests from the Scrapy Engine, fetches them to generate Responses, and returns those to the Scrapy Engine, which passes them on to the Spiders.
Spiders: the crawlers. This is where the crawling logic lives (regular expressions, BeautifulSoup, XPath, and so on). If a Response contains a follow-up request, such as a "next page" link, the Spider hands the URL to the Scrapy Engine, which passes it to the Scheduler for queuing.
Pipeline: the pipeline. This is where deduplication and storage classes live; it handles post-processing such as filtering and persisting the data.
Downloader Middlewares: download middleware. Custom extension components; this is where we wrap proxies and HTTP headers.
Spider Middlewares: spider middleware. It can wrap the Requests sent out by Spiders and the Responses they receive.
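The component interplay described above can be sketched as a toy event loop in plain Python (no Scrapy involved; all names here are illustrative stand-ins): the engine pulls a request from the scheduler's queue, the downloader turns it into a response, the spider parses the response into items and possibly new requests, and the pipeline stores the items.

```python
from collections import deque

def toy_downloader(request):
    # Stand-in for the Downloader: pretend page N links to page N+1, up to page 3
    page = request["page"]
    return {"page": page, "body": f"content of page {page}",
            "next": page + 1 if page < 3 else None}

def toy_spider(response):
    # Stand-in for a Spider: yield an item, plus a follow-up request
    # whenever the response contains a "next page" link
    yield {"item": response["body"]}
    if response["next"] is not None:
        yield {"request": {"page": response["next"]}}

stored = []                        # Pipeline: here just an in-memory list
scheduler = deque([{"page": 1}])  # Scheduler: a FIFO queue of requests

# Engine loop: shuttle requests and responses between the components
while scheduler:
    request = scheduler.popleft()
    response = toy_downloader(request)
    for result in toy_spider(response):
        if "request" in result:
            scheduler.append(result["request"])  # back to the Scheduler
        else:
            stored.append(result["item"])        # down the Pipeline

print(stored)  # pages 1..3 crawled in order
```

This is only a mental model: the real engine is asynchronous (Twisted), and middlewares sit between the components, but the request/response/item hand-offs are the same.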
4. A Scrapy Example
4.1 Crawling the Douban Movie Top 250
There are plenty of tutorials online for setting up a Scrapy project; a quick search on Baidu (or any search engine) will find one.
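For quick reference, the usual scaffolding commands look like this (assuming Scrapy is already installed, e.g. via pip); the project name ScrapyTest matches the one used later in this note, and the spider name is an assumption:

```shell
# Create the project skeleton (settings.py, items.py, middlewares.py, pipelines.py)
scrapy startproject ScrapyTest
cd ScrapyTest
# Generate a spider template bound to a domain
scrapy genspider douban_movie movie.douban.com
# Run the spider
scrapy crawl douban_movie
```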
Custom proxy middleware. Local IP proxies are used here; for high-volume crawling you would plug in a third-party proxy service. The crawler's source IP can be disguised behind proxies like the following:
```python
import random

class specified_proxy(object):
    # Candidate proxy list; these addresses are ephemeral and will need refreshing
    PROXIES = [
        'http://183.207.95.27:80', 'http://111.6.100.99:80',
        'http://122.72.99.103:80', 'http://106.46.132.2:80',
        'http://112.16.4.99:81', 'http://123.58.166.113:9000',
        'http://118.178.124.33:3128', 'http://116.62.11.138:3128',
        'http://121.42.176.133:3128', 'http://111.13.2.131:80',
        'http://111.13.7.117:80', 'http://121.248.112.20:3128',
        'http://112.5.56.108:3128', 'http://42.51.26.79:3128',
        'http://183.232.65.201:3128', 'http://118.190.14.150:3128',
        'http://123.57.221.41:3128', 'http://183.232.65.203:3128',
        'http://166.111.77.32:3128', 'http://42.202.130.246:3128',
        'http://122.228.25.97:8101', 'http://61.136.163.245:3128',
        'http://121.40.23.227:3128', 'http://123.96.6.216:808',
        'http://59.61.72.202:8080', 'http://114.141.166.242:80',
        'http://61.136.163.246:3128', 'http://60.31.239.166:3128',
        'http://114.55.31.115:3128', 'http://202.85.213.220:3128',
    ]

    def process_request(self, request, spider):
        # Pick one proxy at random. Note: random.sample(PROXIES, 1) returns a
        # one-element list, so random.choice is the right call here.
        request.meta['proxy'] = random.choice(self.PROXIES)
```
Customize the User-Agent so the target server sees requests that appear to come from a real operating system and browser rather than a bot:
```python
import random

class specified_useragent(object):
    USER_AGENT_LIST = [
        "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Acoo Browser; SLCC1; .NET CLR 2.0.50727; Media Center PC 5.0; .NET CLR 3.0.04506)",
        "Mozilla/4.0 (compatible; MSIE 7.0; AOL 9.5; AOLBuild 4337.35; Windows NT 5.1; .NET CLR 1.1.4322; .NET CLR 2.0.50727)",
        "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
        "Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
        "Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
        "Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1",
        "Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0",
        "Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5",
        "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20",
        "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; fr) Presto/2.9.168 Version/11.52",
    ]

    def process_request(self, request, spider):
        # Pick a random user agent. The header name is "User-Agent"
        # (the original code's "USER_AGNET" was a typo and had no effect).
        agent = random.choice(self.USER_AGENT_LIST)
        request.headers['User-Agent'] = agent
```
Once the custom middlewares are written, register them in settings.py:
```python
# The smaller the number, the higher the priority
DOWNLOADER_MIDDLEWARES = {
    'ScrapyTest.middlewares.specified_proxy': 543,
    'ScrapyTest.middlewares.specified_useragent': 544,
}
```
Define the data fields in items.py:
```python
import scrapy

class ScrapytestItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    serial_number = scrapy.Field()  # movie rank
    movie_name = scrapy.Field()     # movie title
    introduce = scrapy.Field()      # movie introduction
    star = scrapy.Field()           # rating
    evaluate = scrapy.Field()       # number of reviews
    describe = scrapy.Field()       # movie description
```
Configure data storage in the pipeline (pipelines.py) and connect to MongoDB:
```python
import pymongo

# Connection settings defined in settings.py (see below)
from ScrapyTest.settings import (monodb_host, monodb_port,
                                 monodb_db_name, monodb_tb_name)

class ScrapytestPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient(host=monodb_host, port=monodb_port)
        mydb = client[monodb_db_name]
        self.post = mydb[monodb_tb_name]

    def process_item(self, item, spider):
        data = dict(item)
        # insert_one replaces the deprecated Collection.insert
        self.post.insert_one(data)
        return item
```
Database settings in settings.py:
```python
monodb_host = "127.0.0.1"
monodb_port = 27017
monodb_db_name = "scrapy_test"
monodb_tb_name = "douban_movie"
```
The output after running main:
The inserted data can now be seen in the MongoDB database:
```shell
use scrapy_test;
show collections;
db.douban_movie.find().pretty()
```
4.2 Source Code
https://github.com/cjy513203427/ScrapyTest