Scraping the Qidian (起點中文網) Rankings with Scrapy
阿新 • Published: 2018-07-18
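Project name: qidian
Project description: Use Scrapy to scrape the 500 novels on the overall "Completed Works" ranking (完本榜) of Qidian (起點中文網), collecting each novel's title, author, and category, and save the results as a CSV file.
Target URL: https://www.qidian.com/rank/fin?style=1
Requirements:
1. Novel title
2. Author
3. Novel category
Step 1: Create the project from the shell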
scrapy startproject qidian
Step 2: Edit items.py to match the project requirements
# -*- coding: utf-8 -*-
import scrapy


class QidianItem(scrapy.Item):
    name = scrapy.Field()      # novel title
    author = scrapy.Field()    # author name
    category = scrapy.Field()  # novel category
Step 3: Analyze the page, extract the data with XPath or CSS selectors, and create and edit spider.py
# -*- coding: utf-8 -*-
import scrapy
from ..items import QidianItem


class QidianSpider(scrapy.Spider):
    name = 'qidian'
    start_urls = ['https://www.qidian.com/rank/fin?style=1&dateType=3']

    def parse(self, response):
        # Each book on the ranking page sits in its own "book-mid-info" block.
        sel = response.xpath('//div[@class="book-mid-info"]')
        for i in sel:
            name = i.xpath('./h4/a/text()').extract_first()
            # Within the "author" paragraph, the first link is the author
            # and the last link is the category.
            author = i.xpath('./p[@class="author"]/a[1]/text()').extract_first()
            category = i.xpath('./p[@class="author"]/a[last()]/text()').extract_first()
            item = QidianItem()
            item['name'] = name
            item['author'] = author
            item['category'] = category
            yield item
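The relative XPath expressions in parse() can be tried out without hitting the site. The following is only a minimal sketch: the markup in SAMPLE_HTML is a hypothetical, simplified fragment (the real page is larger and may be structured differently), and it uses the standard library's xml.etree.ElementTree, which supports just a subset of XPath and is not the selector engine Scrapy uses. The helper name extract_books is likewise made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup standing in for the ranking page.
# The real page contains much more and may not be well-formed XML.
SAMPLE_HTML = """
<div>
  <div class="book-mid-info">
    <h4><a href="#">Sample Novel A</a></h4>
    <p class="author"><a href="#">Author A</a><em>|</em><a href="#">Fantasy</a></p>
  </div>
  <div class="book-mid-info">
    <h4><a href="#">Sample Novel B</a></h4>
    <p class="author"><a href="#">Author B</a><em>|</em><a href="#">Wuxia</a></p>
  </div>
</div>
"""


def extract_books(markup):
    """Apply the same relative paths as the spider to each book block."""
    root = ET.fromstring(markup.strip())
    books = []
    for block in root.findall(".//div[@class='book-mid-info']"):
        books.append({
            # Title: the link inside the <h4>.
            "name": block.findtext("./h4/a"),
            # Author: first link inside <p class="author">.
            "author": block.findtext("./p[@class='author']/a[1]"),
            # Category: last link inside <p class="author">.
            "category": block.findtext("./p[@class='author']/a[last()]"),
        })
    return books


if __name__ == "__main__":
    for book in extract_books(SAMPLE_HTML):
        print(book)
```

For checking selectors against the live page, the `scrapy shell <url>` command is the more direct route, since it exposes the same `response.xpath(...)` API the spider uses.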
The spider so far handles a single page; the next step is to follow the links to the remaining pages. Because the project is so small, I don't think anything elaborate is needed: the URL pattern is easy to spot, and the pagination code follows directly from it. The complete spider.py is below.
# -*- coding: utf-8 -*-
import scrapy
from ..items import QidianItem


class QidianSpider(scrapy.Spider):
    name = 'qidian'
    start_urls = ['https://www.qidian.com/rank/fin?style=1&dateType=3']
    n = 1  # current page number, starting at page 1

    def parse(self, response):
        sel = response.xpath('//div[@class="book-mid-info"]')
        for i in sel:
            name = i.xpath('./h4/a/text()').extract_first()
            author = i.xpath('./p[@class="author"]/a[1]/text()').extract_first()
            category = i.xpath('./p[@class="author"]/a[last()]/text()').extract_first()
            item = QidianItem()
            item['name'] = name
            item['author'] = author
            item['category'] = category
            yield item

        # Request the next page until page 25 has been scheduled.
        if self.n < 25:
            self.n += 1  # n is the page number
            next_url = 'https://www.qidian.com/rank/fin?style=1&dateType=3&page=%d' % self.n
            # The callback must be the bound method self.parse; a bare
            # `parse` is undefined here and raises NameError.
            yield scrapy.Request(next_url, callback=self.parse)
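The pagination rule the spider relies on (same query string plus a `page` parameter, stopping at page 25) can be isolated in a few lines. This is a sketch using only the standard library; the helper names page_url and next_page_url and the constants are illustrative assumptions, not part of the original project.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.qidian.com/rank/fin"
LAST_PAGE = 25  # same upper bound the spider uses


def page_url(page):
    """Build the ranking URL for a given page number."""
    params = {"style": 1, "dateType": 3, "page": page}
    return BASE_URL + "?" + urlencode(params)


def next_page_url(current_page, last_page=LAST_PAGE):
    """Return the URL of the following page, or None after the last page."""
    if current_page < last_page:
        return page_url(current_page + 1)
    return None


if __name__ == "__main__":
    print(next_page_url(1))
    print(next_page_url(LAST_PAGE))
```

Keeping the page arithmetic in a pure function like this makes the stop condition easy to check on its own, separately from any network requests.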
Step 4: Run the spider and save the data
scrapy crawl qidian -o qidian.csv
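Once the crawl finishes, a quick sanity check of the exported file can confirm the row count and catch records with missing fields. The sketch below assumes the CSV has a header row with the columns name, author, and category (the item's field names); SAMPLE_CSV and the summarize helper are made up for illustration.

```python
import csv
import io

# Hypothetical sample in the shape the export is assumed to have.
SAMPLE_CSV = (
    "name,author,category\r\n"
    "Sample Novel A,Author A,Fantasy\r\n"
    "Sample Novel B,Author B,Wuxia\r\n"
)

FIELDS = ("name", "author", "category")


def summarize(csv_file):
    """Return (total rows, rows with at least one empty or missing field)."""
    rows = list(csv.DictReader(csv_file))
    incomplete = [row for row in rows if not all(row.get(f) for f in FIELDS)]
    return len(rows), len(incomplete)


if __name__ == "__main__":
    total, incomplete = summarize(io.StringIO(SAMPLE_CSV))
    print(total, incomplete)
```

To run it against the real export, open the file instead of the sample, for example `with open("qidian.csv", newline="", encoding="utf-8") as f: print(summarize(f))`. A complete crawl should report 500 rows.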