Scraping the Douban Movie Top 250 with Scrapy
阿新 • Published: 2018-11-10
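A simple exercise based on the official Scrapy documentation. The only problem I ran into was that the crawl came back with HTTP 403. The fix is to add the following setting to settings.py: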
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
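To see what this setting changes on the wire, here is a minimal standard-library sketch that builds a request carrying the same User-Agent header. It is not part of the Scrapy project; the helper names `build_request` and `fetch_status` are made up for illustration, and whether the site answers 403 or 200 for a given header is an assumption based on the behaviour described above.

```python
import urllib.error
import urllib.request

# Same value as the USER_AGENT setting in settings.py.
USER_AGENT = (
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
)


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch_status(url, user_agent=USER_AGENT):
    """Send the request and return the HTTP status code (needs network access)."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent), timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the status code is still what we want.
        return err.code
```

Calling `fetch_status('https://movie.douban.com/top250')` once with the default argument and once with a different `user_agent` string is a quick way to check whether the header is what triggers the 403.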
Here is the spider:
# -*- coding: utf-8 -*-
import scrapy


class MoviesSpider(scrapy.Spider):
    name = 'movies'
    allowed_domains = ['movie.douban.com']
    start_urls = ['https://movie.douban.com/top250']

    def parse(self, response):
        # Each <li> inside ol.grid_view is one movie entry.
        grid_view = response.css('ol.grid_view')
        for li_item in grid_view.css('li'):
            yield {
                'rank': li_item.css('div.item div.pic em::text').extract_first(),
                'url': li_item.css('div.item div.pic a::attr(href)').extract_first(),
                'title_zh': li_item.css('div.hd a span:first-child::text').extract_first(),
                'title_en': li_item.css('div.hd a span:nth-child(2)::text').extract_first(),
                'title_tw': li_item.css('div.hd a span:last-child::text').extract_first(),
                'editor': li_item.css('div.bd p:first-child::text').extract_first(),
                'star': li_item.css('div.bd div.star span.rating_num::text').extract_first(),
                # re_first() returns None when nothing matches, whereas
                # .re(...)[0] would raise IndexError.
                'votes': li_item.css('div.bd div.star span:last-child::text').re_first(r'(\d+)'),
                'desc': li_item.css('span.inq::text').extract_first(),
            }

        # Follow the "next page" link until the last page has none.
        next_page = response.css('span.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, self.parse)
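The spider passes the raw `href` of the next-page link to `response.follow`, which accepts relative URLs and resolves them against the current page. The sketch below shows that resolution step with the standard library only. The sample value `?start=25&filter=` is an assumption about what the link looks like, and `resolve_next` is a hypothetical helper, not something Scrapy provides.

```python
from urllib.parse import urljoin

PAGE_URL = 'https://movie.douban.com/top250'
# Assumed example of a relative next-page href; the real value may differ.
NEXT_HREF = '?start=25&filter='


def resolve_next(page_url, href):
    """Resolve a (possibly relative) href against the URL of the current page."""
    return urljoin(page_url, href)


# A query-only href keeps the path of the current page and swaps the query.
next_url = resolve_next(PAGE_URL, NEXT_HREF)
```

With `scrapy.Request` the URL would have to be made absolute first (for example via `response.urljoin`), which is why `response.follow` is the shorter choice here.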