
python3 + scrapy: scraping BOSS Zhipin job postings

Preface: this article is a record of an implementation process and cites other articles; if anything is unclear, refer to the originals. Since the main purpose is note-taking, steps the author is already familiar with may be performed directly without explanation; if you have questions, leave a comment or use a search engine.

References:

Installing Scrapy on Windows

Creating your first Scrapy project

1. Install Scrapy

Open PowerShell as administrator and enter:

pip install scrapy

PS: pip must be installed before this step; search online for instructions if needed.
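As a quick sanity check (not part of the original steps, just a suggestion), the install can be verified from a Python prompt:

import scrapy
print(scrapy.__version__)   # prints the installed Scrapy version string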

 

2. Create a Scrapy project in a directory of your choice

scrapy startproject boss
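For orientation, startproject generates roughly the following layout (details may differ slightly between Scrapy versions):

boss/
    scrapy.cfg            # deploy configuration
    boss/                 # the project's Python package
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py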

 

3. Enter the project directory

cd boss

 

4. Create the spider

scrapy genspider bosszhipin www.zhipin.com

 

5. Import the project into PyCharm and edit settings.py

Change ROBOTSTXT_OBEY = True to ROBOTSTXT_OBEY = False.
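In settings.py this is a single line (shown here for clarity):

# settings.py
ROBOTSTXT_OBEY = False   # do not honour robots.txt for this project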

 

6. Write bosszhipin.py and run.py

# -*- coding: utf-8 -*-
import scrapy


class BosszhipinSpider(scrapy.Spider):
    name = 'bosszhipin'
    allowed_domains = ['www.zhipin.com']
    start_urls = ['https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1']

    def parse(self, response):
        print(response.text)

 

Put run.py in the project root directory:

from scrapy.cmdline import execute
execute(['scrapy','crawl','bosszhipin'])

 

Running it produces the following error:

2018-11-04 13:03:36 [scrapy.core.engine] INFO: Spider opened
2018-11-04 13:03:36 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-04 13:03:36 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-11-04 13:03:37 [scrapy.core.engine] DEBUG: Crawled (403) <GET https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1> (referer: None)
2018-11-04 13:03:37 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <403 https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1>: HTTP status code is not handled or not allowed
2018-11-04 13:03:37 [scrapy.core.engine] INFO: Closing spider (finished)

 

The request was rejected with a 403, which suggests anti-scraping measures. Modify the downloader middleware to change the request headers.

Add the following to middlewares.py:

import random


class UserAgentMiddleware(object):

    def __init__(self, user_agent_list):
        self.user_agent = user_agent_list

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # read the MY_USER_AGENT field from the settings
        middleware = cls(crawler.settings.get('MY_USER_AGENT'))
        return middleware

    def process_request(self, request, spider):
        # pick a random user-agent for each request
        request.headers['user-agent'] = random.choice(self.user_agent)

Enable the middleware and set the MY_USER_AGENT value in settings.py:

USER_AGENT = 'boss (+http://www.yourdomain.com)'
...
DOWNLOADER_MIDDLEWARES = {
   'boss.middlewares.BossDownloaderMiddleware': 543,
}

(The lines above already exist in the generated settings.py, just commented out. Try enabling them first to see whether that is enough; only look for another solution if it is not.)
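For reference, a minimal settings.py sketch that enables the custom middleware from above and supplies MY_USER_AGENT (the user-agent strings are only placeholders; swap in whatever list you prefer):

# settings.py -- sketch, adjust to your project
MY_USER_AGENT = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0 Safari/605.1.15',
]

DOWNLOADER_MIDDLEWARES = {
    'boss.middlewares.UserAgentMiddleware': 543,
}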

Run run.py again and the page HTML is now retrieved.

 

Complete code for the first stage. MongoDB storage is planned for later, since simply printing the scraped text to the console is not much use on its own (a possible pipeline sketch follows after the code)...

# -*- coding: utf-8 -*-
import scrapy


class BosszhipinSpider(scrapy.Spider):
    name = 'bosszhipin'
    allowed_domains = ['www.zhipin.com']
    start_urls = ['https://www.zhipin.com/c101270100-p100101/?page=1&ka=page-1']

    def parse(self, response):
        # print(response.text)

        # the <ul> under #main holds the job list; each <li> is one job posting
        job_node_table = response.xpath("//*[@id=\"main\"]/div/div[2]/ul")
        job_node_list = job_node_table.xpath("./li")
        for job_node in job_node_list:
            # sub-nodes for company name, salary, requirements and update time
            enterprise_node = job_node.xpath("./div/div[2]/div/h3/a")
            salary_node = job_node.xpath("./div/div[1]/h3/a/span")
            requirement_node = job_node.xpath("./div/div[1]/p")
            time_node = job_node.xpath("./div/div[3]/p")

            # string(.) collapses each node's text content into a single string
            enterprise = enterprise_node.xpath('string(.)')
            salary = salary_node.xpath('string(.)')
            requirement = requirement_node.xpath('string(.)')
            time = time_node.xpath('string(.)')

            print("Company", enterprise.extract_first().strip())
            print("Salary", salary.extract_first().strip())
            print("Requirements", requirement.extract_first().strip())
            print("Updated", time.extract_first().strip())
            print()
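As one possible direction for the MongoDB stage mentioned above, here is a minimal item pipeline sketch. It assumes pymongo is installed and MongoDB is running locally; the class, database and collection names are made up for illustration, and parse() would have to yield items instead of printing them:

# pipelines.py -- sketch only, not part of the working code above
import pymongo


class MongoPipeline(object):

    def open_spider(self, spider):
        # connect once when the spider starts
        self.client = pymongo.MongoClient('mongodb://localhost:27017')
        self.db = self.client['boss']

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # store each scraped item as one document in the 'jobs' collection
        self.db['jobs'].insert_one(dict(item))
        return item

The pipeline would also need to be enabled via ITEM_PIPELINES in settings.py.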