
<Scrapy spider> Scraping Tencent job postings


1. Create the Scrapy project

In a DOS window, run:

scrapy startproject tencent
cd tencent

2. Write items.py (this is the data template: the fields to scrape are declared here)

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # job title
    positionname = scrapy.Field()
    # detail-page link
    positionlink = scrapy.Field()
    # job category
    positionType = scrapy.Field()
    # number of openings
    positionNum = scrapy.Field()
    # work location
    positioncation = scrapy.Field()
    # publication date
    positionTime = scrapy.Field()

3. Create the spider file

In a DOS window, run:

scrapy genspider myspider tencent.com

4. Write myspider.py (receive the responses and parse the data)

# -*- coding: utf-8 -*-
import scrapy
from tencent.items import TencentItem


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['tencent.com']
    url = 'https://hr.tencent.com/position.php?&start='
    offset = 0
    start_urls = [url + str(offset)]

    def parse(self, response):
        for each in response.xpath('//tr[@class="even"]|//tr[@class="odd"]'):
            # initialize an item object
            item = TencentItem()
            # job title
            item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
            # detail-page link
            item['positionlink'] = 'http://hr.tencent.com/' + each.xpath("./td[1]/a/@href").extract()[0]
            # job category
            item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
            # number of openings
            item['positionNum'] = each.xpath("./td[3]/text()").extract()[0]
            # work location
            item['positioncation'] = each.xpath("./td[4]/text()").extract()[0]
            # publication date
            item['positionTime'] = each.xpath("./td[5]/text()").extract()[0]
            yield item
        # follow the next page until the last offset; stop by simply not
        # yielding another request (raising a bare string is a TypeError)
        if self.offset < 2820:
            self.offset += 10
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
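The pagination scheme above just grows the `start` offset by 10 (one page of rows) until it reaches 2820. A stdlib-only sketch makes the URL sequence the spider will request explicit:

```python
# Stdlib-only sketch of the spider's pagination: offsets 0, 10, 20, ..., 2820.
BASE_URL = "https://hr.tencent.com/position.php?&start="

def page_urls(last_offset=2820, step=10):
    """Return every listing URL the spider will request, in order."""
    return [BASE_URL + str(offset) for offset in range(0, last_offset + 1, step)]

urls = page_urls()
print(len(urls))   # 283 pages in total
print(urls[0])     # https://hr.tencent.com/position.php?&start=0
print(urls[-1])    # https://hr.tencent.com/position.php?&start=2820
```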

5. Write pipelines.py (store the data)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

class TencentPipeline(object):
    def __init__(self):
        self.filename = open('tencent.json', 'wb')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.filename.write(text.encode('utf-8'))
        return item

    def close_spider(self, spider):
        self.filename.close()
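The pipeline writes one JSON object per line, keeping Chinese text readable via `ensure_ascii=False`. A minimal stdlib sketch of the same serialization step, using a plain dict in place of a `TencentItem` (which `dict(item)` converts the same way) and invented sample values:

```python
import json

# A plain dict stands in for a TencentItem; the field values are made up.
item = {
    "positionname": "后台开发工程师",
    "positionType": "技术类",
    "positionNum": "2",
}

# The same serialization the pipeline performs: non-ASCII kept readable,
# one JSON object per line, terminated by ",\n".
text = json.dumps(item, ensure_ascii=False) + ',\n'
data = text.encode('utf-8')   # bytes, since the file is opened in 'wb' mode

print(text, end='')
```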

6. Write settings.py (configure headers, pipelines, etc.)

robots protocol

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  

headers

DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  # 'Accept-Language': 'en',
}

pipelines

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

7. Run the spider

In a DOS window, run:

scrapy crawl myspider 

Result:

(screenshots omitted)

Check the debug log:

2019-02-18 16:02:22 [scrapy.core.scraper] ERROR: Spider error processing <GET https://hr.tencent.com/position.php?&start=520> (referer: https://hr.tencent.com/position.php?&start=510)
Traceback (most recent call last):
  File "E:\software\ANACONDA\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\123\tencent\tencent\spiders\myspider.py", line 22, in parse
    item['positionType'] = each.xpath("./td[2]/text()").extract()[0]

Check the page itself:

(screenshot omitted)

This particular posting is missing one field (these sites are full of surprises!). So change one line in myspider.py:

item['positionType'] = each.xpath("./td[2]/text()").extract()[0]

Add a guard, changing it to:

if len(each.xpath("./td[2]/text()").extract()) > 0:
    item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
else:
    item['positionType'] = "None"

Result:

(screenshot omitted)

Check the last page on the site:

(screenshot omitted)

Scraped successfully!
