
Scraping Zhaopin (智聯招聘) job listings — two implementations: the Scrapy framework and the requests library

# First, analyze the target site. The listing data turns out to come from a JSON API; to get each company's details you then have to request the company page separately.
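Before writing the full scraper it helps to confirm the JSON structure by hand. The sketch below is only a quick check: the URL parameters and the field names (data.results[i].company, salary) are the ones used in the code further down, not anything guaranteed by an official API.

# Quick check of the search API response shape (fields assumed from the code below).
import json
import requests

api = ("https://fe-api.zhaopin.com/c/i/sou?start=0&pageSize=60&cityId=530"
       "&workExperience=-1&education=-1&companyType=-1&employmentType=-1"
       "&jobWelfareTag=-1&kw=python&kt=3")

data = json.loads(requests.get(api).text)
first = data["data"]["results"][0]
print(first["company"]["name"])   # company name
print(first["company"]["url"])    # company detail page, fetched again later
print(first["salary"])            # salary string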

# 1. Scraping and saving directly with requests

## Points to note:

  1. Different companies' detail pages may use different layouts, so you need to detect which layout you got and branch accordingly (a sketch of one way to do this follows this list).
  2. When saving to a CSV file, open it with newline='', otherwise every data row is followed by a blank line.
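As a rough illustration of point 1, one way to branch on layout is to try the selector for the common layout first and fall back to another when it returns nothing. The second XPath here is purely hypothetical; the real fallback selector has to be read off the actual alternative page.

# Minimal sketch: try the usual layout, fall back to a hypothetical alternative one.
from lxml import etree

def extract_address(html):
    tree = etree.HTML(html)
    addr = tree.xpath('//table[@class="comTinyDes"]//span[@class="comAddress"]/text()')
    if not addr:
        # hypothetical selector for an alternative page layout
        addr = tree.xpath('//div[@class="company-address"]/text()')
    return addr[0].strip() if addr else ""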
import requests
import json
from lxml import etree
import csv

lists = []
# Only the first page of results is fetched here; raise the range to crawl more pages.
for n in range(0, 1):
    url = "https://fe-api.zhaopin.com/c/i/sou?start={}&pageSize=60&cityId=530&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22p%22:2,%22pageSize%22:%2260%22,%22jl%22:%22530%22,%22kw%22:%22python%22,%22kt%22:%223%22%7D".format(n*60)

    # The search API returns JSON with 60 results per page.
    response = json.loads(requests.get(url).text)
    for i in range(0, 60):
        page = response["data"]["results"][i]["company"]["url"]
        # Short company URLs point at the standard company page layout;
        # longer ones use a different template and are skipped here.
        if len(page) < 48:
            html = requests.get(page).text
            a = etree.HTML(html)

            # Address and introduction only exist on the company detail page.
            dizi = a.xpath('//table[@class="comTinyDes"]//span[@class="comAddress"]/text()')
            jianjie = a.xpath('string(//div[@class="part2"]//div)').strip()

            # Name, size and salary are already in the JSON search result.
            gongsi = response["data"]["results"][i]["company"]["name"]
            guimo = response["data"]["results"][i]["company"]["size"]["name"]
            xinchou = response["data"]["results"][i]["salary"]

            lists.append([i + 1, gongsi, page, guimo, xinchou, dizi, jianjie])
            print(gongsi, page, guimo, xinchou, dizi, jianjie)
            print("*" * 50)
        else:
            continue

# newline='' keeps csv.writer from inserting a blank line after every row on Windows.
with open("aa.csv", 'w', encoding='utf-8', newline='') as f:
    k = csv.writer(f, dialect='excel')
    k.writerow(["數量", "公司", "網址", "規模", "薪酬", "地址", "簡介"])
    for row in lists:
        k.writerow(row)

# 2. Scraping with the Scrapy framework

Points to note:

  1. Filling one item across two callbacks: build the item in the first callback, hand it to the next request through meta, and finish it in the second callback. The snippet below (from the Zhihu answer cited underneath) shows the pattern; note that the second method must be the callback named in the Request, not a second parse:

def parse(self, response):
    item = ItemClass()
    # pass the half-filled item to the next callback through meta
    yield Request(url, meta={'item': item}, callback=self.parse_item)

def parse_item(self, response):
    item = response.meta['item']
    item['field'] = value
    yield item

Author: 何健
Link: https://www.zhihu.com/question/54773510/answer/141177867
Source: Zhihu
Copyright belongs to the author. Contact the author for permission before commercial reprinting; for non-commercial reprinting, credit the source.
  2. Handling the blank-line problem when exporting to CSV with scrapy crawl zhilian -o aaa.csv: modify Scrapy's source (here at D:\Python36\Lib\site-packages\scrapy\exporters.py) by adding a newline="" argument:
class CsvItemExporter(BaseItemExporter):
    def __init__(self, file, include_headers_line=True, join_multivalued=',', **kwargs):
        ...
        self.stream = io.TextIOWrapper(
            file,
            newline="",     # 新新增的
            line_buffering=False,
            write_through=True,
            encoding=self.encoding
        ) if six.PY3 else file

This fix comes from 範翻番樊's CSDN blog; full text at: https://blog.csdn.net/u011361138/article/details/79912895?utm_source=copy
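Editing files under site-packages is fragile, because the change disappears on the next Scrapy upgrade. A less invasive option is to register a custom exporter through the FEED_EXPORTERS setting. The sketch below subclasses CsvItemExporter and re-wraps its stream with newline=""; the module and class names (zhilianzp/exporters.py, FixedCsvItemExporter) are made up for the example, it assumes Python 3, and newer Scrapy releases may already include this fix, in which case none of this is needed.

# zhilianzp/exporters.py  (hypothetical module for this project)
import csv
import io

from scrapy.exporters import CsvItemExporter


class FixedCsvItemExporter(CsvItemExporter):
    """CsvItemExporter whose text stream is opened with newline=""."""

    def __init__(self, file, **kwargs):
        super().__init__(file, **kwargs)
        # Detach Scrapy's wrapper and re-wrap the raw file with newline=""
        # so csv.writer does not emit a blank line after every row on Windows.
        raw = self.stream.detach()
        self.stream = io.TextIOWrapper(
            raw,
            newline="",
            line_buffering=False,
            write_through=True,
            encoding=self.encoding,
        )
        # Rebind the writer to the new stream (default dialect, for simplicity).
        self.csv_writer = csv.writer(self.stream)


# settings.py
# FEED_EXPORTERS = {
#     'csv': 'zhilianzp.exporters.FixedCsvItemExporter',
# }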
  3. Fixing the import error when Scrapy cannot resolve the items module path: this is an editor problem, because PyCharm does not automatically add the current project directory to its source path.

The concrete fix is as follows:

1) Right-click the Scrapy project folder in PyCharm

2) Choose Mark Directory as

3) Click Sources Root

4) Once the folder turns blue, it worked

# Finally, the Scrapy spider code for scraping Zhaopin:

# -*- coding: utf-8 -*-
import scrapy
import json
from zhilianzp.items import ZhilianzpItem


class ZhilianSpider(scrapy.Spider):
    name = 'zhilian'

    def start_requests(self):
        url = "https://fe-api.zhaopin.com/c/i/sou?start=0&pageSize=60&cityId=530&workExperience=-1&education=-1&companyType=-1&employmentType=-1&jobWelfareTag=-1&kw=python&kt=3&lastUrlQuery=%7B%22p%22:2,%22pageSize%22:%2260%22,%22jl%22:%22530%22,%22kw%22:%22python%22,%22kt%22:%223%22%7D"
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        content = json.loads(response.text)
        for i in range(0, 60):
            page = content["data"]["results"][i]["company"]["url"]
            # Only the standard company page layout (short URL) is handled.
            if len(page) < 48:
                # Create a fresh item for every result; a single shared instance
                # would be overwritten before the company-page callbacks run.
                item = ZhilianzpItem()
                item["gongsi"] = content["data"]["results"][i]["company"]["name"]
                item["guimo"] = content["data"]["results"][i]["company"]["size"]["name"]
                item["xinchou"] = content["data"]["results"][i]["salary"]
                # Pass the half-filled item to the company-page callback via meta.
                yield scrapy.Request(page, meta={"key": item}, callback=self.next_parse)
            else:
                continue

    def next_parse(self, response):
        item = response.meta['key']
        # Address and introduction are only available on the company page.
        item["dizi"] = response.xpath('//table[@class="comTinyDes"]//span[@class="comAddress"]/text()').extract()
        item["jianjie"] = response.xpath('string(//div[@class="part2"]//div)').extract_first()
        yield item
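For completeness, here is a sketch of the zhilianzp/items.py that the spider imports. The field names are inferred from the assignments in the spider above; the rest of the file is assumed.

# zhilianzp/items.py -- a sketch inferred from the fields the spider assigns
import scrapy


class ZhilianzpItem(scrapy.Item):
    gongsi = scrapy.Field()    # company name
    guimo = scrapy.Field()     # company size
    xinchou = scrapy.Field()   # salary
    dizi = scrapy.Field()      # company address
    jianjie = scrapy.Field()   # company introduction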