
<Scrapy spider> Scraping Tencent job postings


1. Create the Scrapy project

In a DOS window, run:

scrapy startproject tencent
cd tencent

2. Write items.py (this is the data template: the fields to scrape are declared here)

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentItem(scrapy.Item):
    # define the fields for your item here like:
    # job title
    positionname = scrapy.Field()
    # detail-page link
    positionlink = scrapy.Field()
    # job category
    positionType = scrapy.Field()
    # number of openings
    positionNum = scrapy.Field()
    # work location
    positioncation = scrapy.Field()
    # publication date
    positionTime = scrapy.Field()

3. Create the spider file

In a DOS window, run:

scrapy genspider myspider tencent.com

4. Write myspider.py (receive the responses and parse the data)

# -*- coding: utf-8 -*-
import scrapy
from tencent.items import TencentItem


class MyspiderSpider(scrapy.Spider):
    name = 'myspider'
    allowed_domains = ['tencent.com']
    url = 'https://hr.tencent.com/position.php?&start='
    offset = 0
    start_urls = [url + str(offset)]

    def parse(self, response):
        for each in response.xpath('//tr[@class="even"]|//tr[@class="odd"]'):
            # initialize an item object
            item = TencentItem()
            # job title
            item['positionname'] = each.xpath("./td[1]/a/text()").extract()[0]
            # detail-page link
            item['positionlink'] = 'http://hr.tencent.com/' + each.xpath("./td[1]/a/@href").extract()[0]
            # job category
            item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
            # number of openings
            item['positionNum'] = each.xpath("./td[3]/text()").extract()[0]
            # work location
            item['positioncation'] = each.xpath("./td[4]/text()").extract()[0]
            # publication date
            item['positionTime'] = each.xpath("./td[5]/text()").extract()[0]
            yield item
        # follow the next page until the last offset; stop by simply not
        # yielding another request (raising a bare string is a TypeError)
        if self.offset < 2820:
            self.offset += 10
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
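The pagination scheme above just grows the `start` offset by 10 (one page of rows) until it reaches 2820. A stdlib-only sketch makes the URL sequence the spider will request explicit:

```python
# Stdlib-only sketch of the spider's pagination: offsets 0, 10, 20, ..., 2820.
BASE_URL = "https://hr.tencent.com/position.php?&start="

def page_urls(last_offset=2820, step=10):
    """Return every listing URL the spider will request, in order."""
    return [BASE_URL + str(offset) for offset in range(0, last_offset + 1, step)]

urls = page_urls()
print(len(urls))   # 283 pages in total
print(urls[0])     # https://hr.tencent.com/position.php?&start=0
print(urls[-1])    # https://hr.tencent.com/position.php?&start=2820
```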

5. Write pipelines.py (store the data)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

class TencentPipeline(object):
    def __init__(self):
        self.filename = open('tencent.json', 'wb')

    def process_item(self, item, spider):
        text = json.dumps(dict(item), ensure_ascii=False) + ',\n'
        self.filename.write(text.encode('utf-8'))
        return item

    def close_spider(self, spider):
        self.filename.close()
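The pipeline writes one JSON object per line, keeping Chinese text readable via `ensure_ascii=False`. A minimal stdlib sketch of the same serialization step, using a plain dict in place of a `TencentItem` (which `dict(item)` converts the same way) and invented sample values:

```python
import json

# A plain dict stands in for a TencentItem; the field values are made up.
item = {
    "positionname": "后台开发工程师",
    "positionType": "技术类",
    "positionNum": "2",
}

# The same serialization the pipeline performs: non-ASCII kept readable,
# one JSON object per line, terminated by ",\n".
text = json.dumps(item, ensure_ascii=False) + ',\n'
data = text.encode('utf-8')   # bytes, since the file is opened in 'wb' mode

print(text, end='')
```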

6. Write settings.py (configure headers, pipelines, etc.)

robots protocol

# Obey robots.txt rules
ROBOTSTXT_OBEY = False  

headers

DEFAULT_REQUEST_HEADERS = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
  # 'Accept-Language': 'en',
}

pipelines

ITEM_PIPELINES = {
    'tencent.pipelines.TencentPipeline': 300,
}

7. Run the spider

In a DOS window, run:

scrapy crawl myspider 

Result:

(screenshots omitted)

Check the debug log:

2019-02-18 16:02:22 [scrapy.core.scraper] ERROR: Spider error processing <GET https://hr.tencent.com/position.php?&start=520> (referer: https://hr.tencent.com/position.php?&start=510)
Traceback (most recent call last):
  File "E:\software\ANACONDA\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
    yield next(it)
  File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 30, in process_spider_output
    for x in result:
  File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "E:\software\ANACONDA\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\123\tencent\tencent\spiders\myspider.py", line 22, in parse
    item['positionType'] = each.xpath("./td[2]/text()").extract()[0]

Check the page itself:

(screenshot omitted)

This particular posting is missing one field (these sites are full of surprises!). So change one line in myspider.py:

item['positionType'] = each.xpath("./td[2]/text()").extract()[0]

Add a guard, changing it to:

if len(each.xpath("./td[2]/text()").extract()) > 0:
    item['positionType'] = each.xpath("./td[2]/text()").extract()[0]
else:
    item['positionType'] = "None"

Result:

(screenshot omitted)

Check the last page on the site:

(screenshot omitted)

Scraped successfully!
