
Web Crawler Notes (Day 5): Tencent Recruitment & Lagou

The analysis process is the same as for Lianjia.

The complete code for the Tencent recruitment site is as follows:

import requests
from lxml import etree
from mysql_class import Mysql  # our own MySQL wrapper class


def txshezhao(keywords, page):
    '''
    :param keywords: search keyword used to filter the job listings
    :param page: controls how many result pages are crawled
    :return: stores the scraped fields in the tengxun table of the text database
    '''
    count = 0
    while count <= page:   # pages 0..page, 10 jobs per page (start = count * 10)
        url = 'https://hr.tencent.com/position.php?keywords={}&lid=2156&tid=87&start={}#a'.format(keywords, count*10)
        count += 1
        
        headers = {
            'Cookie': '_ga=GA1.2.552710032.1529846866; pgv_pvi=5319122944; PHPSESSID=a7let8q1aup7j9p40mubjq8h64; pgv_si=s6819970048',
            'Host': 'hr.tencent.com',
            'Referer': 'https://hr.tencent.com/position.php?keywords=%E8%AF%B7%E8%BE%93%E5%85%A5%E5%85%B3%E9%94%AE%E8%AF%8D&lid=2156&tid=87',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
        }
        
        res = requests.get(url, headers=headers)
        html = etree.HTML(res.text)
        
        for every in range(2, 12):  # table rows 2-11 hold the 10 job listings
            res_href = html.xpath('//table[@class="tablelist"]/tr[{}]/td[1]/a/@href'.format(every))
            href = 'https://hr.tencent.com/' + res_href[0]
            # print(href)   # detail-page URL of each of the 10 jobs on this page
            res = requests.get(href, headers=headers)
            # print(res.text)
            html1 = etree.HTML(res.text)
            info1 = html1.xpath('//td[@id="sharetitle"]//text()')
            
            job_name = str(info1[0])
            # print(job_name)
            res_msg = html1.xpath('//tr[@class ="c bottomline"]/td//text()')
            # print(res_msg)   # e.g. ['工作地點:', '北京', '職位類別:', '技術類', '招聘人數:', '1人'] (location / category / headcount)
            
            address = str(res_msg[1])
            # print(address)  # 北京
            
            category = str(res_msg[3])
            # print(category)
            
            number = str(res_msg[5])
            # print(number)
    
            information_list = html1.xpath('//table[@class="tablelist textl"]/tr[4]/td/ul//text()')
            # concatenate every text node of the requirements list
            information = ''.join(str(piece) for piece in information_list)
            # print(information)
            
            data = (job_name, address, category, number, information)
            Insert.mysql_op(sql, data)
    
        
if __name__ == '__main__':
    # MySQL helper instance
    Insert = Mysql()
    # SQL statement to execute
    sql = '''INSERT INTO tengxun (job_name, address, category, number, information) VALUES(%s, %s, %s, %s, %s)'''
    
    print('Enter a keyword to crawl:')
    keywords = input()
    txshezhao(keywords, 5)
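
As an aside, the `range(2, 12)` row indexing above can be collapsed into a single XPath query that collects every detail link on a list page at once; rows whose first cell contains no link (such as the header) simply do not match. A minimal sketch under that assumption (the function name is illustrative, not part of the original script):

import requests
from lxml import etree

def list_page_links(url, headers):
    '''Collect all job detail links on one list page with a single XPath query.'''
    html = etree.HTML(requests.get(url, headers=headers).text)
    # only rows whose first cell contains an <a> tag are matched
    return ['https://hr.tencent.com/' + rel
            for rel in html.xpath('//table[@class="tablelist"]/tr/td[1]/a/@href')]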

The complete code for Lagou is as follows:

import requests
from lxml import etree
import pymysql


class Mysql(object):
    '''Wrapper class for database operations'''
    
    def __init__(self):
        '''Connect to the database and create a cursor'''
        self.db = pymysql.connect(host="localhost", user="root", password="8888", database="test")
        self.cursor = self.db.cursor()
    
    def mysql_op(self, sql, data):
        '''Execute a parameterized SQL statement and commit'''
        self.cursor.execute(sql, data)
        self.db.commit()
    
    def __del__(self):
        '''Close the cursor and the connection'''
        self.cursor.close()
        self.db.close()


# MySQL helper instance
Insert = Mysql()
# SQL statement to execute
sql = '''INSERT INTO lagou (company, job_name, salary, adress, jingyan, school,job_des) VALUES(%s, %s, %s, %s, %s, %s, %s)'''

url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E5%8C%97%E4%BA%AC&needAddtionalResult=false'
headers = {
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Connection': 'keep-alive',
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
    'Cookie': 'JSESSIONID=ABAAABAAAGFABEF780FE198208BF21A58749B6B7C26C915; _ga=GA1.2.1321423683.1534510673; _gid=GA1.2.581729554.1534510673; user_trace_token=20180817205757-29e3715f-a21d-11e8-a9f0-5254005c3644; LGUID=20180817205757-29e375b6-a21d-11e8-a9f0-5254005c3644; index_location_city=%E5%8C%97%E4%BA%AC; TG-TRACK-CODE=search_code; X_HTTP_TOKEN=87d99de12746e518d50f2fe7fede59a0; PRE_UTM=; LGSID=20180818000633-829355a0-a237-11e8-a9f0-5254005c3644; PRE_HOST=www.baidu.com; PRE_SITE=https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3D33WZv6WWqh6LDiUr0dWxB6F4E9letiquzVMR10EQdIG%26wd%3D%26eqid%3Dd381bb6900049a8a000000035b76c64a; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1534510675,1534521990,1534522180; LGRID=20180818001050-1bed8a51-a238-11e8-91ae-525400f775ce; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1534522248; SEARCH_ID=444ab1d908b04a32b195b1ac433ef583',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?labelWords=&fromSearch=true&suginput=',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
    'X-Anit-Forge-Code': '0',
    'X-Anit-Forge-Token': 'None',
    'X-Requested-With': 'XMLHttpRequest',
}
for page in range(1, 30):
    form = {
        'first': 'false',
        'pn': page,
        'kd': '資料分析'  # search keyword: "data analysis"
    }
    
    response = requests.post(url, headers=headers, data=form)
    html = response.json()
    for url0 in range(15):  # 15 job postings per result page
       
        info = html["content"]["positionResult"]["result"][url0]["positionId"]  # e.g. 4605300 (int)
       
        url1 = 'https://www.lagou.com/jobs/' + str(info) + '.html'
        # print(url1)
        
        res = requests.get(url1, headers=headers)
        res_html = res.text
        res_element = etree.HTML(res_html)
        # stop when the detail page lacks the expected markup (e.g. the request was blocked)
        if not res_element.xpath('//div[@class="job-name"]/div[1]'):
            break
        company = res_element.xpath('//div[@class="job-name"]/div[1]')[0].text

        job_name = res_element.xpath('//div[@class="job-name"]/span')[0].text

        salary = res_element.xpath('//dd[@class="job_request"]/p/span[1]')[0].text

        adress = res_element.xpath('//dd[@class="job_request"]/p/span[2]')[0].text

        jingyan = res_element.xpath('//dd[@class="job_request"]/p/span[3]')[0].text

        school = res_element.xpath('//dd[@class="job_request"]/p/span[4]')[0].text
        # description = res_element.xpath('//dd[@class="job_bt"]/h3')[0].text
        # print(description)
        
        des_msg = res_element.xpath('//dd[@class="job_bt"]/div//text()')
        # print(des_msg)
        # concatenate every text node of the job description block
        job_des = ''.join(str(piece).strip('\n') for piece in des_msg)
        print(job_des)

        data = (str(company), str(job_name), str(salary), str(adress).strip('/'), str(jingyan).strip('/'), str(school).strip('/'), str(job_des))
        Insert.mysql_op(sql, data)
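
One design note on the `Mysql` wrapper above: relying on `__del__` to close the cursor and connection is fragile, because Python gives no guarantee about when (or whether) the destructor runs. A sketch of a context-manager variant with the same connection parameters, which closes deterministically (the class name `MysqlCtx` is illustrative):

import pymysql

class MysqlCtx(object):
    '''Context-manager variant of the Mysql wrapper: closes deterministically.'''

    def __enter__(self):
        self.db = pymysql.connect(host="localhost", user="root", password="8888", database="test")
        self.cursor = self.db.cursor()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.cursor.close()
        self.db.close()

    def mysql_op(self, sql, data):
        '''Execute a parameterized SQL statement and commit.'''
        self.cursor.execute(sql, data)
        self.db.commit()

Usage would then be `with MysqlCtx() as insert: insert.mysql_op(sql, data)`, so the connection is released even if an exception is raised mid-crawl.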
        

Sites like Lianjia, Lagou, Boss, and so on can be used for learning and practice; please do not crawl too much data.
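
Following that advice, the simplest safeguard is a small randomized pause before every request, so the crawl never hammers the server. A minimal sketch (the function name and delay bounds are arbitrary choices, not part of the original scripts):

import random
import time

import requests

def polite_get(url, headers, min_delay=1.0, max_delay=3.0):
    '''Sleep a random interval before each GET to keep the request rate low.'''
    time.sleep(random.uniform(min_delay, max_delay))
    return requests.get(url, headers=headers, timeout=10)

Swapping the `requests.get(...)` calls in the scripts above for `polite_get(...)` keeps the request rate well below one per second.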