Beautiful Soup爬蟲——爬取智聯招聘的資訊並存入資料庫
阿新 • 發佈:2018-12-13
本人目前在校本科萌新…第一次寫有所不足還請見諒
前期準備
智聯招聘網頁 讓我們來搜尋一下python 發現網頁跳轉到這 讓我們看一下原始碼 發現並沒有我們所需要的資料 一開始我不信邪用requests嘗試了一下
# Demo: fetching the NEW Zhaopin search page directly.  The returned HTML
# contains no job data (it is rendered client-side), which is the point
# this snippet demonstrates.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
    'Host': 'sou.zhaopin.com',
    'Referer': 'https://www.zhaopin.com/',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
url = 'https://sou.zhaopin.com/?pageSize=60&jl=530&kw=python&kt=3'
# Renamed 're' -> 'response': the original shadowed the stdlib `re` module.
# A timeout keeps the demo from hanging forever on a stalled connection.
response = requests.get(url, headers=headers, timeout=10)
print(response.text)
# Demo: the ZP_OLD_FLAG cookie switches Zhaopin to the OLD server-rendered
# search page, whose HTML actually contains the job listings.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
    'Host': 'sou.zhaopin.com',
    'Referer': 'https://www.zhaopin.com/',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cookie': 'ZP_OLD_FLAG=true'
}
url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?jl=北京&kw=python&sm=0&p=1'
# Renamed 're' -> 'response': the original shadowed the stdlib `re` module.
response = requests.get(url, headers=headers, timeout=10)
print(response.text)
cookie這裡表示是舊版網頁 發現確實有招聘的資料,這裡就不發截圖了。
程式碼
我用了json儲存了一些變數,方便更改 spider.json
{
"host":"localhost",
"user":"root",
"password":"",
"dbname":"vacation",
"port":3306,
"city":"北京",
"keyword":"python",
"page":90,
"Cookie":"ZP_OLD_FLAG=true;"
}
程式碼
from bs4 import BeautifulSoup
import requests
from requests.exceptions import RequestException
import pymysql
import json

# Load crawler settings (DB credentials, search city/keyword, page count,
# cookie) from spider.json.  `with` closes the file; the original leaked
# the open handle.
with open("spider.json", encoding='utf-8') as config_file:
    setting = json.load(config_file)

host = setting['host']
user = setting['user']
password = setting['password']
dbname = setting['dbname']
port = setting['port']
city = setting['city']
keyword = setting['keyword']
pagenum = setting['page']
Cookie = setting['Cookie']


def get_one_page(city, keyword, page):
    """Fetch one page of the legacy Zhaopin search results.

    Returns the HTML text on HTTP 200, otherwise None (including on any
    request error).
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0',
        'Host': 'sou.zhaopin.com',
        'Referer': 'https://www.zhaopin.com/',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9',
        'Cookie': Cookie  # ZP_OLD_FLAG selects the server-rendered old site
    }
    url = 'https://sou.zhaopin.com/jobs/searchresult.ashx?jl={}&kw={}&sm=0&p={}'.format(city, keyword, page)
    try:
        # timeout added so a stalled connection cannot hang the crawl
        response = requests.get(url, headers=headers, timeout=10)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        return None


def readonepage(html, db):
    """Parse one search-result page and insert each job posting into the DB.

    A failure on one listing is logged and skipped so the rest of the page
    is still processed.
    """
    cur = db.cursor()
    soup = BeautifulSoup(html, 'lxml')
    for cell in soup.find_all('td'):
        try:
            if cell.get('class') != ['zwmc']:
                continue
            jobname = cell.div.a.get_text()   # 崗位名稱 (job title)
            jobhref = cell.div.a.get('href')
            # Links whose 10th character is 'i' (https://xiaoyuan... campus
            # pages) have a different layout.  The original had a no-op
            # `pass` here; `continue` is the evidently intended skip.
            if jobhref[9] == 'i':
                continue
            record = get_detailed(jobhref)
            record.append(jobname)
            print(jobname)
            # Parameterized query: the original interpolated values with
            # '%' into the SQL string, which breaks on quotes in the data
            # and is an SQL-injection hole.
            sql = ("INSERT INTO companyinfo(company_name,work_experience,"
                   "edu_background,salary,describes,work_city,work_address,"
                   "nature,types,scales,url,benefits,station,station_id) "
                   "VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)")
            cur.execute(sql, (*record, counterid()))
            db.commit()
        except Exception as e:
            # Best-effort scraping, but log instead of silently swallowing
            # (the original `pass` hid every DB error).
            print('skip listing:', e)


def counterid(last=[0]):
    """Return a monotonically increasing id, starting at 1.

    NOTE: the mutable default argument is used *intentionally* as
    call-count state shared across invocations.
    """
    last[0] += 1
    return last[0]


def get_detailed(href):
    """Scrape a job-detail page and return the job/company fields as a list.

    Order matters: readonepage appends the job title and inserts the
    values positionally into companyinfo.
    """
    # Defaults so a page missing a section still yields a complete record.
    # (The original raised NameError on missing sections, which the caller
    # silently swallowed, dropping the whole listing.)
    salary = job_city = exp = edu = ''
    comname = scale = nature = company_type = place = website = ''
    descrip = fuli = ''
    res = requests.get(href, timeout=10)
    soup = BeautifulSoup(res.text, 'lxml')
    for ul in soup.find_all('ul'):
        try:
            if ul.get('class') == ['terminal-ul', 'clearfix']:
                parts = ul.get_text().split('\n')
                # 薪水 (salary); join/split strips all embedded whitespace
                salary = "".join(parts[1].split(':')[1].split())
                job_city = parts[2].split(':')[1]   # 城市 (city)
                exp = parts[5].split(':')[1]        # 工作經驗 (experience)
                edu = parts[6].split(':')[1]        # 學歷 (education)
        except Exception as e:
            print(e)
    for div in soup.find_all('div'):
        try:
            div_class = div.get('class')
            if div_class == ['company-box']:
                info = div.get_text().split('\n')
                while '' in info:
                    info.remove('')
                if '檢視公司地圖' in info:
                    info.remove('檢視公司地圖')
                comname = info[0]                       # 公司名稱 (company name)
                scale = info[1].split(':')[1]           # 企業規模 (company size)
                nature = info[2].split(':')[1]          # 民營/國營 (ownership)
                company_type = info[3].split(':')[1]    # 型別 (industry)
                place = info[-1]                        # 具體地址 (address)
                if len(info) == 6:
                    # Some companies list no website.
                    website = ' '
                else:
                    website = info[4].split(':')[1]     # 公司網站 (website)
            if div_class == ['tab-inner-cont'] and div.get('style') is None:
                # 工作需求 (job description), whitespace stripped
                descrip = "".join(div.get_text().split('\n')[1].split())
            if div_class == ['welfare-tab-box']:
                fuli = ''
                for elem in div:
                    fuli = fuli + elem.string + ' '     # 福利 (benefits)
        except Exception as e:
            print(e)
    return [comname, exp, edu, salary, descrip, job_city, place,
            nature, company_type, scale, website, fuli]


def main(city, keyword, pages):
    """Crawl `pages` result pages and store every job posting."""
    db = pymysql.connect(host=host, user=user, password=password,
                         db=dbname, port=port)
    try:
        for page in range(pages):
            html = get_one_page(city, keyword, page)
            # get_one_page returns None on failure; the original passed
            # None straight into BeautifulSoup.
            if html:
                readonepage(html, db)
    finally:
        # Close the connection even if a page blows up (the original
        # leaked it on any uncaught exception).
        db.close()


if __name__ == '__main__':
    main(city, keyword, pagenum)
資料庫結構 資料庫裡儲存的資訊