
Python Crawler Series 09: Selenium + Lagou

Scraping Lagou job listings with Selenium. The spider opens the search-result list page, waits for the pager to render, opens each position link in a new browser tab, parses the detail page with lxml, and keeps clicking through pages until the "next page" button is disabled.

```python
from selenium import webdriver
from lxml import etree
import re
import time
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By


class LagouSpider(object):
    driver_path = r"D:\driver\chromedriver.exe"

    def __init__(self):
        self.driver = webdriver.Chrome(executable_path=LagouSpider.driver_path)
        # Search-result list page for the keyword "云计算" (cloud computing)
        self.url = 'https://www.lagou.com/jobs/list_%E4%BA%91%E8%AE%A1%E7%AE%97?labelWords=&fromSearch=true&suginput='
        self.positions = []

    def run(self):
        self.driver.get(self.url)
        while True:
            # Wait for the pager to render before reading the page source
            WebDriverWait(driver=self.driver, timeout=10).until(
                EC.presence_of_element_located(
                    (By.XPATH, "//div[@class='pager_container']/span[last()]")
                )
            )
            source = self.driver.page_source
            self.parse_list_page(source)
            try:
                next_btn = self.driver.find_element_by_xpath(
                    "//div[@class='pager_container']/span[last()]"
                )
                # The last pager span is the "next page" button; once it is
                # disabled we have reached the final page.
                if "pager_next_disabled" in next_btn.get_attribute("class"):
                    break
                else:
                    next_btn.click()
            except Exception:
                # Dump the page source for debugging if the pager is missing
                print(source)
            time.sleep(1)

    def parse_list_page(self, source):
        html = etree.HTML(source)
        links = html.xpath("//a[@class='position_link']/@href")
        for link in links:
            self.request_detail_page(link)
            time.sleep(1)

    def request_detail_page(self, url):
        print(url)
        # Open the detail page in a new tab so the list page keeps its state
        self.driver.execute_script("window.open('%s')" % url)
        self.driver.switch_to.window(self.driver.window_handles[1])
        WebDriverWait(self.driver, timeout=10).until(
            EC.presence_of_element_located(
                (By.XPATH, "//div[@class='job-name']/span[@class='name']")
            )
        )
        source = self.driver.page_source
        self.parse_detail_page(source)
        # Close the detail tab and switch back to the list page
        self.driver.close()
        self.driver.switch_to.window(self.driver.window_handles[0])

    def parse_detail_page(self, source):
        html = etree.HTML(source)
        position_name = html.xpath("//span[@class='name']/text()")[0]
        job_request_spans = html.xpath("//dd[@class='job_request']//span")
        salary = job_request_spans[0].xpath('.//text()')[0].strip()
        city = job_request_spans[1].xpath(".//text()")[0].strip()
        city = re.sub(r"[\s/]", "", city)
        work_years = job_request_spans[2].xpath(".//text()")[0].strip()
        work_years = re.sub(r"[\s/]", "", work_years)
        education = job_request_spans[3].xpath(".//text()")[0].strip()
        education = re.sub(r"[\s/]", "", education)
        desc = "".join(html.xpath("//dd[@class='job_bt']//text()")).strip()
        # xpath() returns a list; take the first match
        company_name = html.xpath("//h2[@class='f1']/text()")[0].strip()
        position = {
            'name': position_name,
            'company_name': company_name,
            'salary': salary,
            'city': city,
            'work_years': work_years,
            'education': education,
            'desc': desc,
        }
        self.positions.append(position)
        print(position)


if __name__ == '__main__':
    spider = LagouSpider()
    spider.run()
```