
[Crawler mini-program: scraping all Douyu room info] XPath (multiprocessing version)

 

# This program has been tested and works. It is meant to help you understand
# crawler basics; criticism and corrections are welcome.
import requests
from lxml import etree
from multiprocessing import JoinableQueue as Queue
from multiprocessing import Process

"""Crawl target: http://www.qiushibaike.com/8hr/page/1
   Implemented with multiple processes.
"""


class QiuShi:
    def __init__(self):
        # Base url and request headers (headers must be a dict so that
        # requests sends a proper User-Agent)
        self.base_url = 'http://www.qiushibaike.com/8hr/page/{}'
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                          '(KHTML, like Gecko) Chrome/70.0.3538.67 Safari/537.36'
        }

        # Queues used to pass data between the worker processes
        self.url_queue = Queue()
        self.request_queue = Queue()
        self.html_queue = Queue()

    def get_url_list(self):
        """Build all the page urls and queue them."""
        for i in range(1, 14):
            target_url = self.base_url.format(i)
            print(target_url)
            # Each put() increments the queue's unfinished-task counter
            self.url_queue.put(target_url)

    def request_url(self):
        """Fetch each queued url."""
        while True:
            target_url = self.url_queue.get()
            response = requests.get(target_url, headers=self.headers)
            print(response)
            self.request_queue.put(response)
            self.url_queue.task_done()

    def get_content(self):
        """Extract the data from each response."""
        while True:
            html_text = self.request_queue.get().content.decode()
            html = etree.HTML(html_text)
            div_list = html.xpath('//div[@id="content-left"]/div')
            content_list = []
            for div in div_list:
                item = {}
                item['author'] = div.xpath('.//h2/text()')[0].strip()
                item['content'] = div.xpath('.//span/text()')[0].strip()
                print(item)
                content_list.append(item)
            self.html_queue.put(content_list)
            self.request_queue.task_done()

    def save_data(self):
        """Persist the extracted items."""
        while True:
            data_list = self.html_queue.get()
            for data in data_list:
                with open('qiushi.text', 'a+', encoding='utf-8') as f:
                    f.write(str(data))
                    f.write('\n')
            self.html_queue.task_done()

    def main(self):
        # Queue every url first
        self.get_url_list()

        # Collect the worker processes in a list
        process_list = []
        p_request = Process(target=self.request_url)
        process_list.append(p_request)

        p_content = Process(target=self.get_content)
        process_list.append(p_content)

        p_save_data = Process(target=self.save_data)
        process_list.append(p_save_data)

        # Start all the workers
        for process in process_list:
            # Daemonize: when the main process exits, the children are
            # terminated whether or not their work is finished
            process.daemon = True
            process.start()

        # Block until every queued task has been marked done, then let the
        # main process exit (like a manager who can only go home after all
        # of the staff have left)
        for director in [self.url_queue, self.request_queue, self.html_queue]:
            director.join()


if __name__ == '__main__':
    qiushi = QiuShi()
    qiushi.main()
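
The get_content step is the heart of the parsing logic. Below is a minimal, self-contained sketch of the same XPath extraction run against a made-up HTML snippet (the real page markup may differ), which lets you verify the expressions without hitting the site:

from lxml import etree

# Hypothetical snippet mimicking the structure the XPath expressions expect
sample = '''
<div id="content-left">
  <div><h2> author-1 </h2><span> joke text 1 </span></div>
  <div><h2> author-2 </h2><span> joke text 2 </span></div>
</div>
'''

html = etree.HTML(sample)
for div in html.xpath('//div[@id="content-left"]/div'):
    item = {
        'author': div.xpath('.//h2/text()')[0].strip(),
        'content': div.xpath('.//span/text()')[0].strip(),
    }
    print(item)  # {'author': 'author-1', 'content': 'joke text 1'} ...

The leading dot in './/h2' is what scopes each sub-query to the current div rather than to the whole document.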
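
Why the program exits cleanly hinges on the combination of daemon processes and JoinableQueue. Here is a stripped-down sketch of that coordination pattern, with a toy worker and toy data for illustration only:

from multiprocessing import JoinableQueue, Process

def worker(q):
    # Daemon worker: loops forever; task_done() tells the queue
    # that one more item has been fully processed
    while True:
        item = q.get()
        print('processed', item)
        q.task_done()

if __name__ == '__main__':
    q = JoinableQueue()
    for i in range(5):
        q.put(i)  # each put() increments the unfinished-task counter

    p = Process(target=worker, args=(q,))
    p.daemon = True   # killed automatically when the main process exits
    p.start()

    q.join()          # blocks until task_done() was called for every item
    # All queued work is now finished; exiting here kills the daemon worker.

Each put() increments an unfinished-task counter and each task_done() decrements it; q.join() returns only when the counter reaches zero, at which point the main process exits and takes the daemon workers with it. That is exactly why main() above joins the three queues instead of the processes, whose while True loops would otherwise never return.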