Example: Crawling Qiushibaike with Multiple Threads

阿新 • Published: 2018-09-05


# Thread library
import threading
# Thread-safe FIFO queue
from queue import Queue
# HTML parsing
from lxml import etree
# HTTP requests
import requests
# Pauses between requests
import time


class ThreadCrawl(threading.Thread):
    """Fetch listing pages and put their HTML on the data queue."""

    def __init__(self, thread_name, page_queue, data_queue):
        # Call the parent initializer
        # (equivalent to threading.Thread.__init__(self))
        super(ThreadCrawl, self).__init__()
        # Thread name
        self.thread_name = thread_name
        # Queue of page numbers
        self.page_queue = page_queue
        # Queue of fetched HTML
        self.data_queue = data_queue
        # Request headers
        self.headers = {"User-Agent": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"}

    def run(self):
        print("Starting " + self.thread_name)
        while not CRAWL_EXIT:
            try:
                # Take one page number, first in, first out.
                # get() has an optional block parameter that defaults to True:
                # 1. Queue empty and block=True: get() blocks until a new
                #    item arrives.
                # 2. Queue empty and block=False: get() raises queue.Empty
                #    immediately.
                page = self.page_queue.get(False)
                url = "http://www.qiushibaike.com/8hr/page/" + str(page) + "/"
                # print(url)
                content = requests.get(url, headers=self.headers).text
                time.sleep(1)
                self.data_queue.put(content)
                # print(len(content))
            except Exception:
                # queue.Empty (no pages left) or a failed request. A failed
                # page is skipped; the loop continues until CRAWL_EXIT is set.
                time.sleep(0.1)
        print("Exiting " + self.thread_name)

class ThreadParse(threading.Thread):
    """Take HTML off the data queue and extract the fields of each post."""

    def __init__(self, thread_name, data_queue):
        super(ThreadParse, self).__init__()
        # Thread name
        self.thread_name = thread_name
        # Queue of fetched HTML
        self.data_queue = data_queue

    def run(self):
        print("Starting " + self.thread_name)
        while not PARSE_EXIT:
            try:
                html = self.data_queue.get(False)
                self.parse(html)
            except Exception:
                # queue.Empty (nothing to parse yet) or a page that failed to
                # parse; keep polling until PARSE_EXIT is set.
                time.sleep(0.1)
        print("Exiting " + self.thread_name)

    def parse(self, html):
        selector = etree.HTML(html)

        # Select every post node. The XPath contains() function does a
        # substring match: the first argument is the value to test (here the
        # id attribute) and the second is the substring it must contain.
        # Each node holds one complete post (user name, text, votes, comments, etc.).
        node_list = selector.xpath('//div[contains(@id,"qiushi_tag_")]')

        for node in node_list:
            # User name: read the element's content through .text
            user_name = node.xpath('./div[@class="author clearfix"]//h2')[0].text

            # Post text. The leading dot is required; without it the
            # expression is matched against the whole page again.
            # Note: a <br> inside the span causes no trouble in a browser
            # XPath plugin, but in code the <br> elements are part of the
            # matched node, and .text returns only the text before the first one.
            duanzi_info = node.xpath('.//div[@class="content"]/span')[0].text.strip()

            # Vote count
            vote_num = node.xpath('.//span[@class="stats-vote"]/i')[0].text

            # Comment count
            comment_num = node.xpath('.//span[@class="stats-comments"]//i')[0].text

            # Image link: the expression selects the value of the src
            # attribute itself, so .text is not needed
            img_url = node.xpath('.//div[@class="thumb"]//@src')
            if len(img_url) > 0:
                img_url = img_url[0]
            else:
                img_url = "no image"

            self.save_info(user_name, duanzi_info, vote_num, comment_num, img_url)

    def save_info(self, user_name, duanzi_info, vote_num, comment_num, img_url):
        """Collect the fields of one post into a dict and print it."""
        item = {
            "username": user_name,
            "content": duanzi_info,
            "zan": vote_num,
            "comment": comment_num,
            "image_url": img_url
        }

        print(item)


# Exit flags polled by the worker threads
CRAWL_EXIT = False
PARSE_EXIT = False


def main():
    # Queue of page numbers, sized for 20 pages
    pageQueue = Queue(20)
    # Enqueue the numbers 1 to 20, first in, first out
    for i in range(1, 21):
        pageQueue.put(i)

    # Queue of crawl results (the HTML source of each page);
    # no argument means the size is unbounded
    dataQueue = Queue()

    # Output file and lock. save_info() currently only prints each item,
    # so the worker threads do not use either of them yet.
    output_file = open("duanzi.json", "a")
    lock = threading.Lock()

    # Names of the three crawl threads
    crawlList = ["Crawl thread 1", "Crawl thread 2", "Crawl thread 3"]
    # The three crawl threads
    thread_crawl = []
    for threadName in crawlList:
        thread = ThreadCrawl(threadName, pageQueue, dataQueue)
        thread.start()
        thread_crawl.append(thread)

    # Names of the three parse threads
    parseList = ["Parse thread 1", "Parse thread 2", "Parse thread 3"]
    # The three parse threads
    thread_parse = []
    for threadName in parseList:
        thread = ThreadParse(threadName, dataQueue)
        thread.start()
        thread_parse.append(thread)

    # Wait until pageQueue is empty, that is, until every page number
    # has been taken by a crawl thread
    while not pageQueue.empty():
        time.sleep(0.1)

    # pageQueue is empty, so let the crawl threads leave their loops
    global CRAWL_EXIT
    CRAWL_EXIT = True

    print("pageQueue is empty")

    for thread in thread_crawl:
        thread.join()

    # Wait until every fetched page has been taken by a parse thread
    while not dataQueue.empty():
        time.sleep(0.1)

    global PARSE_EXIT
    PARSE_EXIT = True

    for thread in thread_parse:
        thread.join()

    output_file.close()


if __name__ == "__main__":
    main()
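The comments in ThreadCrawl.run describe how the block parameter of Queue.get behaves. The following is a minimal sketch of that behavior using only the standard library; the helper name drain and the sample page numbers are made up for illustration and are not part of the crawler above.

```python
from queue import Queue, Empty


def drain(q):
    """Collect every item currently in q without ever blocking."""
    items = []
    while True:
        try:
            # block=False: raise Empty at once if nothing is available
            items.append(q.get(block=False))
        except Empty:
            return items


page_queue = Queue()
for page in range(1, 4):
    page_queue.put(page)

pages = drain(page_queue)

# block=True with a timeout waits up to that long, then raises Empty.
# Without a timeout, a blocking get() on an empty queue waits indefinitely.
try:
    page_queue.get(block=True, timeout=0.1)
    timed_out = False
except Empty:
    timed_out = True

print(pages, timed_out)
```

Catching Empty by name, rather than every exception, keeps a genuine error such as a failed request from being mistaken for an empty queue.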

The crawl output looks like this:

[Image: crawl output]
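The script opens duanzi.json and creates a lock, but save_info only prints each item. One way the items could be written to the file instead is sketched below, as an assumption rather than the author's method: the function save_item, its parameters, and the sample records are hypothetical, and an in-memory buffer stands in for the real file so the sketch runs anywhere.

```python
import io
import json
import threading


def save_item(item, out, lock):
    """Write one item as a JSON line; the lock keeps lines from interleaving."""
    line = json.dumps(item, ensure_ascii=False) + "\n"
    with lock:
        out.write(line)


# Stand-in for open("duanzi.json", "a", encoding="utf-8")
out = io.StringIO()
lock = threading.Lock()


def worker(n):
    # Hypothetical record shaped like the dict built in save_info
    save_item({"username": "user%d" % n, "zan": str(n)}, out, lock)


threads = [threading.Thread(target=worker, args=(n,)) for n in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

records = [json.loads(line) for line in out.getvalue().splitlines()]
print(len(records))
```

Writing one JSON object per line means the file can be appended to across runs and read back line by line, which a single JSON array would not allow.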

If you share my interests, feel free to add me as a friend so we can exchange ideas!
