
Web scraper for 古詩文網 (gushiwen.org): queue, multithreading, lock, regex, XPath



import requests
from queue import Queue
import threading
from lxml import etree
import re
import csv


class Producer(threading.Thread):
    """Takes page URLs from page_queue, scrapes each list page and
    pushes (title, content, href) tuples onto poem_queue."""

    headers = {
        'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/68.0.3440.75 Safari/537.36'
    }

    def __init__(self, page_queue, poem_queue, *args, **kwargs):
        super(Producer, self).__init__(*args, **kwargs)
        self.page_queue = page_queue
        self.poem_queue = poem_queue

    def run(self):
        # Keep fetching until the page queue is exhausted.
        while True:
            if self.page_queue.empty():
                break
            url = self.page_queue.get()
            self.parse_html(url)

    def parse_html(self, url):
        response = requests.get(url, headers=self.headers)
        response.raise_for_status()
        html = response.text
        html_element = etree.HTML(html)
        # Each poem block: title in <b>, body in div.contson,
        # detail-page link in the first <p> of div.cont.
        titles = html_element.xpath('//div[@class="cont"]//b/text()')
        contents = html_element.xpath('//div[@class="contson"]')
        hrefs = html_element.xpath('//div[@class="cont"]/p[1]/a/@href')
        for index, content in enumerate(contents):
            title = titles[index]
            # Serialize the element back to a string, then strip tags,
            # newlines and full-width spaces with regular expressions.
            content = etree.tostring(content, encoding='utf-8').decode('utf-8')
            content = re.sub(r'<.*?>|\n', '', content)
            content = re.sub(r'\u3000\u3000', '', content)
            content = content.strip()
            href = hrefs[index]
            self.poem_queue.put((title, content, href))


class Consumer(threading.Thread):
    """Takes poems from poem_queue and writes them to the CSV file,
    guarding the shared writer with a lock."""

    def __init__(self, poem_queue, writer, gLock, *args, **kwargs):
        super(Consumer, self).__init__(*args, **kwargs)
        self.writer = writer
        self.poem_queue = poem_queue
        self.lock = gLock

    def run(self):
        while True:
            try:
                # If no poem arrives within 20 seconds, assume the
                # producers are done and exit.
                title, content, href = self.poem_queue.get(timeout=20)
                self.lock.acquire()
                self.writer.writerow((title, content, href))
                self.lock.release()
            except Exception:
                break


def main():
    page_queue = Queue(100)
    poem_queue = Queue(500)
    gLock = threading.Lock()
    fp = open('poem.csv', 'a', newline='', encoding='utf-8')
    writer = csv.writer(fp)
    writer.writerow(('title', 'content', 'href'))

    for x in range(1, 100):
        url = 'https://www.gushiwen.org/shiwen/default.aspx?page=%d&type=0&id=0' % x
        page_queue.put(url)

    for x in range(5):
        t = Producer(page_queue, poem_queue)
        t.start()

    for x in range(5):
        t = Consumer(poem_queue, writer, gLock)
        t.start()


if __name__ == '__main__':
    main()
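Note that main() above starts the producer and consumer threads and returns without waiting for them, so poem.csv is never explicitly closed. A minimal sketch of an alternative main(), assuming the same Producer and Consumer classes and the same imports, that joins every thread and only then closes the file:

def main():
    page_queue = Queue(100)
    poem_queue = Queue(500)
    gLock = threading.Lock()
    fp = open('poem.csv', 'a', newline='', encoding='utf-8')
    writer = csv.writer(fp)
    writer.writerow(('title', 'content', 'href'))

    for x in range(1, 100):
        page_queue.put('https://www.gushiwen.org/shiwen/default.aspx?page=%d&type=0&id=0' % x)

    # Keep references to the threads so they can be joined later.
    producers = [Producer(page_queue, poem_queue) for _ in range(5)]
    consumers = [Consumer(poem_queue, writer, gLock) for _ in range(5)]
    for t in producers + consumers:
        t.start()
    for t in producers + consumers:
        t.join()   # block until every thread has exited
    fp.close()     # flush and close poem.csv only after all writes are done

Joining the consumers works here because each one exits on the 20-second get() timeout once the producers stop feeding the queue.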

Run result
