Threads and Processes: Multithreading in Python

阿新 • Published: 2018-03-03


First, the two concepts need to be told apart.

Process: one run of a program with an independent function over some set of data. First, it is an entity; second, it is a "program in execution".

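Thread: a unit of execution inside a process. One process can contain several threads, and the thread is the basic unit that the CPU schedules.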

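The memory of a process is shared by all of its threads. To keep several threads from modifying the same data at the same time, a "lock" is used: while one thread holds the lock, any other thread that needs it has to wait.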
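The role of the lock can be seen in a minimal sketch. The names `counter`, `counter_lock`, `add_many` and `run_workers` are made up for illustration and do not come from the crawler in this post; the sketch only assumes the standard `threading` module.

```python
import threading

counter = 0                      # shared by every thread in the process
counter_lock = threading.Lock()  # guards updates to counter


def add_many(times):
    """Increment the shared counter `times` times, one locked step at a time."""
    global counter
    for _ in range(times):
        with counter_lock:       # only one thread at a time runs this block
            counter += 1


def run_workers(num_threads=4, times=10000):
    """Start the workers, wait for them, and return the final counter value."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(times,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(run_workers())  # 4 threads x 10000 increments
```

Because every increment happens while the lock is held, no update is lost and the total always equals the number of threads times the number of increments.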

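How is data protected when different threads access it at the same time? This is where one of Python's locks comes in: the GIL (Global Interpreter Lock). To make use of multi-core systems, Python has to support multithreading, and as an interpreted language its interpreter has to be both safe and efficient. Multithreaded programming has well-known problems: the interpreter must keep different threads from operating on its internal shared data at the same time, while still giving user threads as much computing resource as possible. That is why Python has this lock. It is a troublesome trade-off. The lock removes those problems, but in CPython only one thread executes Python bytecode at a time, so threads cannot run Python code in parallel on several cores.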
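The cost of the GIL for CPU-bound work can be observed with a rough timing sketch. The function names and the loop size are illustrative assumptions. On a standard CPython build with the GIL enabled, the two timings are usually close to each other, but the exact numbers depend on the machine and the Python version, so none are shown here.

```python
import threading
import time


def count_down(n):
    """Pure-Python CPU-bound loop; it needs the GIL for every step."""
    while n > 0:
        n -= 1


def time_sequential(n):
    """Run the loop twice, one after the other, and return elapsed seconds."""
    start = time.perf_counter()
    count_down(n)
    count_down(n)
    return time.perf_counter() - start


def time_threaded(n):
    """Run the loop in two threads at once and return elapsed seconds."""
    threads = [threading.Thread(target=count_down, args=(n,)) for _ in range(2)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 2_000_000
    print("sequential:  %.3f s" % time_sequential(n))
    print("two threads: %.3f s" % time_threaded(n))
```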

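Python multithreading is suited to: large amounts of intensive I/O.

Python multiprocessing is suited to: large amounts of intensive parallel computation.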
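The I/O-bound case can be sketched without touching the network. In the example below, `fake_download` is a hypothetical stand-in that sleeps instead of making a request, and `download_all` is an illustrative helper built on the standard `concurrent.futures.ThreadPoolExecutor`. A sleeping thread releases the GIL, much as a thread blocked on a socket does, so the waits overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(page):
    """Stand-in for a network request: sleep instead of fetching anything."""
    time.sleep(0.2)          # the GIL is released while the thread sleeps
    return "page %d" % page


def download_all(pages, workers):
    """Fetch every page with a pool of threads; returns (results, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(fake_download, pages))  # keeps input order
    return results, time.perf_counter() - start


if __name__ == "__main__":
    pages = range(1, 6)
    _, one = download_all(pages, workers=1)
    _, five = download_all(pages, workers=5)
    print("1 worker: %.2f s, 5 workers: %.2f s" % (one, five))
```

With one worker the five waits add up; with five workers they run side by side, which is the effect a crawler benefits from.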

Although Python's multithreading looks of limited use, it can still improve efficiency when applied to a crawler.

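import json
import threading    # thread library
import time
from queue import Queue

import requests
from lxml import etree


class ThreadCrawl(threading.Thread):
    def __init__(self, threadName, pageQueue, dataQueue):
        # Call the parent class initializer
        # (equivalent: super(ThreadCrawl, self).__init__())
        threading.Thread.__init__(self)
        self.threadName = threadName
        self.pageQueue = pageQueue
        self.dataQueue = dataQueue
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) "
                          "AppleWebKit/535.11 (KHTML, like Gecko) "
                          "Chrome/17.0.963.56 Safari/535.11"
        }

    def run(self):
        # Most of this method is omitted in the original post. The loop is
        # reconstructed from the surviving fragment and from how main() uses
        # grasp_exit; the request itself is left as a placeholder.
        while not grasp_exit:
            try:
                # Non-blocking get: raises queue.Empty when no page is waiting
                page = self.pageQueue.get(False)
                # Placeholder: the original fetches page number `page` here
                # (URL not given), presumably with requests and self.headers
                content = ""
                self.dataQueue.put(content)
            except Exception:
                # The original swallows every error at this point
                pass
        print("Finished " + self.threadName)


class ThreadParse(threading.Thread):
    def __init__(self, threadName, dataQueue, filename, lock):
        super(ThreadParse, self).__init__()
        self.threadName = threadName
        self.dataQueue = dataQueue
        self.filename = filename
        self.lock = lock

    def run(self):
        # Omitted in the original post; reconstructed to mirror ThreadCrawl.run
        while not parse_exit:
            try:
                html = self.dataQueue.get(False)
                self.parse(html)
            except Exception:
                pass

    def parse(self, html):
        # The extraction step is omitted in the original post; it builds
        # `items` from html (lxml's etree is imported for this).
        items = {}  # placeholder for the omitted extraction result
        with self.lock:
            # filename is an open text-mode file, so a str is written directly
            self.filename.write(json.dumps(items, ensure_ascii=False) + "\n")


grasp_exit = False
parse_exit = False


def main():
    global grasp_exit, parse_exit

    # Page-number queue with room for 20 entries
    pageQueue = Queue(20)
    # Put the numbers 1-20 into the queue; a Queue is first in, first out
    for i in range(1, 21):
        pageQueue.put(i)

    # Queue for fetched results; no argument means no size limit
    dataQueue = Queue()

    # Text-mode UTF-8 file shared by the parse threads
    filename = open("lagou.json", "a", encoding="utf-8")

    # Create the lock
    lock = threading.Lock()

    # Crawl threads
    graspList = ["crawl thread 1", "crawl thread 2", "crawl thread 3"]
    # Keep references to the threads
    threadcrawl = []
    for threadName in graspList:
        thread = ThreadCrawl(threadName, pageQueue, dataQueue)
        thread.start()
        threadcrawl.append(thread)

    # Parse threads
    parseList = ["parse thread 1", "parse thread 2", "parse thread 3"]
    # Keep references to the threads
    threadparse = []
    for threadName in parseList:
        thread = ThreadParse(threadName, dataQueue, filename, lock)
        thread.start()
        threadparse.append(thread)

    # Wait until every page number has been taken
    while not pageQueue.empty():
        time.sleep(0.1)    # sleep instead of spinning on `pass`

    grasp_exit = True
    print("pageQueue is empty")

    for thread in threadcrawl:
        thread.join()

    # Wait until every fetched page has been handed to a parse thread
    while not dataQueue.empty():
        time.sleep(0.1)

    parse_exit = True

    for thread in threadparse:
        thread.join()
    with lock:
        filename.close()


if __name__ == "__main__":
    main()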
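As a possible variation on the shutdown logic, worker threads can be stopped with a sentinel value placed on the queue instead of global exit flags and polling. This is a sketch of a common alternative, not the approach used in this post; `STOP`, `worker` and `run_pipeline` are hypothetical names, and squaring a number stands in for fetching or parsing.

```python
import threading
from queue import Queue

STOP = None  # sentinel: tells a worker that no more items will arrive


def worker(tasks, results):
    """Take items until the sentinel arrives; get() blocks instead of spinning."""
    while True:
        item = tasks.get()
        if item is STOP:
            break
        results.put(item * item)  # stand-in for fetching or parsing


def run_pipeline(items, num_workers=3):
    """Process items with a few threads and return the sorted results."""
    items = list(items)
    tasks, results = Queue(), Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:             # one sentinel per worker
        tasks.put(STOP)
    for t in threads:
        t.join()
    # Every worker has exited, so one result per item is waiting in the queue
    return sorted(results.get() for _ in items)


if __name__ == "__main__":
    print(run_pipeline(range(1, 6)))
```

Since `get()` blocks until an item is available, no thread burns CPU while waiting, and `join()` returns once each worker has seen its sentinel.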

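The ThreadCrawl and ThreadParse classes above make up a multithreaded crawler written for Lagou (拉勾網). The code is incomplete; see my GitHub for the full version. The result looks like this:

[Image: screenshot of the result]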

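The efficiency gain from multithreading is limited. Later on I will use an asynchronous networking framework such as Scrapy to make the crawler more efficient.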
