
Python multithreading: Qiushibaike crawler example


For the requirements of this case, refer to the previous single-process Qiushibaike example.

Queue (queue object)

Queue is part of Python's standard library and can be used directly with `import Queue`. A queue is the most common way to exchange data between threads.

Thoughts on multithreading in Python

Locking shared resources is an important step, because Python's built-in containers such as list and dict are not thread safe. Queue, however, is thread safe, so whenever it fits the use case a queue is the recommended choice.
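As a quick illustration of why the queue is recommended, here is a minimal sketch (assuming the Python 2 Queue module used throughout this article) of a producer thread and a consumer thread exchanging data through a thread-safe Queue, with no explicit lock needed around the queue itself:

# -*- coding:utf-8 -*-
# Minimal sketch: two threads exchange data through a thread-safe Queue
from Queue import Queue
import threading

q = Queue(maxsize=5)

def producer():
    for i in range(10):
        q.put(i)        # put() blocks while the queue is full
    q.put(None)         # sentinel: tell the consumer to stop

def consumer():
    while True:
        item = q.get()  # get() blocks until an item is available
        if item is None:
            break
        print 'consumed', item

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start()
t2.start()
t1.join()
t2.join()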

  1. Initialization: class Queue.Queue(maxsize) creates a FIFO (first in, first out) queue

  2. Common methods in the module (a small usage sketch follows this list):

    • Queue.qsize() returns the current size of the queue

    • Queue.empty() returns True if the queue is empty, otherwise False

    • Queue.full() returns True if the queue is full, otherwise False

    • Queue.full corresponds to the maxsize setting

    • Queue.get([block[, timeout]]) removes and returns an item from the queue; timeout is how long to wait

  3. Create a "queue" object

    • import Queue
    • myqueue = Queue.Queue(maxsize = 10)
  4. Put a value into the queue

    • myqueue.put(10)
  5. Take a value out of the queue

    • myqueue.get()
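Putting items 1-5 together, a short sketch (again assuming the Python 2 Queue module) that exercises these methods might look like this:

# -*- coding:utf-8 -*-
# Exercising the Queue methods listed above
from Queue import Queue, Empty

myqueue = Queue(maxsize=10)

myqueue.put(10)          # put a value into the queue
print myqueue.qsize()    # -> 1
print myqueue.empty()    # -> False
print myqueue.full()     # -> False (full() is tied to maxsize)

print myqueue.get()      # -> 10, removed from the head of the queue

try:
    # block=True with a timeout: wait up to 1 second, then raise Empty
    myqueue.get(True, 1)
except Empty:
    print 'queue still empty after waiting 1 second'

try:
    # block=False: raise Empty immediately if nothing is available
    myqueue.get(False)
except Empty:
    print 'queue is empty, not waiting'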

Multithreading diagram


# -*- coding:utf-8 -*-
import requests
from lxml import etree
from Queue import Queue
import threading
import time
import json


class thread_crawl(threading.Thread):
    '''
    Crawler thread class
    '''

    def __init__(self, threadID, q):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.q = q

    def run(self):
        print "Starting " + self.threadID
        self.qiushi_spider()
        print "Exiting ", self.threadID

    def qiushi_spider(self):
        # page = 1
        while True:
            if self.q.empty():
                break
            else:
                page = self.q.get()
                print 'qiushi_spider=', self.threadID, ',page=', str(page)
                url = 'http://www.qiushibaike.com/8hr/page/' + str(page) + '/'
                headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
                    'Accept-Language': 'zh-CN,zh;q=0.8'}
                # Give up after several failed attempts so the loop cannot run forever
                timeout = 4
                while timeout > 0:
                    timeout -= 1
                    try:
                        content = requests.get(url, headers=headers)
                        data_queue.put(content.text)
                        break
                    except Exception, e:
                        print 'qiushi_spider', e
                else:
                    # while/else: runs only if every retry failed without a break
                    print 'timeout', url


class Thread_Parser(threading.Thread):
    '''
    Page parser thread class
    '''

    def __init__(self, threadID, queue, lock, f):
        threading.Thread.__init__(self)
        self.threadID = threadID
        self.queue = queue
        self.lock = lock
        self.f = f

    def run(self):
        print 'starting ', self.threadID
        global total, exitFlag_Parser
        while not exitFlag_Parser:
            try:
                '''
                Queue.get() removes and returns an item from the head of the queue.
                The optional block argument defaults to True. If the queue is empty and
                block is True, get() suspends the calling thread until an item becomes
                available. If the queue is empty and block is False, get() raises the
                Empty exception.
                '''
                item = self.queue.get(False)
                if not item:
                    pass
                self.parse_data(item)
                self.queue.task_done()
                print 'Thread_Parser=', self.threadID, ',total=', total
            except:
                pass
        print ‘Exiting ‘, self.threadID

    def parse_data(self, item):
        '''
        Parse one page of HTML.
        :param item: page content
        :return:
        '''
        global total
        try:
            html = etree.HTML(item)
            result = html.xpath('//div[contains(@id,"qiushi_tag")]')
            for site in result:
                try:
                    imgUrl = site.xpath('.//img/@src')[0]  # first image URL in the post (assumed selector)
                    title = site.xpath('.//h2')[0].text
                    content = site.xpath('.//div[@class="content"]/span')[0].text.strip()
                    vote = None
                    comments = None
                    try:
                        vote = site.xpath('.//i')[0].text
                        comments = site.xpath('.//i')[1].text
                    except:
                        pass
                    result = {
                        'imgUrl': imgUrl,
                        'title': title,
                        'content': content,
                        'vote': vote,
                        'comments': comments,
                    }
                    with self.lock:
                        # print 'write %s' % json.dumps(result)
                        self.f.write(json.dumps(result, ensure_ascii=False).encode('utf-8') + "\n")
                except Exception, e:
                    print 'site in result', e
        except Exception, e:
            print 'parse_data', e
        with self.lock:
            total += 1


data_queue = Queue()
exitFlag_Parser = False
lock = threading.Lock()
total = 0


def main():
    output = open('qiushibaike.json', 'a')

    # Fill the page queue with page numbers 1-10
    pageQueue = Queue(50)
    for page in range(1, 11):
        pageQueue.put(page)

    # Create and start the crawler threads
    crawlthreads = []
    crawlList = ["crawl-1", "crawl-2", "crawl-3"]
    for threadID in crawlList:
        thread = thread_crawl(threadID, pageQueue)
        thread.start()
        crawlthreads.append(thread)

    # Create and start the parser threads
    parserthreads = []
    parserList = ["parser-1", "parser-2", "parser-3"]
    for threadID in parserList:
        thread = Thread_Parser(threadID, data_queue, lock, output)
        thread.start()
        parserthreads.append(thread)

    # Wait until the page queue has been drained
    while not pageQueue.empty():
        pass

    # Wait for all crawler threads to finish
    for t in crawlthreads:
        t.join()

    while not data_queue.empty():
        pass

    # Tell the parser threads it is time to exit
    global exitFlag_Parser
    exitFlag_Parser = True

    for t in parserthreads:
        t.join()

    print "Exiting Main Thread"
    with lock:
        output.close()


if __name__ == '__main__':
    main()
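A note on the shutdown logic: main() busy-waits on pageQueue.empty() and data_queue.empty() and then flips exitFlag_Parser to stop the parser threads. Because the parsers call task_done() for every item they take, data_queue.join() would be a more idiomatic way to wait for the queue to drain, and a short time.sleep() inside the busy-wait loops would avoid burning CPU; the simpler flag-based approach above is kept for clarity.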
