Python: a multithreaded crawler with queue vs. a multiprocess crawler with JoinableQueue
阿新 · Published 2019-01-09
This post walks through two crawlers: a multithreaded one built on queue, and a multiprocess one built on JoinableQueue.
Both use my own CSDN blog as the target (facepalm... does this count as inflating my own page view count? Hahaha).
This is the multithreaded crawler. The code is fairly simple and commented:
```python
# -*- coding: utf-8 -*-
"""ayou"""
import Queue  # Python 2 name; in Python 3 this module is called "queue"
import threading
import time

import requests
from requests.exceptions import HTTPError, ConnectionError
from bs4 import BeautifulSoup


# AyouBlog: get_page_url() collects the URL of every post on the blog
class AyouBlog(object):
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) "
                          "Gecko/20100101 Firefox/48.0",
        }
        self.s = requests.session()

    def get_page_url(self):
        urls_set = set()
        url = "http://blog.csdn.net/u013055678?viewmode=contents"
        try:
            html = self.s.get(url, headers=self.headers)
        except (HTTPError, ConnectionError) as e:
            print(str(e))
            return urls_set  # return the (empty) set, not the error string
        try:
            soup = BeautifulSoup(html.content, "lxml")
            page_div = soup.find_all("span", {"class": "link_title"})
            for link in page_div:
                a_url = "http://blog.csdn.net" + link.find("a").attrs["href"]
                urls_set.add(a_url)
        except AttributeError as e:
            print(str(e))
        return urls_set


# ThreadUrl subclasses Thread; run() pulls URLs off the queue one by one,
# opens each post, and prints the post's title
class ThreadUrl(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.s = requests.session()
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) "
                          "Gecko/20100101 Firefox/48.0",
        }

    def run(self):
        while not self.queue.empty():
            host = self.queue.get()
            try:
                html = self.s.get(host, headers=self.headers)
                soup = BeautifulSoup(html.content, "lxml")
                class_div = soup.find("span", {"class": "link_title"})
                print(class_div.text.strip())
            except (HTTPError, ConnectionError, AttributeError) as e:
                print(str(e))
            finally:
                # task_done() must run even when a request fails,
                # otherwise queue.join() below blocks forever
                self.queue.task_done()


def main():
    # create the queue
    queue = Queue.Queue()
    # put every URL into the queue
    p = AyouBlog()
    for url in p.get_page_url():
        print(url)
        queue.put(url)
    # start the worker threads
    for i in range(7):
        t = ThreadUrl(queue)
        t.setDaemon(True)  # daemon threads die when the main thread exits
        t.start()
    # block until the queue is drained and every item is marked done
    queue.join()


if __name__ == "__main__":
    start = time.time()
    main()
    print("Elapsed Time: %s" % (time.time() - start))
```

Elapsed time: about 3 seconds with 7 threads (the original post showed the output as a screenshot).
Now let's see how long a single thread needs.

Just change range(7) to range(1) in main().

Result: about 11 seconds.

More threads is not automatically faster, though. With this little data, running 4, 5, 6, or 7 threads all lands at roughly the same 3 seconds.
With small amounts of data the code above has no problems. But at work I was crawling 3,000+ pages, over 60,000 records, using two queues and two groups of threads with 20 threads per group. The data all came down, but then the whole process just sat there dead: no error, no normal exit. Yet when I crawl only a few hundred pages, a few thousand records, it exits cleanly without any errors. I don't know what causes this; in the end I could only rewrite it with Scrapy. If you know how to solve it, please tell me, thanks.
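One pattern that makes this kind of shutdown deterministic is a sentinel per worker: instead of polling queue.empty() (which is racy, since another thread can drain the queue between the empty() check and the get()), each worker blocks on get() and exits when it reads the sentinel, and the main thread joins the workers instead of the queue. A minimal sketch, written for Python 3 (where the Queue module is renamed queue); handle is a hypothetical stand-in for the fetch-and-parse work, not a function from the code above:

```python
# Sketch: sentinel-based shutdown for a thread pool (Python 3).
import queue
import threading

SENTINEL = None      # one per worker signals "no more work"
NUM_WORKERS = 7

def worker(q, handle):
    while True:
        item = q.get()       # blocks; no racy empty() check needed
        if item is SENTINEL:
            break            # deterministic, clean exit
        try:
            handle(item)     # hypothetical: fetch the page, print the title
        except Exception as e:
            print(e)         # one bad URL must not kill the worker

def crawl(urls, handle):
    q = queue.Queue()
    for u in urls:
        q.put(u)
    for _ in range(NUM_WORKERS):
        q.put(SENTINEL)      # one sentinel per worker
    threads = [threading.Thread(target=worker, args=(q, handle))
               for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()             # joining the threads replaces queue.join()
```

With this shape no worker can block forever in get() and there is no task_done() bookkeeping to get wrong, which are the two usual ways a queue.join()-based pool ends up hanging.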
Below is the multiprocess crawler code, again with comments:
```python
# -*- coding: utf-8 -*-
"""ayou"""
from multiprocessing import Process, JoinableQueue
import time

import requests
from requests.exceptions import HTTPError, ConnectionError
from bs4 import BeautifulSoup


# AyouBlog: get_page_url() collects the URL of every post on the blog
class AyouBlog(object):
    def __init__(self):
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) "
                          "Gecko/20100101 Firefox/48.0",
        }
        self.s = requests.session()

    def get_page_url(self):
        urls_set = set()
        url = "http://blog.csdn.net/u013055678?viewmode=contents"
        try:
            html = self.s.get(url, headers=self.headers)
        except (HTTPError, ConnectionError) as e:
            print(str(e))
            return urls_set  # return the (empty) set, not the error string
        try:
            soup = BeautifulSoup(html.content, "lxml")
            page_div = soup.find_all("span", {"class": "link_title"})
            for link in page_div:
                a_url = "http://blog.csdn.net" + link.find("a").attrs["href"]
                urls_set.add(a_url)
        except AttributeError as e:
            print(str(e))
        return urls_set


# ThreadUrl subclasses Process this time; run() pulls URLs off the
# JoinableQueue one by one, opens each post, and prints the post's title
class ThreadUrl(Process):
    def __init__(self, queue):
        super(ThreadUrl, self).__init__()
        self.queue = queue
        self.s = requests.session()
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:48.0) "
                          "Gecko/20100101 Firefox/48.0",
        }

    def run(self):
        while not self.queue.empty():
            host = self.queue.get()
            try:
                html = self.s.get(host, headers=self.headers)
                soup = BeautifulSoup(html.content, "lxml")
                class_div = soup.find("span", {"class": "link_title"})
                print(class_div.text.strip())
            except (HTTPError, ConnectionError, AttributeError) as e:
                print(str(e))
            finally:
                # task_done() must run even when a request fails,
                # otherwise queue.join() below blocks forever
                self.queue.task_done()


def main():
    # list of worker processes
    worker_list = list()
    # create the queue
    queue = JoinableQueue()
    # put every URL into the queue
    p = AyouBlog()
    for url in p.get_page_url():
        print(url)
        queue.put(url)
    # start the worker processes
    for i in range(3):
        t = ThreadUrl(queue)
        worker_list.append(t)
        t.start()
    # block until every queued item has been marked done
    queue.join()
    # shut the workers down (is this redundant? see the note at the
    # end of the post)
    for w in worker_list:
        w.terminate()


if __name__ == "__main__":
    start = time.time()
    main()
    print("Elapsed Time: %s" % (time.time() - start))
```

Elapsed time: (screenshot in the original post).
The process version runs faster, but processes carry a much larger overhead than threads.
It's the same problem mentioned above. With small amounts of data the code works fine, but with the 60,000+ records, using two JoinableQueues and two groups of processes with 5 processes per group, the data can all be crawled down, yet afterwards every process just sits there dead: no error, no normal exit. Crawling only a few hundred pages, a few thousand records, exits cleanly with no errors, and I don't know what's going on.

I set breakpoints and couldn't spot anything wrong. I don't know whether the queue and JoinableQueue never actually become empty at the end, or whether they do empty and the threads or processes then keep fighting to run and die there. I can't figure it out.

If anyone knows what's going on and how to fix it, please tell me. Thanks.
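Two documented properties of multiprocessing queues may be relevant here: JoinableQueue.empty() is not reliable across processes (a worker can see the queue as non-empty, lose the race to another worker, and then block in get() forever, which is also why the terminate() loop above is not redundant as the code is written), and queue.join() only returns if every get() is matched by exactly one task_done(). The same sentinel idea as in the threaded sketch sidesteps both. A sketch, not the original code: it uses a plain multiprocessing.Queue, since joining the worker processes replaces JoinableQueue.join(), and the real fetch work is elided into a placeholder print:

```python
# Sketch: sentinel-based shutdown for processes. Workers exit on the
# sentinel, so nothing is left blocked in get() and no terminate() is needed.
from multiprocessing import Process, Queue

SENTINEL = None
NUM_WORKERS = 3

def worker(q):
    while True:
        item = q.get()
        if item is SENTINEL:
            break
        try:
            print("would fetch:", item)   # fetch-and-parse would go here
        except Exception as e:
            print(e)

def main():
    q = Queue()
    for url in ("http://example.com/a", "http://example.com/b"):
        q.put(url)                        # placeholder URLs
    for _ in range(NUM_WORKERS):
        q.put(SENTINEL)                   # one sentinel per worker
    workers = [Process(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()                          # clean exit, no terminate() needed

if __name__ == "__main__":
    main()
```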