
Multi-threaded Crawling of Baidu Baike


  • Preface:
    This is a note from my Evernote; it took me three blog posts to work through this... I'm really still a beginner... Baidu Baike turned out not to be that different from the story site I crawled before: tweak the encoding, debug a little, and the pages come down fine. What I'm crawling is still quite basic, though, and after about 3,000 links the site seemed to start rejecting my requests... The crawl is also slow, so I'll probably have to look into optimization or multiprocessing next; after all, Python's multithreading is not true parallelism (the GIL serializes the threads).
    The code and project this post is based on: http://kexue.fm/archives/4385/

  • Source code:
    #! -*- coding:utf-8 -*-
    import requests as rq
    import re
    import time
    import datetime
    from multiprocessing.dummy import Pool, Queue
    import pymysql
    from urllib import parse
    import html
    from bs4 import BeautifulSoup

    unescape = html.unescape  # used to decode HTML entities in extracted links

    tasks = Queue()
    tasks_pass = set()  # links that have already been queued
    tasks.put('http://baike.baidu.com/item/科學')
    count = 0  # total number of pages crawled

    url_split_re = re.compile(r'&|\+')
    def clean_url(url):
        url = parse.urlparse(url)
        return url_split_re.split(parse.urlunparse((url.scheme, url.netloc, url.path, '', '', '')))[0]

    def main():
        global count, tasks_pass
        while True:
            url = tasks.get()  # pop one url (it is removed from the queue)
            web = rq.get(url).content.decode('utf8', 'ignore')
            urls = re.findall(u'href="(/item/.*?)"', web)  # find all in-site links
            for u in urls:
                u = unescape(u)  # decode entities such as &amp; inside the href
                u = 'http://baike.baidu.com' + u
                u = clean_url(u)
                if u not in tasks_pass:  # queue links we have not seen yet
                    tasks.put(u)
                    tasks_pass.add(u)
                    web1 = rq.get(u).content.decode('utf8', 'ignore')
                    bsObj = BeautifulSoup(web1, "lxml")
                    text = bsObj.title.get_text()
                    print(datetime.datetime.now(), u, text)
                    db = pymysql.connect("localhost", "testuser", "test123", "TESTDB", charset='utf8')
                    dbc = db.cursor()
                    sql = "insert ignore into baidubaike(url,title) values(%s,%s);"
                    data = (u, text)
                    dbc.execute(sql, data)
                    dbc.close()
                    db.commit()
                    count += 1
                    if count % 100 == 0:
                        print(u'%s done.' % count)

    pool = Pool(4, main)  # multi-threaded crawl; 4 is the number of threads
    total = 0
    while True:  # if no new link appears within 60 seconds, end the script
        time.sleep(60)
        if len(tasks_pass) > total:
            total = len(tasks_pass)
        else:
            break

    pool.terminate()
    print("terminated normally")
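    To see what clean_url actually normalizes, here is a small usage sketch (the query string in the example URL is made up): it drops the query and fragment, then truncates at the first '&' or '+', so near-duplicate links collapse to a single key.

        from urllib import parse
        import re

        url_split_re = re.compile(r'&|\+')
        def clean_url(url):
            url = parse.urlparse(url)
            return url_split_re.split(parse.urlunparse((url.scheme, url.netloc, url.path, '', '', '')))[0]

        print(clean_url('http://baike.baidu.com/item/foo?from=search#ref1'))
        # -> http://baike.baidu.com/item/foo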

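    A note on Pool(4, main): multiprocessing.dummy exposes the multiprocessing API on top of threads, and the second positional argument of Pool is an initializer that each worker runs once, so that single line starts four threads which all loop forever inside main(). A minimal sketch of the same pattern (the names here are illustrative):

        from multiprocessing.dummy import Pool, Queue
        import time

        q = Queue()
        for i in range(8):
            q.put(i)

        def worker():
            while True:          # like main() above, each thread loops forever
                print('got', q.get())

        pool = Pool(4, worker)   # start 4 threads; each runs worker() once
        time.sleep(1)            # give them time to drain the queue
        pool.terminate()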

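    The script also assumes that the baidubaike(url, title) table already exists in TESTDB; note that insert ignore only skips duplicates when there is a unique key on url. A minimal setup sketch, with column sizes that are my assumption rather than from the source:

        import pymysql

        db = pymysql.connect("localhost", "testuser", "test123", "TESTDB", charset='utf8')
        dbc = db.cursor()
        dbc.execute("""
            create table if not exists baidubaike (
                url   varchar(255) not null primary key,  -- unique key makes `insert ignore` skip repeats
                title varchar(255)
            ) default charset=utf8
        """)
        db.commit()
        dbc.close()
        db.close()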
  • BUG:
    raise RemoteDisconnected("Remote end closed connection without"
    http.client.RemoteDisconnected: Remote end closed connection without response

    The problem is that the requests did not disguise their request headers (no browser User-Agent), so the server closed the connection.
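    A minimal sketch of the fix, under the assumption that every rq.get in the script gets the same treatment (the User-Agent string below is just an example):

        import requests as rq

        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}  # example browser UA
        url = 'http://baike.baidu.com/item/科學'
        web = rq.get(url, headers=headers).content.decode('utf8', 'ignore')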

    Source: http://blog.csdn.net/u013424864/article/details/60778031


