
Crawler Study Notes 14: Scraping Jianshu's Social Hotspot Column with Multiple Processes and Storing the Data in MongoDB

   This script scrapes roughly 10,000 pages of Jianshu's Social Hotspot column using multiple processes. Inspecting the site shows that the pages are loaded asynchronously, so the page number can only be inferred from the responses and used to construct the URLs directly. Here is the code:
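As a quick sanity check of that URL construction, the inferred `page` query parameter can simply be formatted into the column's base URL (the range below matches the script that follows):

```python
# The column loads content asynchronously; the page number appears as a
# `page` query parameter, so the full URL list is built by formatting it in.
base = ('https://www.jianshu.com/c/20f7f4031550'
        '?utm_medium=index-collections&utm_source=desktop&page={}')
urls = [base.format(i) for i in range(1, 10000)]

print(len(urls))   # number of page URLs generated
print(urls[0])     # first page
print(urls[-1])    # last page
```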

import requests
from lxml import etree
import pymongo
from multiprocessing import Pool
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36'
}

# The client is created at import time and inherited by the worker processes
# after fork; pymongo recommends one client per process, but this is fine for
# a simple script.
client = pymongo.MongoClient('localhost', 27017)
mydb = client['mydb']
jianshu = mydb['jianshu_2']

num = 0  # note: each worker process keeps its own copy of this counter

def get_jianshu_info(url):
    global num
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    # Each <li> under the note list is one article entry
    infos = selector.xpath('//ul[@class="note-list"]/li')
    for info in infos:
        try:
            author = info.xpath('div/div/a[1]/text()')[0]
            title = info.xpath('div/a/text()')[0]
            abstract = info.xpath('div/p/text()')[0]
            comment = info.xpath('div/div/a[2]/text()')[1].strip()
            like = info.xpath('div/div/span/text()')[0].strip()
            data = {
                'author': author,
                'title': title,
                'abstract': abstract,
                'comment': comment,
                'like': like
            }
            jianshu.insert_one(data)
            num = num + 1
            print('Scraped item no. {}'.format(num))
        except IndexError:
            # Skip entries missing any of the expected fields
            pass

if __name__ == '__main__':
    urls = ['https://www.jianshu.com/c/20f7f4031550?utm_medium=index-collections&utm_source=desktop&page={}'.format(i) for i in range(1, 10000)]
    pool = Pool(processes=8)
    start_time = time.time()
    pool.map(get_jianshu_info, urls)
    pool.close()
    pool.join()
    end_time = time.time()
    print('8-process crawl took:', end_time - start_time)

We can see that the scraped records have been stored in MongoDB:
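A quick way to confirm this from Python is to read a few records back from the same collection (this assumes a local `mongod` listening on port 27017, as in the script above):

```python
import pymongo

# Connect to the same database and collection the crawler wrote to.
client = pymongo.MongoClient('localhost', 27017)
jianshu = client['mydb']['jianshu_2']

print('documents stored:', jianshu.count_documents({}))
# Show a few stored records, hiding the internal _id field.
for doc in jianshu.find({}, {'_id': 0, 'author': 1, 'title': 1}).limit(3):
    print(doc)
```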