
Distributed crawling with Celery

The example crawls Douban's fiction ("小說") listings.
First start Redis, then create a file named crawl_douban.py:

import requests
from bs4 import BeautifulSoup
import time
from celery import Celery
import redis
from configparser import ConfigParser

cp = ConfigParser()
cp.read('config')

# Read the Redis connection settings from the config file
db_host = cp.get(section='redis', option='db_host')
db_port = cp.getint('redis', 'db_port')
db_pwd = cp['redis']['db_pwd']

# Redis connection used to store the crawl results
pool = redis.ConnectionPool(host=db_host, port=db_port, db=15, password=db_pwd)
r = redis.StrictRedis(connection_pool=pool)
set_name = 'crawl:douban'

app = Celery('crawl',
             include=['crawl_douban'],
             broker='redis://:{}@{}:{}/12'.format(db_pwd, db_host, db_port),
             backend='redis://:{}@{}:{}/13'.format(db_pwd, db_host, db_port))

# JSON is the officially recommended message serialization format
app.conf.update(
    CELERY_TIMEZONE='Asia/Shanghai',
    CELERY_ENABLE_UTC=True,
    CELERY_ACCEPT_CONTENT=['json'],
    CELERY_TASK_SERIALIZER='json',
    CELERY_RESULT_SERIALIZER='json',
)

headers = {
    'User-Agent': '',
}

@app.task
def crawl(url):
    res = requests.get(url, headers=headers)
    # Be polite: wait 2 seconds between requests
    time.sleep(2)
    soup = BeautifulSoup(res.text, 'lxml')
    items = soup.select('.subject-list .subject-item .info h2 a')
    titles = [item['title'] for item in items]
    # Store the page URL, the book titles and a timestamp in the Redis set
    r.sadd(set_name, str((url, titles, time.time())))
    print(titles)
    return (url, titles)
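The script expects an INI-style file named config next to it. A minimal sketch of what that file might contain (the host, port and password values below are placeholders, not from the original post):

[redis]
db_host = 127.0.0.1
db_port = 6379
db_pwd = yourpassword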

Deploy the script above to two hosts, A and B, then run the following command on each:

celery -A crawl_douban worker -l info
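If you want to tell the two workers apart in the logs, you can optionally give each one its own node name with -n (the name workerA below is just a placeholder):

celery -A crawl_douban worker -l info -n workerA@%h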

On a third machine C, create a file task_dispatcher.py that dispatches the tasks asynchronously:

from crawl_douban import app
from crawl_douban import crawl

def manage_crawl(urls):
    for url in urls:
        app.send_task('crawl_douban.crawl', args=(url,))
        # Equivalent to crawl.apply_async(args=(url,)) or crawl.delay(url)

if __name__ == '__main__':
    start_url = 'https://book.douban.com/tag/小說'
    # Crawl 10 pages, 20 books per page
    url_list = ['{}?start={}&type=T'.format(start_url, page * 20) for page in range(10)]
    manage_crawl(url_list)
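To confirm that the workers actually stored something, the entries can be read back out of Redis. This is a minimal sketch that reuses the r connection and set_name defined in crawl_douban.py; the file name check_results.py is only an assumption for illustration:

# check_results.py (hypothetical helper, not part of the original post)
from crawl_douban import r, set_name

# Each member is the stringified (url, titles, timestamp) tuple written by crawl()
for entry in r.smembers(set_name):
    print(entry.decode('utf-8'))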

Running task_dispatcher.py finishes in 2.8 s.

A closer look at the worker command and its options:

celery worker -A tasks --loglevel=info --concurrency=5

  1. "-A" points Celery at the module that holds the application instance.
  2. "--loglevel" sets the log level; it can be omitted and defaults to warning.
  3. "--concurrency" sets the maximum number of worker processes and defaults to the number of CPU cores.

To keep the worker running unattended, it can be managed with Supervisor using a program section such as:
[program:celery]
command=celery worker -A tasks --loglevel=info --concurrency=5
directory=/home/user_00/learn
stdout_logfile=/home/user_00/learn/logs/celery.log
autorestart=true
redirect_stderr=true
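Assuming Supervisor is already installed and the snippet above is saved as a config file it can see (for example /etc/supervisor/conf.d/celery.conf; the path is an assumption, adjust to your setup), reload Supervisor and start the worker with:

supervisorctl reread
supervisorctl update
supervisorctl start celery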