Part 5: Installing Scrapy, the project directory structure, and starting a spider project
Installing the Scrapy framework pulls in a lot of dependency packages, so it is best installed with pip; here I simply used PyCharm's package installer, searched for scrapy, and installed it that way.
Then enter the virtual environment and create a Scrapy project:
(third_project) [email protected]:~/python_file/python_project/pachong$ scrapy startproject ArticleSpider
New Scrapy project 'ArticleSpider', using template directory '/home/bigni/.virtualenvs/third_project/lib/python3.5/site-packages/scrapy/templates/project', created in:
    /home/bigni/python_file/python_project/pachong/ArticleSpider

You can start your first spider with:
    cd ArticleSpider
    scrapy genspider example example.com
(third_project) [email protected]:~/python_file/python_project/pachong$
I opened the newly created Scrapy project in PyCharm. The directory structure is fairly simple, and parts of it feel a lot like Django.
spiders folder: this is where we write our spider files. A spider is mainly responsible for parsing the response and extracting items or follow-up URLs; each spider handles one specific site or a group of sites.
__init__.py: the package initialization file.
items.py: as the comments in the file explain, this is where we define the attributes of the data we want to scrape. An Item object is a container that holds the scraped data.
middlewares.py: spider middleware. Here we can define methods that process the spider's response input and request output.
pipelines.py: after an item has been collected by the spider, it is handed to the item pipelines. Each pipeline component is an independent class that receives the item, performs some action on it, and decides whether the item continues through the pipeline or is dropped.
settings.py: configuration for Scrapy's components. The settings in this file control the core, extensions, pipelines, and spiders.
Creating a spider from a template:
Much like creating an app in Django, we now create a spider here.
Command:
# Note: this must be run inside the project directory.
# Create a spider named jobbole whose root crawl address is blog.jobbole.com (the Jobbole site).
scrapy genspider jobbole blog.jobbole.com
(third_project) [email protected]:~/python_file/python_project/pachong/ArticleSpider/ArticleSpider$ scrapy genspider jobbole blog.jobbole.com
Created spider 'jobbole' using template 'basic' in module:
  ArticleSpider.spiders.jobbole
You can then see the newly generated file jobbole.py under spiders.
The content of jobbole.py is as follows:
# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    # name of the spider
    name = 'jobbole'
    # domains the spider is allowed to crawl
    allowed_domains = ['blog.jobbole.com']
    # URLs to start crawling from
    start_urls = ['http://blog.jobbole.com/']

    # parsing callback
    def parse(self, response):
        pass
Run "scrapy crawl jobbole" in the terminal to test it, where jobbole is the spider name:
(third_project) [email protected]:~/python_file/python_project/pachong/ArticleSpider/ArticleSpider$ scrapy crawl jobbole
2017-08-27 22:24:21 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: ArticleSpider)
2017-08-27 22:24:21 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'ArticleSpider', 'NEWSPIDER_MODULE': 'ArticleSpider.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['ArticleSpider.spiders']}
2017-08-27 22:24:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.telnet.TelnetConsole']
2017-08-27 22:24:21 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-08-27 22:24:21 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-08-27 22:24:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-08-27 22:24:21 [scrapy.core.engine] INFO: Spider opened
2017-08-27 22:24:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-27 22:24:21 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-27 22:24:21 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://blog.jobbole.com/robots.txt> (referer: None)
2017-08-27 22:24:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://blog.jobbole.com/> (referer: None)
2017-08-27 22:24:26 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-27 22:24:26 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 438,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 22537,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 27, 14, 24, 26, 588459),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'memusage/max': 50860032,
 'memusage/startup': 50860032,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 8, 27, 14, 24, 21, 136475)}
2017-08-27 22:24:26 [scrapy.core.engine] INFO: Spider closed (finished)
(third_project) [email protected]:~/python_file/python_project/pachong/ArticleSpider/ArticleSpider$
To debug Scrapy's execution flow in PyCharm, create a Python file inside the project, add the project path to sys.path, and call the execute method with a list of arguments; the effect is the same as running the command in the terminal above.
PS: in the settings file it is recommended to disable robots.txt compliance, otherwise Scrapy will filter out certain URLs:
# Obey robots.txt rules
ROBOTSTXT_OBEY = False