Crawling 58同城 (58.com) second-hand housing listings with Scrapy: problems and countermeasures
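Test environment:
Windows 10, single-machine crawl, Scrapy 1.5.0, Python 3.6.4, MongoDB, Robo 3T
Other preparation:
Proxy pool: for this test I did not use the Flask service I had built for fetching proxies, because the few free sites I found did not yield enough working IPs. Instead I bulk-fetched 800+ free HTTPS proxies from xxx, tested them against the 58同城 site with a thread pool, saved the working IPs to a JSON text file, and added a proxy middleware to the Scrapy code that picks a random proxy from the JSON file for each request.
Request headers: I collected User-Agent strings from various sites and added a UserAgent middleware to the Scrapy code that picks a random User-Agent for each request.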
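The proxy check described above can be sketched roughly as follows. This is only a minimal sketch, not the code used for the post: the function names, the `proxies.json` file name, the probe URL and the assumption that each proxy is stored as a full URL such as `http://host:port` are all illustrative.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Listing page that appears in the logs of this post; used here as the probe target.
TEST_URL = "https://bj.58.com/ershoufang/"


def proxy_works(proxy_url, url=TEST_URL, timeout=10):
    """Return True if `url` loads through `proxy_url` without ending up on the firewall."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(url, timeout=timeout) as resp:
            # urllib follows redirects, so also reject a final URL on the firewall.
            return resp.status == 200 and "firewall" not in resp.geturl()
    except Exception:
        # Timeouts, refused connections, tunnel errors: all count as "not usable".
        return False


def filter_proxies(candidates, check=proxy_works, max_workers=50):
    """Run `check` on every candidate in a thread pool and keep those that pass."""
    candidates = list(candidates)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, candidates))
    return [proxy for proxy, ok in zip(candidates, results) if ok]


def save_proxies(proxies, path="proxies.json"):
    """Write the surviving proxies to a JSON file for the proxy middleware to read."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(proxies, f, indent=2)
```

Passing the check as a parameter keeps the thread-pool part testable without network access; a real run would call `save_proxies(filter_proxies(all_candidates))`.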
settings.py:
BOT_NAME = 'oldHouse'
SPIDER_MODULES = ['oldHouse.spiders']
NEWSPIDER_MODULE = 'oldHouse.spiders'
ROBOTSTXT_OBEY = False
DOWNLOAD_DELAY = 1
RETRY_TIMES = 8
MONGO_URI = 'localhost'
MONGO_DATABASE = 'old58House'
ITEM_PIPELINES = {
    'oldHouse.pipelines.MongoPipeline': 300,
}
DOWNLOADER_MIDDLEWARES = {
    'oldHouse.middlewares.OldhouseDownloaderMiddleware': 543,
    'oldHouse.middlewares.MyProxyMiddleWare': 542,
    'oldHouse.middlewares.MyUserAgentMiddleWare': 541,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': None,
    'oldHouse.middlewares.MyRedirectMiddleware': 601,
    'oldHouse.middlewares.MyRetryMiddleware': 551,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
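The settings register MyProxyMiddleWare and MyUserAgentMiddleWare, but their code is not shown in this post. A minimal sketch of what such middlewares might look like is given here. The setting names PROXY_FILE and USER_AGENT_LIST are hypothetical, and the JSON file is assumed to hold a list of full proxy URLs.

```python
import json
import random


class MyProxyMiddleWare:
    """Downloader middleware: attach a random proxy to every outgoing request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_FILE is a hypothetical setting name, not taken from the post.
        path = crawler.settings.get("PROXY_FILE", "proxies.json")
        with open(path, encoding="utf-8") as f:
            return cls(json.load(f))

    def process_request(self, request, spider):
        # Scrapy reads the proxy for a request from request.meta["proxy"].
        request.meta["proxy"] = random.choice(self.proxies)
        return None  # let the request continue through the middleware chain


class MyUserAgentMiddleWare:
    """Downloader middleware: set a random User-Agent header on every request."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENT_LIST is a hypothetical setting name, not taken from the post.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        request.headers["User-Agent"] = random.choice(self.user_agents)
        return None
```

Because retried and redirected requests pass through process_request again, each attempt gets a fresh random proxy and User-Agent, which is what makes a larger retry count useful with an unreliable proxy list.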
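In all of the analysis below:
real_url is a correct URL as given in the links on 58同城 pages, e.g. https://bj.58.com/ershoufang/37786966127392x.shtml
fake_url is a link on 58同城 pages whose URL contains 'zd_p'; it has to be followed through a redirect to reach the real_url, e.g. https://short.58.com/zd_p/887076ce-1bfa-4142-ae0f-59c079a078e9/
jump_url is the URL a fake_url redirects to; it is the bridge to the real_url, e.g. the https://jing.58.com/adJump?adType=3&target=... URLs that appear in the logs below
firewall is a verification URL on the 58同城 servers, e.g. GET https://callback.58.com/firewall/verifycode?......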
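The four URL kinds defined above can be told apart with simple substring checks. The helper below is a minimal sketch under that assumption: the function name is made up for illustration, and the 'Jump' marker is taken from the adJump URLs that appear in the logs.

```python
def classify_url(url):
    """Label a 58同城 URL with one of the four kinds used in this post."""
    # Check the firewall first: its query string embeds the original URL.
    if "firewall" in url:
        return "firewall"
    if "zd_p" in url:
        return "fake_url"
    if "Jump" in url:
        return "jump_url"
    # Anything else is treated as a real listing or detail page.
    return "real_url"


if __name__ == "__main__":
    for sample in (
        "https://bj.58.com/ershoufang/37786966127392x.shtml",
        "https://short.58.com/zd_p/887076ce-1bfa-4142-ae0f-59c079a078e9/",
        "https://callback.58.com/firewall/verifycode?......",
    ):
        print(classify_url(sample), sample)
```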
I. The following situations occurred while crawling:

1) real_url -> firewall -> firewall -> firewall -> too many retries, the request dies.
A correct URL is given, but because the IP accesses the site too often the request is redirected to 58's "too frequent" verification URL. Since I had not written any code to simulate the verification, the URL is abandoned after two retries.
Example:
2019-04-16 14:19:08 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://callback.58.com/firewall/verifycode?serialId=5167d73136b2b181a1f31897773da5fa_df9c5d69d8f64ab7acbd93658f644092&code=22&sign=90346b3cf6733d799b204c2fdb508612&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37786966127392x.shtml> from <GET https://bj.58.com/ershoufang/37786966127392x.shtml>
2019-04-16 14:19:18 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://callback.58.com/firewall/verifycode?serialId=5167d73136b2b181a1f31897773da5fa_df9c5d69d8f64ab7acbd93658f644092&code=22&sign=90346b3cf6733d799b204c2fdb508612&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37786966127392x.shtml> (failed 1 times): An error occurred while connecting: [Failure instance: Traceback (failure with no frames): <class 'twisted.internet.error.ConnectionLost'>: Connection to the other side was lost in a non-clean fashion: Connection lost.

2) real_url -> firewall -> firewall, the firewall page is fetched -> because it is the wrong page, data extraction fails with a NoneType error.
Example:
2019-04-16 14:18:49 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://callback.58.com/firewall/verifycode?serialId=fa0b4cbd0ad45dfd70b236d523d35fe4_4766f82648964a8190d624a446194d0b&code=22&sign=36be5e04f16ed03203be421da14859a9&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37785831004063x.shtml> from <GET https://bj.58.com/ershoufang/37785831004063x.shtml>
2019-04-16 14:18:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://callback.58.com/firewall/verifycode?serialId=fa0b4cbd0ad45dfd70b236d523d35fe4_4766f82648964a8190d624a446194d0b&code=22&sign=36be5e04f16ed03203be421da14859a9&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37785831004063x.shtml> (referer: https://bj.58.com/ershoufang/)
2019-04-16 14:18:52 [scrapy.core.scraper] ERROR: Spider error processing <GET https://callback.58.com/firewall/verifycode?serialId=fa0b4cbd0ad45dfd70b236d523d35fe4_4766f82648964a8190d624a446194d0b&code=22&sign=36be5e04f16ed03203be421da14859a9&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37785831004063x.shtml> (referer: https://bj.58.com/ershoufang/)
Traceback (most recent call last):

3) fake_url -> jump_url -> jump_url -> jump_url -> the URL is abandoned.
Example:
2019-04-16 16:24:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P1mdnjbYrjEdnHDknL980v6YUyk_uaYYm191nH-hPiYvnWmYsHwhrHNVryF6nBdWmWFBmWb3mvNLuAn_nHDQP1bOnWDYnHcLP1DQPjnvrak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10vPHTOPj9YPHDQnjnhIgP-0h-b5HmQnHmOnHn1nHnYPWDQFh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHEkn1T1PWckPaukULPGIA-fUWYzriuWUA-Wpv-b5H9OnWnkPhcOsHNYrHDVPAPBuid6mHFWsH9QuyNYuy7bnvw-raukmgF6UHYQnj0LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LEV0A7zmydLuy-MpZEV0Anh0A7MuRqYXgK-5HD_nHnh0ZFfuZRWIA-b5HDknau1UAqYpyEqnHTknjcdrauYpyEqmyDQnyNOP1bVuW9QPaYYPAEQsHbQm1bVuHNOmvDdmWb3rymQ> from <GET https://short.58.com/zd_p/892306b9-5491-4cbe-aa2c-81ee4ead3de8/>
2019-04-16 16:24:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://jing.58.com/adJump?adType=3&target=pZwY0Znl...> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2019-04-16 16:24:59 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://jing.58.com/adJump?adType=3&target=pZwY0Znl...> (failed 2 times): Could not open CONNECT tunnel with proxy 104.236.248.219:3128 [{'status': 503, 'reason': b'Service Unavailable'}]
2019-04-16 16:25:05 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://jing.58.com/adJump?adType=3&target=pZwY0Znl...> (failed 3 times): [<twisted.python.failure.Failure twisted.internet.error.ConnectionLost: Connection to the other side was lost in a non-clean fashion: Connection lost.>]

4) fake_url -> jump_url -> real_url -> firewall. The real_url is finally obtained, only to hit the wall again because of too-frequent requests.
Example:
2019-04-16 14:19:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P10LP1DLP10znHcYnM980v6YUyk_uadhnAn1nhFhnaY3nH6-sHwWuAnVmWEvriYzmHP6PvwWuyRWmhn_nHDQP10zrjnzPW0QnHTknHT3rak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10LP10QP10LnWDzPjchIgP-0h-b5HDzrHDvrjnOrjbzPj9vFh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHnOPHNznjbdnBukULPGIA-fUWY1rauWUA-Wpv-b5H93P1TLPhP-sH7BuhDVPjDYnBd6uHKhsHNOm1TLryDkP16-riukmgF6UHYQnj0LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LEV0A7zmydLuy-MpZEV0Anh0A7MuRqYXgK-5HD_nHnh0ZFfuZRWIA-b5HDknau1UAqYpyEqnHTknjcdrauYpyEqnAEOnyPWrynVrAN3PaYYmvnksyDvuhcVrHTvPjP-m1czPWRh> from <GET https://short.58.com/zd_p/887076ce-1bfa-4142-ae0f-59c079a078e9/>
2019-04-16 14:19:12 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://bj.58.com/ershoufang/37777177721242x.shtml?adtype=3> from <GET https://jing.58.com/adJump?adType=3&target=pZwY0Znl...>
2019-04-16 14:19:18 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://callback.58.com/firewall/verifycode?serialId=75196bd68f771f168bdbcaa7e8a97a6b_f35824ea81fa488aa5e974355cd785da&code=22&sign=71677e2c4c84c2db8421e233411db814&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37777177721242x.shtml%3Fadtype%3D3> from <GET https://bj.58.com/ershoufang/37777177721242x.shtml?adtype=3>
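One detail visible in the logs: every firewall URL carries the page it was protecting, percent-encoded, in its url= query parameter. The sketch below shows how that original address could be recovered with the standard library. The helper name is invented for illustration, and this relies only on what the logged URLs show, not on any documented 58同城 behaviour.

```python
from urllib.parse import parse_qs, urlsplit


def original_url_from_firewall(firewall_url):
    """Return the decoded `url` query parameter of a firewall URL, or None if absent."""
    query = parse_qs(urlsplit(firewall_url).query)
    values = query.get("url")
    return values[0] if values else None


if __name__ == "__main__":
    logged = (
        "https://callback.58.com/firewall/verifycode"
        "?serialId=5167d73136b2b181a1f31897773da5fa_df9c5d69d8f64ab7acbd93658f644092"
        "&code=22&sign=90346b3cf6733d799b204c2fdb508612&namespace=ershoufangphp"
        "&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37786966127392x.shtml"
    )
    print(original_url_from_firewall(logged))
```

This is handy mainly for diagnostics, for example to report which listing pages were blocked most often.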
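II. Solutions for the situations above
General idea:
Crawling this data requires no login, so when 58's firewall presents its hard-to-crack trajectory-based verification, we can simply switch to another IP.
Looking at the failed requests, the causes lie mainly in redirects and retries. For retries: my proxies have a low success rate and I am not maintaining a live proxy pool with Flask, so I use a larger RETRY_TIMES. For redirects: as shown above, there are redirects between several kinds of URL, so a redirect middleware is unavoidable, and process_response has to treat each kind of redirect differently. The redirect problem is worked through in detail below.
The work: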
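Why a larger RETRY_TIMES helps can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that every attempt fails independently with the same probability, which real proxies will not exactly satisfy. The 70% failure rate is a made-up number, not a measurement from this crawl.

```python
def give_up_probability(fail_rate, retry_times):
    """Chance that the first attempt and all `retry_times` retries fail."""
    if not 0.0 <= fail_rate <= 1.0:
        raise ValueError("fail_rate must be between 0 and 1")
    return fail_rate ** (retry_times + 1)


if __name__ == "__main__":
    # 2 is Scrapy's default RETRY_TIMES, 8 is the value in settings.py above.
    for retry_times in (2, 8):
        share = give_up_probability(0.7, retry_times)
        print("RETRY_TIMES=%d -> %.1f%% of URLs abandoned" % (retry_times, share * 100))
```

Under these assumptions, a URL is abandoned about 34% of the time with 2 retries but only about 4% of the time with 8, which is the effect the larger setting is after.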
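A brief summary of the four redirect situations above:
1. real_url -> firewall -> firewall -> firewall -> too many retries, the request dies
Cause: requests are too frequent and redirects are allowed, so the request ends up at the firewall instead of re-crawling the real_url.
2. real_url -> firewall -> firewall, the firewall page is fetched -> because it is the wrong page, data extraction fails with a NoneType error
Cause: requests are too frequent and redirects are allowed, so the request ends up at the firewall instead of re-crawling the real_url.
3. fake_url -> jump_url -> jump_url -> jump_url -> the URL is abandoned
Most likely the proxies are to blame for the repeated retries.
4. fake_url -> jump_url -> real_url -> firewall: the real_url is finally obtained, only to hit the wall again because of too-frequent requests
Even after a fake_url has finally been redirected to its real_url, frequent requests can still send it into the wall, which is situation 1 again.

Working through them one by one:
If only setting REDIRECT_ENABLED=False in settings.py were enough. It is not: as situation 4 shows, the real_url we need is reached only by hopping from the fake_url through a chain of redirects. This is the trap 58同城 has set.

Solution for situations 1 and 2:
Once 1 and 2 jump from real_url to firewall they are off track, so the obvious idea is simply not to let a real_url redirect: if the current URL is a real_url, set dont_redirect=True (default False) in the meta of the scrapy.Request. But that is not the whole story. The real_url request says "you won't let me redirect, yet you hand me a rubbish IP and force me into the wall", and after hitting the wall nothing handles it. It is like Xiaoqiang promising to back Xiaoming up in a fight and then never showing up: Xiaoming is forced into a 1v5 and gets beaten black and blue while Xiaoqiang sits at home happily playing games. The result is that this redirect gets no follow-up handling, and the log shows:
2019-04-17 08:10:06 [scrapy.core.engine] DEBUG: Crawled (302) <GET https://bj.58.com/ershoufang/> (referer: https://bj.58.com/ershoufang/)
2019-04-17 08:10:06 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <302 https://bj.58.com/ershoufang/>: HTTP status code is not handled or not allowed
2019-04-17 08:10:06 [scrapy.core.engine] INFO: Closing spider (finished)
It would not be so bad if this happened to one of the later URLs, but as shown above it happened to the start URL, so the spider closed without crawling even the first URL. gg.
So how should it be handled? Drop dont_redirect=True for real_url altogether and keep redirects globally enabled (the default). Write a custom MyRedirectMiddleware that keeps all of RedirectMiddleware's logic and adds a check at the real_url -> firewall point: catch the real_url and, before the jump actually happens, return Request(real_url...). One more thing: the real_url has already been requested once, so its fingerprint is on record in the duplicate filter. Remember to pass dont_filter=True, and remember to pass callback=spider.parse_xxx.

Solution for situation 3:
In the custom MyRedirectMiddleware, which keeps all of RedirectMiddleware's logic, add a check at the fake_url -> jump_url point: if the redirect target is a jump_url, give it more retries. With the proxy middleware in place, this is generally enough to reach the real_url in the end.

Solution for situation 4:
In the custom MyRedirectMiddleware, which keeps all of RedirectMiddleware's logic, add a check at the jump_url -> real_url point: if the redirect target is neither a jump_url nor the firewall, it is almost certainly the real_url, so just let the redirect to the real_url proceed.
After all that, the good news is that situation 4 needs no special treatment. For the jump_url -> real_url part, redirects are globally enabled and the jump_url has been given persistent retries in situation 3, so in practice the real_url will be reached. And the real_url -> firewall part is exactly what the solution for situation 1 addresses. Situation 4 therefore solves itself.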
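The per-situation decisions described above boil down to one small rule applied to each 302. The function below is a minimal sketch of that rule only, separated from Scrapy so it can be read and tested on its own. The function name and the action labels are made up for illustration.

```python
def plan_redirect(request_url, redirected_url):
    """Decide what to do with a 302 from `request_url` to `redirected_url`.

    Returns an (action, url) pair, where url is the address to request next.
    """
    if "firewall" in redirected_url:
        # Situations 1 and 2 (and the tail of 4): do not follow the redirect,
        # request the original URL again, which will use another proxy.
        return ("re-request", request_url)
    if "Jump" in redirected_url:
        # Situation 3: follow the redirect, but with a larger retry budget.
        return ("follow-with-extra-retries", redirected_url)
    # Anything else is assumed to be a real_url: follow it normally.
    return ("follow", redirected_url)
```

For example, a redirect from a fake_url to an adJump URL yields "follow-with-extra-retries", while a redirect from a real_url to the firewall yields "re-request" with the real_url itself.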
Concrete code (excerpt from the redirect middleware):
# -*- coding:utf-8 -*-
# Author: Tarantiner
# @Time :2019/4/17 18:26
from urllib.parse import urljoin

from scrapy import Request
from scrapy.downloadermiddlewares.redirect import BaseRedirectMiddleware
from w3lib.url import safe_url_string


class MyRedirectMiddleware(BaseRedirectMiddleware):
    """Keeps the logic of RedirectMiddleware.process_response and adds two checks."""

    def process_response(self, request, response, spider):
        # Pass the response through untouched when redirect handling is disabled for it.
        if (request.meta.get('dont_redirect', False) or
                response.status in getattr(spider, 'handle_httpstatus_list', []) or
                response.status in request.meta.get('handle_httpstatus_list', []) or
                request.meta.get('handle_httpstatus_all', False)):
            return response

        allowed_status = (301, 302, 303, 307, 308)
        if 'Location' not in response.headers or response.status not in allowed_status:
            return response

        location = safe_url_string(response.headers['location'])
        redirected_url = urljoin(request.url, location)

        if response.status in (301, 307, 308) or request.method == 'HEAD':
            redirected = request.replace(url=redirected_url)
            return self._redirect(redirected, request, spider, response.status)

        if 'firewall' in redirected_url:
            # Guards against situations 1 and 2 (real_url -> firewall):
            # request the original URL again instead of following the redirect.
            return Request(response.url, callback=spider.parse_detail,
                           dont_filter=True, meta={'dont_redirect': False})

        if 'Jump' in redirected_url:
            # Guards against situation 3 (fake_url -> jump_url -> jump_url -> jump_url -> abandoned).
            # Every time this jump URL is met the retry budget is set again, i.e. practically unlimited retries.
            # Note: passing meta= here replaces the request's whole meta dict.
            new_request = request.replace(url=redirected_url, method='GET', body='',
                                          meta={'max_retry_times': 12})
        else:
            new_request = self._redirect_request_using_get(request, redirected_url)
        return self._redirect(new_request, request, spider, response.status)
Crawl results after the fix:

Situations 1 and 2: real_url -> real_url -> 200, as follows
redirected_url: https://callback.58.com/firewall/verifycode?serialId=8b8b4a1ead5a3ded505d96dcc8e42004_21b60bb0e6194aeea99c0b42f0f99c2f&code=22&sign=f86bd444c70b93fc537503ef857276ec&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37688560543505x.shtml
response_url: https://bj.58.com/ershoufang/37688560543505x.shtml
2019-04-17 18:49:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bj.58.com/ershoufang/37610200172685x.shtml?adtype=3> (referer: https://bj.58.com/ershoufang/)
2019-04-17 18:49:02 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/ershoufang/37610200172685x.shtml?adtype=3>
As can be seen, after real_url -> firewall the firewall page is never actually crawled; the real_url is requested again and returns 200.

Situation 3: fake_url -> jump_url -> jump_url -> real_url -> 200, as follows
2019-04-16 22:43:27 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P1N3P1mzPjT1PHDknL980v6YUyk_uaY3PH6bmHwbmiY3PhDdsHwBnHnVrHnzridbuHckPjPbmHmvP1N_nHDQn1cLPH9dP101n1bdPHN3Pak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10drj0vnWEkn1NQnjnhIgP-0h-b5Hmkn10QrHTvn1NznHnLFh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHEzPH0OPHmLnBukULPGIA-fUWYOFhP_pyPopyEqnAmzuW0QnjNVn10kPiYYryF-sH6brynVmvDYmH0QPHEvPyRhFMK60h7V5HDkP10lnaukUA7YuhqzUHYVniu_pgPYXhEqUA-1IadkmgF6UgI-pyICIadkmzukmyI-gLwO0ANqnikQnzuk0hqbIyPYpyEqnHTkFMP_ULwGujYQnjTknWN3FMwGujYvuHR6P1PWuiYkuyNdsHELryEVrAmOmzYzP101rAmkrju6PjN> from <GET https://short.58.com/zd_p/0f2f7105-3705-49be-8d9c-ca4a715465ef/>
2019-04-16 22:43:59 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://jing.58.com/adJump?adType=3&target=pZwY0Znl...> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2019-04-16 22:44:17 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://bj.58.com/ershoufang/37587624035103x.shtml?adtype=3> from <GET https://jing.58.com/adJump?adType=3&target=pZwY0Znl...>
2019-04-16 22:44:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bj.58.com/ershoufang/37587624035103x.shtml?adtype=3> (referer: https://bj.58.com/ershoufang/)
2019-04-16 22:44:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/ershoufang/37587624035103x.shtml?adtype=3>

Situation 4: fake_url -> jump_url -> real_url -> 200, as follows
2019-04-16 22:43:33 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P10dPWTYPHT3n1T1nZ980v6YUyk_uaY3PH6bmHwbmiY3PhDdsHwBnHnVrHnzridbuHckPjPbmHmvP1N_nHDQPWcvrjDkPj0vP1cdPjNzrak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10LPHmkPjNkrjnkn1ThIgP-0h-b5HN3P1Dkn1EOPWT1rjN3Fh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHEzPH0OPHmLnBukULPGIA-fUWYdFhP_pyPopyEqmW76rHwbrjEVujb1mBYYnW66sHb1PjcVmWELujN1nH-Bm193FMK60h7V5HDkP10lnaukUA7YuhqzUHYVniu_pgPYXhEqUA-1IadkmgF6UgI-pyICIadkmzukmyI-gLwO0ANqnikQnzuk0hqbIyPYpyEqnHTkFMP_ULwGujYQnjTknWN3FMwGujYvnjPWnHbvmzY3P1-6sHEvrjTVrywbuBYzmWNOuHw-nAmzPj9> from <GET https://short.58.com/zd_p/b1a94d84-d93b-428a-9342-b47d5319bc88/>
2019-04-16 22:43:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://bj.58.com/ershoufang/37756045083030x.shtml?adtype=3> from <GET https://jing.58.com/adJump?adType=3&target=pZwY0Znl...>
2019-04-16 22:43:53 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bj.58.com/ershoufang/37756045083030x.shtml?adtype=3> (referer: https://bj.58.com/ershoufang/)
2019-04-16 22:43:53 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/ershoufang/37756045083030x.shtml?adtype=3>
Of course, I also captured one superb log. Watching the URL redirect exactly as expected is deeply satisfying:
fake_url -> jump_url -> real_url -> retry 1 times -> retry 2 times --- firewall, which is not actually visited; a new Request is issued instead ---> real_url -> 200
2019-04-17 16:26:52 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://jing.58.com/adJump?adType=3&target=pZwY0ZnlszqBpB3draOWUvYfugF1pAqduh78uzt1P1m3rj0OP1TOrjmdng980v6YUyk_uadbujNYmyEOuBYknhuWsHEYPWNVrjmLmiYdmHKbmWTkmyDQmhc_nHDQrjEkPj9vn1m1rjmYPW03Pak_FhQfuvIGU-qd0vRzgv-b5HThuA-107qWmgw-5HDzFhwG0LKxUAqWmykqniuWUA--UMwxIgP-0-qGujYhuyOYpgwOpyEqn10vrj9LrH0krH9vPHDhIgP-0h-b5HNLP1DLnHT3rj9Orj0YFh-VuybqFhR8IA-YXgwO0ANqnau-UMwGIZ-xmv7YuHYhuyOYpgwOgvQfmv7_5iubpgPkgLwGUyNqnHNdPHE3rHN3PH0vnaukULPGIA-fUWYQnauWUA-Wpv-b5HbkPWn1mHm1sHIhnjmVPj-6raY3rHFBsHT3njubnAPhP1bvuBukmgF6UHYQnj0LrWTh0AQ6IAuf0hYqsHDhUA-1IZGb5yQG0LEV0A7zmydLuy-MpZEV0Anh0A7MuRqYXgK-5HD_nHnh0ZFfuZRWIA-b5HDknau1UAqYpyEqnHTknjcdrauYpyEqmHnQnHFhnjNVnjuWuiYYmHFBsH--rHcVuyu-rAm3rjmzPWTL> from <GET https://short.58.com/zd_p/90633a63-7f06-49a8-892b-0806d0cf796f/>
2019-04-17 16:27:04 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3> from <GET https://jing.58.com/adJump?adType=3&target=pZwY0Znl...>
2019-04-17 16:27:26 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3> (failed 1 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
2019-04-17 16:28:00 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3> (failed 2 times): TCP connection timed out: 10060: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
redirect_url: https://callback.58.com/firewall/verifycode?serialId=70e3ea25cb505bc3d0746bb61d508d53_6da701bcb6ca44fd92bbe820a73dca84&code=22&sign=cc2a1d287fa102f0f21d33d91b3c51ea&namespace=ershoufangphp&url=https%3A%2F%2Fbj.58.com%2Fershoufang%2F37688797098651x.shtml%3Fadtype%3D3
response_url: https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3
2019-04-17 16:28:50 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3> (referer: https://bj.58.com/ershoufang/)
2019-04-17 16:28:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://bj.58.com/ershoufang/37688797098651x.shtml?adtype=3>
This crawl path arguably passes through all four situations and still ends with the data scraped successfully, so it should be fairly representative.
The stats from one of the runs:
{'downloader/exception_count': 136,
 'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 16,
 'downloader/exception_type_count/twisted.internet.error.ConnectionRefusedError': 11,
 'downloader/exception_type_count/twisted.internet.error.TCPTimedOutError': 76,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 30,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
 'downloader/request_bytes': 384128,
 'downloader/request_count': 750,
 'downloader/request_method_count/GET': 750,
 'downloader/response_bytes': 2385832,
 'downloader/response_count': 614,
 'downloader/response_status_count/200': 123,
 'downloader/response_status_count/302': 490,
 'downloader/response_status_count/504': 1,
 'dupefilter/filtered': 122,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 17, 10, 52, 26, 392186),
 'item_scraped_count': 122,
 'log_count/DEBUG': 500,
 'log_count/INFO': 27,
 'log_count/WARNING': 2,
 'request_depth_max': 1,
 'response_received_count': 123,
 'retry/count': 137,
 'retry/reason_count/504 Gateway Time-out': 1,
 'retry/reason_count/scrapy.core.downloader.handlers.http11.TunnelError': 16,
 'retry/reason_count/twisted.internet.error.ConnectionRefusedError': 11,
 'retry/reason_count/twisted.internet.error.TCPTimedOutError': 76,
 'retry/reason_count/twisted.internet.error.TimeoutError': 30,
 'retry/reason_count/twisted.web._newclient.ResponseNeverReceived': 3,
 'scheduler/dequeued': 750,
 'scheduler/dequeued/memory': 750,
 'scheduler/enqueued': 750,
 'scheduler/enqueued/memory': 750,
 'start_time': datetime.datetime(2019, 4, 17, 10, 33, 6, 936247)}
2019-04-17 18:52:26 [scrapy.core.engine] INFO: Spider closed (finished)
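A few ratios make these stats easier to read. The snippet below is a small worked calculation that uses only numbers copied from the dump above. The derived metric names are my own labels, not Scrapy stats keys.

```python
from datetime import datetime

# Values copied from the stats dump above.
stats = {
    "downloader/exception_count": 136,
    "downloader/request_count": 750,
    "downloader/response_count": 614,
    "downloader/response_status_count/200": 123,
    "downloader/response_status_count/302": 490,
    "item_scraped_count": 122,
    "start_time": datetime(2019, 4, 17, 10, 33, 6, 936247),
    "finish_time": datetime(2019, 4, 17, 10, 52, 26, 392186),
}

requests = stats["downloader/request_count"]
# Consistency check: every request ended in a response or a download exception.
assert stats["downloader/response_count"] + stats["downloader/exception_count"] == requests

ok_share = stats["downloader/response_status_count/200"] / requests
redirect_share = stats["downloader/response_status_count/302"] / requests
error_share = stats["downloader/exception_count"] / requests
minutes = (stats["finish_time"] - stats["start_time"]).total_seconds() / 60
items_per_minute = stats["item_scraped_count"] / minutes

print("200 responses:       {:.1%} of requests".format(ok_share))
print("302 responses:       {:.1%} of requests".format(redirect_share))
print("download exceptions: {:.1%} of requests".format(error_share))
print("run time:            {:.1f} minutes".format(minutes))
print("throughput:          {:.1f} items per minute".format(items_per_minute))
```

Roughly two thirds of all requests were 302 redirects and only about one in six returned a page, which shows how much of the crawl time goes into redirects and retries.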
As you can see, there is still plenty of room for improvement. I will share my optimisation ideas in a follow-up O(∩_∩)O