Python Web Scraping in Practice: A 58.com (58同城) Second-Hand Goods Crawler

阿新 • Published: 2018-11-23

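Sigh~~ I had meant to upload the code today, but there are still a few bugs, so it probably won't be ready in time. I'll keep working on it tomorrow!
Today, let's take a careful look at how to crawl the second-hand goods listings on 58.com.
We'll complete the task in four steps~
Target site analysis
Target URL: http://bj.58.com/sale.shtml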
Step 1: Analyze the main page
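On the main page we need to extract all of the second-level categories. Pay particular attention to the content in the green box: its format and content differ considerably from the rest, so we drop it right from the start, which cuts down the amount of code and work to some extent.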

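    # Parse main_url = "http://bj.58.com/sale.shtml" to get the category URLs.
    # Needs: from urllib.parse import urljoin; from bs4 import BeautifulSoup
    def parse_index(self):
        category_urls = []
        html = self.parse_url(self.main_url)
        soup = BeautifulSoup(html.text, 'lxml')
        # Each second-level category is an <a> inside the sub-menu list
        links = soup.select("ul.ym-submnu > li > b > a")
        for link in links:
            # hrefs may be relative, so resolve them against the main URL
            link_url = urljoin(self.main_url, link.attrs['href'])
            category_urls.append(link_url)
        # Store under the 'category_urls' key, which parse_page_url reads back
        self.DB.insert(collection='Category_urls', items=dict(category_urls=category_urls))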
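
To see why `urljoin` is applied to every `href`, here is a minimal sketch that uses only the standard library. The sample `href` values are made up for illustration (the real page may mix relative and absolute links differently), and `to_absolute` is a hypothetical helper, not part of the original spider.

```python
from urllib.parse import urljoin

MAIN_URL = "http://bj.58.com/sale.shtml"


def to_absolute(hrefs, base=MAIN_URL):
    """Resolve each href against the base URL, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for href in hrefs:
        url = urljoin(base, href)
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Hypothetical hrefs: root-relative, already absolute, and a duplicate.
sample_hrefs = ["/shouji/", "http://bj.58.com/diannao/", "/shouji/"]
print(to_absolute(sample_hrefs))
```

Absolute links pass through unchanged, while root-relative ones pick up the scheme and host of the main URL, so the stored category list is uniform either way.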

Step 2: Extract the detail-page links
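Looking at the list pages, we find that some listings carry no actual price (the price field reads '面議', meaning negotiable). We can filter these out while extracting the detail-page links, which leaves a little less to do during data analysis.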

    # Walk the category links and collect the detail-page URLs of the listings
    def parse_page_url(self):
        items = self.DB.find_data(collection='Category_urls')
        for item in items:
            cate_urls = item['category_urls']
            for cate_url in cate_urls:
                page_urls = []
                self.proxy = self.get_proxy()
                # These two categories are skipped
                if 'tongxunyw' in cate_url:
                    print("error", cate_url)
                elif 'ershouqiugou' in cate_url:
                    print("error", cate_url)
                else:
                    print(cate_url)
                    # List pages are addressed as <category>pn1/ ... <category>pn99/
                    for i in range(1, 100):
                        print("Parse page ", i)
                        url = cate_url + 'pn{}/'.format(i)
                        html = self.parse_url(url)
                        soup = BeautifulSoup(html.text, 'lxml')
                        trs = soup.select("tr[_pos='0']")
                        for tr in trs:
                            price = tr.select("b.pri")[0].get_text()
                            # '面議' means "negotiable"; keep only listings with a real price
                            if price != '面議':
                                temp_link = tr.select("td.t > a:nth-of-type(1)")[0].attrs['href']
                                page_url = urljoin(self.main_url, temp_link)
                                page_urls.append(page_url)
                    # One document per category, holding all of its detail-page URLs
                    record = dict(
                        page_urls=page_urls,
                    )
                    self.DB.insert(collection='Page_urls', items=record)
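
The skipping, pagination, and price-filtering logic above can be pulled out into small pure functions, which makes it testable without touching the network. This is only a sketch under assumptions: `SKIPPED_CATEGORIES`, `should_skip`, `page_urls_for`, and `priced_links` are hypothetical names, and the `(price, href)` tuples stand in for the values the spider reads from each table row.

```python
from urllib.parse import urljoin

MAIN_URL = "http://bj.58.com/sale.shtml"
# Category slugs that the code above skips.
SKIPPED_CATEGORIES = ("tongxunyw", "ershouqiugou")
NEGOTIABLE = "面議"  # price text meaning "negotiable"


def should_skip(cate_url):
    """True for category URLs the spider does not crawl."""
    return any(slug in cate_url for slug in SKIPPED_CATEGORIES)


def page_urls_for(cate_url, max_pages=99):
    """List-page URLs pn1/ .. pn<max_pages>/, matching range(1, 100) above."""
    return [cate_url + "pn{}/".format(i) for i in range(1, max_pages + 1)]


def priced_links(rows, base=MAIN_URL):
    """Keep absolute detail links only for rows that show a concrete price.

    Unlike the code above, the price is stripped of whitespace before comparing;
    that is an extra precaution, not something the original does.
    """
    return [urljoin(base, href) for price, href in rows if price.strip() != NEGOTIABLE]


# Hypothetical rows as (price text, href) pairs.
rows = [("100", "/shouji/1.shtml"), ("面議", "/shouji/2.shtml")]
print(priced_links(rows))
```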

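Step 3: Extract the detail-page data
The detail pages fall into roughly three types, so three extraction modes are needed, as the figure shows.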
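At first glance they all look much the same, but the content actually differs a great deal. For example, listings that come from Zhuanzhuan (轉轉) cannot be extracted directly from the web page; you have to go through the Zhuanzhuan API, fetch the JSON data, and extract from that. It took me many rounds of debugging while coding to realize that there are three kinds of page here. Ugh! Nothing for it but to grit my teeth and keep writing~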

    # Parse Old58_urls pages and extract the data
    def parse_page_58(self, page_url):
        html = self.parse_url(page_url)
        soup = BeautifulSoup(html.text, 'lxml')
        temp_title = soup.title.get_text()
        title = temp_title.split(" - ")[0]
        try:
            temp_time = soup.select("div.detail-title__info > div")[0].get_text()
            time = temp_time.split(" ")[0]
            temp_price = soup.select("span.infocard__container__item__main__text--price")[0].get_text()
            price = temp_price.split()[0]
            temp = soup.select("div.infocard__container > div:nth-of-type(2) > div:nth-of-type(2)")[0].get_text()
            # '成新' (as in '9成新', roughly "90% new") describes the item's condition;
            # when that row is present, the area sits one row further down
            if '成新' in temp:
                color = temp
                temp_area = soup.select("div.infocard__container > div:nth-of-type(3) > div:nth-of-type(2)")[0]
            else:
                color = None
                temp_area = soup.select("div.infocard__container > div:nth-of-type(2) > div:nth-of-type(2)")[0]
            temp_area = list(temp_area.stripped_strings)
            # Keep only strings that are non-empty once "-" is removed (drops bare separators)
            area = list(filter(lambda x: x.replace("-", ''), temp_area))
            temp_cate = list(soup.select("div.nav")[0].stripped_strings)
            # Same idea for the ">" separators in the breadcrumb
            cate = list(filter(lambda x: x.replace(">", ''), temp_cate))
            item = dict(
                title=title,
                time=time,
                price=price,
                color=color,
                area=area,
                cate=cate,
            )
            print(item)
            self.DB.insert(collection='Page_data', items=item)
        except Exception as exc:
            # Any missing element ends up here, not only a 404 page
            print("Error 404!", page_url, exc)

    # Parse New58_urls pages and extract the data
    def parse_page_now58(self, page_url):
        html = self.parse_url(page_url)
        soup = BeautifulSoup(html.text, 'lxml')
        try:
            title = soup.select("div.detail-info-tit")[0].get_text()
            temp_cate = soup.select("div.nav")[0]
            temp_cate = list(temp_cate.stripped_strings)
            # Drop the bare ">" separators from the breadcrumb
            cate = list(filter(lambda x: x.replace(">", ''), temp_cate))
            # The info list holds time, condition and area in its first three rows
            ul = soup.select("ul.detail-info-bd")[0]
            time = ul.select("li:nth-of-type(1) > span:nth-of-type(2)")[0].get_text()
            color = ul.select("li:nth-of-type(2) > span:nth-of-type(2)")[0].get_text()
            area = ul.select("li:nth-of-type(3) > span:nth-of-type(2)")[0].get_text()
            temp_price = soup.select("span.info-price-money")[0].get_text()
            # Keep what follows the currency sign
            price = temp_price.split("￥")[-1]
            item = dict(
                title=title,
                time=time,
                price=price,
                color=color,
                area=area,
                cate=cate,
            )
            print(item)
            self.DB.insert(collection='Page_data', items=item)
        except Exception as exc:
            # Any missing element ends up here, not only a 404 page
            print("Error 404!", page_url, exc)

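    # Parse ZZ58_urls (Zhuanzhuan) pages by fetching their JSON data
    # Needs: import re
    def parse_page_zz(self, page_url):
        # The listing id is carried in the infoId query parameter of the page URL
        pattern = re.compile('infoId=(.*?)&', re.S)
        infoId = re.findall(pattern, page_url)
        # temp_api is the API URL template with a placeholder for the id
        html = self.parse_url(self.temp_api.format(infoId[0]))
        try:
            data_json = html.json()
            data = data_json.get('respData')
            local = data.get("location")
            item = dict(
                title=data.get('title'),
                browse_times=data.get("browseCount"),
                price=data.get("nowPrice"),
                origin_price=data.get("oriPrice"),
                area=local.get("local"),
            )
            print(item)
            self.DB.insert(collection='Page_data', items=item)
        except Exception as exc:
            # Invalid JSON or a missing field ends up here, not only a 404 page
            print("Error 404!", page_url, exc)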
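
One thing to watch in `parse_page_zz`: the pattern `infoId=(.*?)&` only matches when another parameter follows `infoId`. As a hedged alternative, the query string can be parsed with the standard library instead. The URLs below are invented examples and `extract_info_id` is a hypothetical helper; the real Zhuanzhuan link format may differ.

```python
from urllib.parse import urlparse, parse_qs


def extract_info_id(page_url):
    """Return the infoId query parameter, or None when it is absent."""
    query = parse_qs(urlparse(page_url).query)
    values = query.get("infoId")
    return values[0] if values else None


# Invented URLs for illustration only.
print(extract_info_id("http://example.com/detail?infoId=123456&from=list"))
print(extract_info_id("http://example.com/detail?from=list&infoId=123456"))
print(extract_info_id("http://example.com/detail?from=list"))
```

Returning `None` for a missing id lets the caller skip the page explicitly instead of failing on `infoId[0]`.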

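Step 4: Crawler logic
Because the overall crawl is fairly large and there is a lot of content, I store all the data in MongoDB so it is easy to use later. The crawl itself runs in stages using multiple processes, which eases the load on the server~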

    # Crawl logic
    # Needs: from multiprocessing import Process
    def run(self):
        # Each stage is switched on by its own flag and runs in a separate process
        if INDEX_ENABLED:
            Index_process = Process(target=self.parse_index)
            Index_process.start()

        if LINK_ENABLED:
            PageUrl_process = Process(target=self.parse_page_url)
            PageUrl_process.start()

        if ITEM_ENABLED:
            Page_process = Process(target=self.parse_page)
            Page_process.start()
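
Note that `start()` returns immediately, so with several flags enabled the stages above run at the same time, even though `parse_page_url` reads the `Category_urls` collection that `parse_index` fills. If the stages should run strictly one after another, one option is to `join()` each process before starting the next. The sketch below assumes that this is the desired behaviour; `enabled_stages` and `run_in_order` are hypothetical helpers, and the flag names simply mirror the ones used above.

```python
from multiprocessing import Process


def enabled_stages(stages, flags):
    """Return the callables whose flag is switched on, preserving stage order."""
    return [func for name, func in stages if flags.get(name, False)]


def run_in_order(funcs):
    """Run each stage in its own process, waiting for it to finish before the next."""
    for func in funcs:
        proc = Process(target=func)
        proc.start()
        proc.join()  # block until this stage is done
```

With a spider instance this might be called as `run_in_order(enabled_stages([("INDEX_ENABLED", spider.parse_index), ("LINK_ENABLED", spider.parse_page_url)], flags))`. On platforms where `multiprocessing` spawns fresh interpreters (Windows, and macOS by default), the call has to sit under an `if __name__ == "__main__":` guard.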

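That's it for today's crawler. The code still has a few problems; once I've debugged it tomorrow I'll upload it~
Source code: https://github.com/NO1117/Sale58_Spider
Python discussion group: 942913325. Everyone is welcome to join and learn together.
Question: how can the crawler's efficiency be improved further? Consider a multithreaded or multiprocess approach~ If you get it working, feel free to get in touch.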
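
As one possible answer to the question above, detail pages can be fetched concurrently with a thread pool, since the work is dominated by waiting on the network. This is a minimal sketch, assuming a `fetch` callable that takes a URL and returns a result; `crawl_concurrently` is a hypothetical name and the worker count is an arbitrary choice.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_concurrently(urls, fetch, max_workers=8):
    """Apply fetch to every URL using a pool of threads; results keep input order.

    An exception raised by fetch is captured and returned in place of the result,
    so one failing page does not abort the whole batch.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception as exc:
            return exc

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe_fetch, urls))
```

In the spider, `fetch` could be a function that downloads and parses one detail page, with the caller checking each result for an exception before storing it.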
