
Python crawler: scraping Ziroom (自如) listings and exporting them to CSV

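1. I am scraping Ziroom listings as the data-collection step for a small project that comes later.
2. Why Ziroom? A classmate who works in rentals told me that its listings are of relatively high quality.
3. I am currently living in Shenzhen, so I use the Shenzhen listings as the example.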

base_url = "http://sz.ziroom.com/z/nl/z3.html"

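This is the start URL. Every page is served by a plain GET request, so the data can be fetched directly, which keeps things very simple.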

1. Build the URL

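import csv, random, re, time
import requests
from bs4 import BeautifulSoup

base_url = "http://sz.ziroom.com/z/nl/z3.html"

class Get_one_page:
    def __init__(self, page):
        self.page = page
        self.parmas = {"p": page}  # sent as the query string ?p=<page>
        self.appartments = []  # listings collected from this page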
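
The constructor relies on `requests` turning the `params` dict into a query string, and the later requests use a `my_headers` dict that this post never shows. Here is a minimal sketch of both, assuming the page number ends up as `?p=N`. The `my_headers` value and the helper name `build_page_url` are placeholders of mine, not taken from the original script:

```python
from urllib.parse import urlencode

base_url = "http://sz.ziroom.com/z/nl/z3.html"

# Assumed value: any browser-like User-Agent. The post does not show its real headers.
my_headers = {"User-Agent": "Mozilla/5.0"}


def build_page_url(page):
    """Hypothetical helper: the URL that a GET with params={"p": page} would request."""
    return base_url + "?" + urlencode({"p": page})


print(build_page_url(3))
```

Printing the URL for a few page numbers and opening it in a browser is a quick way to confirm the paging parameter before starting a long crawl.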

2. getpage: fetch one page and pull out the listing information

    def getpage(self):
        try:
            time.sleep(random.randint(1, 2))  # pause 1-2 seconds between requests
            response = requests.get(url=base_url, params=self.parmas, headers=my_headers)
        except Exception as e:
            print("GET request failed, page " + str(self.page))
            print(e)
            return
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "lxml")
            ul = soup.select("ul[id='houseList']")
            li_list = ul[0].select("li")
        else:
            print("Status code is not 200------" + str(self.page))
            return
        for li in li_list:
            # listing address: the h3 and h4 text of this li
            address = li.select("h3")[0].text + "," + li.select("h4")[0].text
            # listing description
            descripe = li.select(".detail")[0].text.replace(" ", "").replace("\n", ",")[2:]
            # listing tags
            tags = li.select(".room_tags")[0].text.replace("\n", ",")[1:]
            # link to the detail page
            more_href = "http:" + li.select('.more a')[0].attrs["href"]
            # image link
            img_src = "http:" + li.select("img")[0].attrs["_src"]
            price = self.get_price(more_href)
            room = {"address": address, "descripe": descripe, "tags": tags,
                    "more": more_href, "img_src": img_src, "price": price}
            self.appartments.append(room)
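
The `replace(...)[2:]` and `replace(...)[1:]` chains are easier to follow on a concrete string. Below is a small sketch of what they do. The sample strings are made up by me to imitate the whitespace that `.text` returns for nested tags; they are not real Ziroom markup, and the function names are hypothetical:

```python
def clean_detail(raw):
    """Drop spaces, turn newlines into commas, then cut the two leading commas."""
    return raw.replace(" ", "").replace("\n", ",")[2:]


def clean_tags(raw):
    """Turn newlines into commas, then cut the single leading comma."""
    return raw.replace("\n", ",")[1:]


# Made-up samples, not real page content.
sample_detail = "\n\n 15 m2 \n 3/10 floor \n"
sample_tags = "\nbalcony\nnear subway\n"

print(clean_detail(sample_detail))
print(clean_tags(sample_tags))
```

The fixed slices only work while the page keeps the same number of leading newlines, so if the markup changes, the first characters of each field are the place to look.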

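The most important piece of listing information, the price, is rendered on the list page as a set of stitched-together images. The simplest way to read it is to open each listing's detail page, so we define one more method, get_price, whose parameter is the URL of the detail page.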

3. get_price: get the listing price

    def get_price(self, href):
        """Return the monthly rent under quarterly payment, read from the detail page."""
        try:
            time.sleep(random.randint(1, 2))
            response = requests.get(url=href, headers=my_headers)
        except Exception as e:
            print("GET request failed: " + href)
            print(e)
            return "0"  # there is no response to inspect
        price = "0"  # fallback when the price cannot be read
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "lxml")
            try:
                price = soup.select("#room_price")[0].text
            except IndexError:
                print(href)  # no #room_price element on this page
        else:
            print("Status code is not 200------" + href)
        digits = re.findall(r"\d+", price)
        return digits[0] if digits else "0"
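
The last step of get_price keeps only the first run of digits in the price text. A minimal sketch of that step on its own, assuming the price element contains the amount mixed with a currency sign and other text. The sample strings and the name `first_number` are mine:

```python
import re


def first_number(text):
    """Return the first run of digits in text, or "0" when there is none."""
    digits = re.findall(r"\d+", text)
    return digits[0] if digits else "0"


# Made-up samples, not copied from a real detail page.
print(first_number("¥ 2390 /month"))
print(first_number("price unavailable"))
```

Falling back to "0" keeps every row in the CSV, and the zero prices can be filtered out later when the data is analysed.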


4. Export to a CSV file

    def writedata(self):
        def write_csv_file(path, head, data):
            try:
                with open(path, 'w', newline='', encoding="utf-8") as csv_file:
                    writer = csv.writer(csv_file, dialect='excel')
                    if head is not None:
                        writer.writerow(head)
                    for row in data:
                        # order each row's values by the header columns
                        writer.writerow(tuple(row[k] for k in head))

                    print("Wrote CSV file to path %s successfully." % path)
            except Exception as e:
                print("Failed to write CSV file to path: %s, cause: %s" % (path, e))

        head = ("address", "descripe", "tags", "more", "img_src", "price")
        # one file per page, named ziru/page_N.csv; the ziru directory must already exist
        path = "ziru/page_%d.csv" % self.page
        write_csv_file(path, head, self.appartments)
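
Because each listing is already a dict, the standard library's `csv.DictWriter` can do the column ordering for us. This is an alternative sketch, not what the script above uses. It writes into an in-memory buffer so it can be run without touching the disk, and the sample row and the name `rows_to_csv` are made up:

```python
import csv
import io

head = ("address", "descripe", "tags", "more", "img_src", "price")


def rows_to_csv(rows, fieldnames=head):
    """Serialise a list of listing dicts to CSV text, columns ordered by fieldnames."""
    buf = io.StringIO(newline="")
    writer = csv.DictWriter(buf, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up row, only to show the quoting of a field that contains a comma.
sample = [{"address": "Room A,Nanshan", "descripe": "15m2", "tags": "balcony",
           "more": "http://example.com/1", "img_src": "http://example.com/1.jpg",
           "price": "2390"}]
print(rows_to_csv(sample))
```

The address field contains a comma, so the writer wraps it in double quotes. That matters here because the scraper joins several values with commas inside a single field.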

5. Scrape 50 pages in a loop and write them to disk

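if __name__ == '__main__':
    for i in range(1, 51):
        one_page = Get_one_page(i)
        one_page.getpage()    # fetch and parse page i
        one_page.writedata()  # write that page's listings to its own CSV file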

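I write the data out one page at a time so that it is easy to see which page failed, and so that re-scraping costs little: only the failed page has to be fetched again.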
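
With one file per page, finding the pages that still need to be scraped is just a matter of checking which files are missing. A minimal sketch, assuming the files are named `page_N.csv` as in the merge step of this post. The helper `missing_pages` is hypothetical, and the demo runs in a temporary directory instead of the real output folder:

```python
import os
import tempfile


def missing_pages(directory, first=1, last=50):
    """Return the page numbers whose page_N.csv file is absent from directory."""
    return [
        n for n in range(first, last + 1)
        if not os.path.exists(os.path.join(directory, "page_%d.csv" % n))
    ]


# Demo: pretend pages 1 and 3 were already scraped.
with tempfile.TemporaryDirectory() as tmp:
    for n in (1, 3):
        open(os.path.join(tmp, "page_%d.csv" % n), "w").close()
    print(missing_pages(tmp, 1, 4))
```

The returned list can be fed back into the scraping loop in place of `range(1, 51)`.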

6. Merge the 50 pages of listings


import pandas as pd

dfs = []
for i in range(1, 51):
    path = "ziru/page_%d.csv" % i
    # load one page of data
    df = pd.read_csv(path, encoding="utf-8")
    dfs.append(df)
# merge all pages into one DataFrame
ziru = pd.concat(dfs, ignore_index=True)
# export the merged data
ziru.to_csv("ziru.csv")

7. A look at the listing data
