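Python Crawler: Scraping Ziroom Listings and Exporting to CSV
阿新 • Published: 2019-02-13
1. I'm scraping Ziroom listings as the data-collection step for a small project that comes later.
2. I chose Ziroom because a classmate who works in rentals told me its listings are of relatively high quality.
3. Since I'm currently living in Shenzhen, I'll use Shenzhen listings as the example.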
base_url = "http://sz.ziroom.com/z/nl/z3.html"
This is the start URL. Every page can be fetched with a plain GET request, so the scraping itself is very simple.
1. Build the URL
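import csv, random, re, time
import requests
from bs4 import BeautifulSoup

base_url = "http://sz.ziroom.com/z/nl/z3.html"
my_headers = {"User-Agent": "Mozilla/5.0"}  # assumed value; the post does not show its headers

class Get_one_page:
    def __init__(self, page):
        self.page = page
        self.parmas = {"p": page}  # sent as the query string ?p=<page>
        self.path = "ziru/page_%d.csv" % page  # per-page CSV, read back in step 6
        self.appartments = []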
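To see which address each page number maps to, the query string can be built by hand. This is a minimal sketch using only the standard library; `page_url` is a hypothetical helper that is not part of the original script, which instead lets `requests` append `params` to `base_url`.

```python
from urllib.parse import urlencode

base_url = "http://sz.ziroom.com/z/nl/z3.html"

def page_url(page):
    """Return the list-page URL for a page number (hypothetical helper)."""
    # The "p" key mirrors the {"p": page} dict stored on the class
    return base_url + "?" + urlencode({"p": page})

for page in (1, 2, 50):
    print(page_url(page))
```

Printing the URLs before starting a long crawl is a cheap way to confirm the paging parameter in a browser.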
2. getpage: fetch the page and extract the listings
    def getpage(self):
        try:
            time.sleep(random.randint(1, 2))  # pause 1-2 seconds between requests
            response = requests.get(url=base_url, params=self.parmas, headers=my_headers)
        except Exception as e:
            print("GET request failed, page %s" % self.page)  # self.page is an int, so format it
            print(e)
            return
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "lxml")
            ul = soup.select("ul[id='houseList']")
            li_list = ul[0].select("li")
        else:
            print("Status code is not 200------page %s" % self.page)
            return
        for li in li_list:
            # h4 is read from this li; soup.select("h4")[0] would repeat the page's first h4 for every listing
            address = li.select("h3")[0].text + "," + li.select("h4")[0].text  # listing address
            descripe = li.select(".detail")[0].text.replace(" ", "").replace("\n", ",")[2:]  # listing description
            tags = li.select(".room_tags")[0].text.replace("\n", ",")[1:]  # listing tags
            more_href = "http:" + li.select('.more a')[0].attrs["href"]  # detail-page link
            img_src = "http:" + li.select("img")[0].attrs["_src"]  # image link
            price = self.get_price(more_href)
            room = {"address": address,
                    "descripe": descripe,
                    "tags": tags,
                    "more": more_href,
                    "img_src": img_src,
                    "price": price}
            self.appartments.append(room)
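The `descripe` and `tags` fields are cleaned with chained `replace` calls plus a fixed slice, which depends on exactly how many leading newlines the markup produces. A possible alternative, sketched here with a hypothetical `clean_text` helper and a made-up sample string, drops empty lines instead of slicing:

```python
def clean_text(raw):
    """Join the non-empty lines of raw with commas, removing spaces inside each line."""
    parts = [line.replace(" ", "") for line in raw.splitlines()]
    return ",".join(p for p in parts if p)

# Made-up sample; the real .detail text on the site may look different
sample = "\n\n 12 m2 | 5/30 floor \n 3 rooms \n"
print(clean_text(sample))
```

Because empty lines are filtered out, the result does not start or end with a stray comma even if the page adds or removes blank lines.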
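The most important piece of listing information, the price, is rendered on this list page as images stitched together. Only on each listing's detail page can the price be read in a simple enough way, so we define a separate get_price method whose parameter is the URL of the listing's detail page.
3. get_price: get the listing price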
    def get_price(self, href):
        """Return the monthly rent under quarterly payment, read from the detail page."""
        try:
            time.sleep(random.randint(1, 2))
            response = requests.get(url=href, headers=my_headers)
        except Exception as e:
            print("GET request failed: " + href)
            print(e)
            return "0"  # there is no response to inspect, so stop here
        if response.status_code == 200:
            soup = BeautifulSoup(response.text, "lxml")
            try:
                price = soup.select("#room_price")[0].text
            except IndexError:
                # no #room_price element on the page
                print(href)
                price = "0"
        else:
            print("Status code is not 200------" + href)
            price = "0"
        digits = re.findall(r"\d+", price)
        # fall back to "0" when the text contains no digits
        return digits[0] if digits else "0"
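The last step of get_price pulls the first run of digits out of the price text. That step can be tried in isolation; the sketch below uses a hypothetical `first_number` helper and made-up sample strings, since the exact text inside `#room_price` is not shown in the post.

```python
import re

def first_number(text, default="0"):
    """Return the first run of digits in text, or default if there is none."""
    if not text:
        return default
    match = re.search(r"\d+", text)
    return match.group() if match else default

# Made-up samples; the real price text may be formatted differently
for sample in ("¥ 2530 /month", "price unavailable", ""):
    print(repr(sample), "->", first_number(sample))
```

Note that a price written with a thousands separator, such as "2,530", would come back as "2" with this pattern, so it is worth checking a few real detail pages first.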
4. Export to a CSV file
    def writedata(self):
        def write_csv_file(path, head, data):
            try:
                with open(path, 'w', newline='', encoding="utf-8") as csv_file:
                    writer = csv.writer(csv_file, dialect='excel')
                    if head is not None:
                        writer.writerow(head)
                    for row in data:
                        # order each listing's values by the header columns
                        row_data = tuple(row[k] for k in head)
                        writer.writerow(row_data)
                print("Wrote CSV file to path %s successfully." % path)
            except Exception as e:
                print("Failed to write CSV file to path: %s, cause: %s" % (path, e))

        head = ("address", "descripe", "tags", "more", "img_src", "price")
        write_csv_file(self.path, head, self.appartments)
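Since every listing is already a dict, the standard library's `csv.DictWriter` can do the column ordering that the inner loop does by hand. This is only a sketch of that alternative, not the author's method; `write_rows` is a hypothetical helper and the sample row is made up.

```python
import csv
import io

def write_rows(csv_file, head, rows):
    """Write a list of dicts to an open text file as CSV, columns ordered by head."""
    writer = csv.DictWriter(csv_file, fieldnames=head, dialect="excel")
    writer.writeheader()
    writer.writerows(rows)

# Made-up row, written to an in-memory buffer instead of a real file
head = ("address", "price")
rows = [{"address": "Example Road", "price": "2530"}]
buf = io.StringIO()
write_rows(buf, head, rows)
print(buf.getvalue())
```

With the default settings, `DictWriter` raises `ValueError` if a row has a key that is not in `head`, which surfaces a misspelled field name early.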
5. Loop over 50 pages of listings and write them to disk
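if __name__ == '__main__':
    for i in range(1, 51):
        page = Get_one_page(i)
        page.getpage()    # scrape the listings on page i
        page.writedata()  # save them to this page's own CSV file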
Writing one page at a time makes it easy to find which page failed, and keeps the cost of re-scraping low.
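One way to take advantage of the per-page files is to check which ones are missing before re-running the crawl. The sketch below assumes the files are named page_1.csv, page_2.csv and so on, as in the merge step; `missing_pages` is a hypothetical helper, and the demo uses a temporary directory rather than the real output folder.

```python
import os
import tempfile

def missing_pages(directory, first=1, last=50):
    """List the page numbers whose CSV file is absent from directory."""
    return [i for i in range(first, last + 1)
            if not os.path.exists(os.path.join(directory, "page_%d.csv" % i))]

# Demo: create files for pages 1, 2 and 4, then ask about pages 1-5
with tempfile.TemporaryDirectory() as demo_dir:
    for i in (1, 2, 4):
        open(os.path.join(demo_dir, "page_%d.csv" % i), "w").close()
    print(missing_pages(demo_dir, 1, 5))  # pages 3 and 5 have no file yet
```

A file that exists is not necessarily complete, so this check only catches pages that were never written at all.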
6. Merge the 50 pages of listings
import pandas as pd
dfs = []
for i in range(1, 51):
    path = "ziru/page_%d.csv" % i
    # load the data
    df = pd.read_csv(path, encoding="utf-8")
    dfs.append(df)
# merge the data
ziru = pd.concat(dfs, ignore_index=True)
# export the data
ziru.to_csv("ziru.csv")
7. Listing information display