Crawler Program 2: Scraping the Kugou TOP500

The target is the music information in the Kugou TOP500 chart on Kugou's rankings page, as shown in the figure.

The web version of Kugou offers no way to turn pages manually and browse further. However, look at the URL of the first page:

http://www.kugou.com/yy/rank/home/1-8888.html

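Replacing the number 1 with 2 in this URL returns the second page of the chart (figure below). Trying several more values confirms that each number maps to a different page, so only the number after home/ needs to change. Each page lists 22 songs, and 500 / 22 is about 22.7, so 23 URLs are needed in total (pages 1 to 23).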
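The page count above can be checked with a few lines of Python. This is only a sketch: the names `TOTAL_SONGS`, `SONGS_PER_PAGE`, `URL_TEMPLATE`, and `build_page_urls` are hypothetical helpers introduced for illustration, and the figure of 22 songs per page is taken from the observation above rather than from any documented API.

```python
import math

TOTAL_SONGS = 500      # size of the chart, taken from its name
SONGS_PER_PAGE = 22    # songs shown per page, as observed on the site
URL_TEMPLATE = 'http://www.kugou.com/yy/rank/home/{}-8888.html'


def build_page_urls(total=TOTAL_SONGS, per_page=SONGS_PER_PAGE):
    """Return one URL per chart page, numbering pages from 1."""
    # Round up so the final, partially filled page is included.
    page_count = math.ceil(total / per_page)
    return [URL_TEMPLATE.format(page) for page in range(1, page_count + 1)]


urls = build_page_urls()
print(len(urls))   # 23
print(urls[0])     # http://www.kugou.com/yy/rank/home/1-8888.html
print(urls[-1])    # http://www.kugou.com/yy/rank/home/23-8888.html
```

Rounding up rather than down matters here: 22 full pages cover only 484 songs, so the last 16 entries sit on page 23.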

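import time

import requests
from bs4 import BeautifulSoup

# Send a browser-like User-Agent with every request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36'
}


def get_info(url):
    """Fetch one chart page and print rank, singer, song and duration for each entry."""
    wb_data = requests.get(url, headers=headers)
    soup = BeautifulSoup(wb_data.text, 'lxml')
    ranks = soup.select('span.pc_temp_num')
    titles = soup.select('div.pc_temp_songlist > ul > li > a')
    durations = soup.select('span.pc_temp_tips_r > span')
    # Use `duration` as the loop variable so it does not shadow the time module.
    for rank, title, duration in zip(ranks, titles, durations):
        # The link text is assumed to have the form "singer - song";
        # split on the first hyphen only.
        parts = title.get_text().split('-', 1)
        data = {
            'rank': rank.get_text().strip(),
            'singer': parts[0].strip(),
            'song': parts[1].strip(),
            'time': duration.get_text().strip()
        }
        print(data)


if __name__ == '__main__':
    # 500 songs at 22 per page means pages 1 to 23.
    urls = ['http://www.kugou.com/yy/rank/home/{}-8888.html'.format(i) for i in range(1, 24)]
    for url in urls:
        get_info(url)
        time.sleep(1)  # pause one second between requests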
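
The singer and song fields in the program both come from a single title string that is split on a hyphen. The sketch below isolates that step so it can be tried without any network access. It assumes titles take the form `Singer - Song`; `split_title` is a hypothetical helper, not part of the original program, and the sample strings are made up for illustration.

```python
def split_title(text):
    """Split 'Singer - Song' into (singer, song) on the first hyphen."""
    singer, sep, song = text.partition('-')
    if not sep:
        # No hyphen found: treat the whole string as the song name.
        return '', text.strip()
    return singer.strip(), song.strip()


print(split_title('Singer A - Song B'))
print(split_title('Singer A - Song-B'))
print(split_title('Untitled'))
```

Splitting only on the first hyphen keeps song names that themselves contain a hyphen intact, and the fallback branch avoids an error when a title has no hyphen at all.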