Python Crawling from Basics to Depth, Part 9: Targeted Crawling of Stock Data and Saving It to a Local File
阿新 · Published: 2018-12-30
Technical stack: the requests, bs4, and re libraries used together
Goal: obtain the names and trading information of all stocks listed on the Shanghai and Shenzhen stock exchanges
Output: saved to a local file
Candidate data sources include Sina Finance and Baidu Stock. Inspecting the page source shows that Sina's stock data is generated by JavaScript, so it cannot be parsed with this approach.
In short, a site that requests + bs4 + re can crawl is one whose information exists statically in the HTML page, is not generated by JS code, and is not restricted by a robots protocol.
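The distinction can be seen directly: BeautifulSoup only parses the HTML the server sends and never runs scripts, so JS-injected values are simply absent. A minimal sketch with two made-up page fragments (both hypothetical, just illustrating the two cases):

```python
from bs4 import BeautifulSoup

# A toy page where the price is filled in by a script, as on Sina Finance:
# the value never appears in the raw HTML that requests would download.
js_page = """
<div id="price"></div>
<script>document.getElementById('price').textContent = '10.52';</script>
"""
# A toy static page, as on Baidu Stock: the value sits right in the HTML.
static_page = '<div id="price">10.52</div>'

js_price = BeautifulSoup(js_page, 'html.parser').find('div', id='price').text
static_price = BeautifulSoup(static_page, 'html.parser').find('div', id='price').text

print(repr(js_price))      # bs4 does not execute JavaScript, so this is empty
print(repr(static_price))  # the static value is parseable
```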
So the final data sources are: East Money (东方财富网) + Baidu Stock
East Money:
Baidu Stock:
Program structure:
1. Get the stock list from East Money
2. For each stock in the list, fetch its details from Baidu Stock
3. Save the results to a file
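Step 1 hinges on one regular expression: every stock link on the East Money list page contains a code like sh600000 or sz000001, which the pattern `[s][hz]\d{6}` picks out. A small self-contained sketch (the sample hrefs are made up, in the style of the list page):

```python
import re

# Pull sh/sz stock codes out of <a href="..."> links.
hrefs = [
    "http://quote.eastmoney.com/sh600000.html",
    "http://quote.eastmoney.com/sz000001.html",
    "http://quote.eastmoney.com/news/",          # no code -> skipped
]
codes = []
for href in hrefs:
    m = re.findall(r"[s][hz]\d{6}", href)   # 'sh' or 'sz' followed by 6 digits
    if m:
        codes.append(m[0])

print(codes)  # ['sh600000', 'sz000001']
```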
Wrap each step in a function and write the code:
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url, code='utf-8'):
    """Fetch a page, returning its text or '' on any failure."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    """Collect sh/sz stock codes from the East Money list page into lst."""
    # The East Money list page is GB2312-encoded
    html = getHTMLText(stockURL, 'GB2312')
    soup = BeautifulSoup(html, 'html.parser')
    for a in soup.find_all('a'):
        try:
            href = a.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue  # link without a stock code

def getStockInfo(lst, stockURL, fpath):
    """Fetch each stock's Baidu Stock page and append its info to fpath."""
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})  # stock name
            # The page lays out fields as paired <dt>/<dd> elements
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                infoDict[keyList[i].text] = valueList[i].text
            # One dict per line, appended so earlier results are kept
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except:
            traceback.print_exc()
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'C:/Users/kfc/Desktop/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
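Since each line of the output file is written as str(dict), the file can be read back with ast.literal_eval. A hedged sketch using a temp file and a made-up record rather than the real desktop path:

```python
import ast
import os
import tempfile

# A hypothetical record in the same shape the crawler writes.
record = {'股票名稱': '浦發銀行', '今開': '10.50', '成交量': '26.31萬手'}
path = os.path.join(tempfile.gettempdir(), 'BaiduStockInfo.txt')

# Write it the same way getStockInfo does: one str(dict) per line.
with open(path, 'w', encoding='utf-8') as f:
    f.write(str(record) + '\n')

# Read it back: ast.literal_eval safely parses the dict literal.
with open(path, encoding='utf-8') as f:
    rows = [ast.literal_eval(line) for line in f if line.strip()]

print(rows[0]['股票名稱'])  # 浦發銀行
```

Writing str(dict) keeps the tutorial simple; for interoperability with other tools, json.dumps per line (JSON Lines) would be the more conventional choice.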