Python Web Crawling and Information Extraction - Day 14 - (Example) A Targeted Stock Data Crawler
阿新 • Published: 2019-01-03
Functional Description
Goal: obtain the names and trading information of every stock listed on the Shanghai and Shenzhen stock exchanges
Stock data is a fundamental input for quantitative trading, so this crawler also demonstrates a way to obtain such base data
Output: saved to a file
Technical route: requests-bs4-re
Choosing a Candidate Data Site
Baidu Stock: https://gupiao.baidu.com/stock/
Selection principle: the stock information must exist statically in the HTML page, not be generated by JavaScript
No restrictions imposed by the site's Robots protocol (robots.txt)
Selection method: browser F12 developer tools, inspecting the page source, etc.
Selection mindset: don't fixate on any single site; try several information sources
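A quick way to apply the selection principle above is to fetch the raw page source (requests does not execute JavaScript) and check whether the data you need is already present. A minimal sketch; the URL and keyword are placeholders for whatever site and data you happen to be evaluating:

```python
import requests

def contains_data(html, keyword):
    # Core check: is the target data present in the raw source?
    return keyword in html

def is_static(url, keyword):
    """Fetch the raw HTML and test whether 'keyword' (a fragment of the
    data you want) appears without any JavaScript being executed."""
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return contains_data(r.text, keyword)
    except Exception:
        return False
```

If the keyword is missing from the raw source, the page is probably rendered by JavaScript and a plain requests-bs4 crawler will not see the data.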
Settling on the Data Sites
The stock codes visible on Sina's stock pages are absent from the page source, which strongly suggests they are generated by a JavaScript script; on Baidu Stock, each individual stock's information is written directly into the HTML
Of the two sites, therefore, Baidu Stock is the better data source for a targeted crawler
Getting the stock list:
Eastmoney: http://quote.eastmoney.com/stocklist.html
Getting individual stock information:
Baidu Stock: https://gupiao.baidu.com/stock/
Program Structure
Step 1: obtain the list of stocks from Eastmoney
Step 2: for each stock in the list, fetch its details from Baidu Stock
Step 3: save the results to a file
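Step 1 depends on recognizing stock codes inside the href attributes of links on the list page: sh or sz followed by six digits, which the regex r"[s][hz]\d{6}" captures. A small sketch with made-up hrefs:

```python
import re

# Made-up href values in the style of the Eastmoney list page
hrefs = [
    "http://quote.eastmoney.com/sh600000.html",
    "http://quote.eastmoney.com/sz000001.html",
    "http://quote.eastmoney.com/center/list.html",  # not a stock link
]

codes = []
for href in hrefs:
    found = re.findall(r"[s][hz]\d{6}", href)
    if found:
        codes.append(found[0])

print(codes)  # -> ['sh600000', 'sz000001']
```

Links that carry no code simply produce an empty match list and are skipped, which is why the real crawler can loop over every <a> tag indiscriminately.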
How individual stock information is organized in the Baidu Stock page source
It comes as key-value pairs, so a dictionary is the natural type to hold it
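On a Baidu Stock detail page each field label sits in a <dt> tag with its value in the paired <dd> tag, so the fields map naturally onto dict keys and values. A minimal sketch over a made-up fragment of that structure:

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the stock-bets block on a detail page
html = """
<div class="stock-bets">
  <dl><dt>今開</dt><dd>31.02</dd></dl>
  <dl><dt>成交量</dt><dd>10.44萬手</dd></dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
stock_info = soup.find("div", attrs={"class": "stock-bets"})
keys = stock_info.find_all("dt")
vals = stock_info.find_all("dd")
info_dict = {k.text: v.text for k, v in zip(keys, vals)}
print(info_dict)  # -> {'今開': '31.02', '成交量': '10.44萬手'}
```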
Writing the Example
The traceback library is used to make debugging easier
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    # Extract stock codes such as sh600000 / sz000001 from the
    # href attributes of the <a> tags on the list page
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})
            # Field labels live in <dt> tags, values in the paired <dd> tags
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            # Append each record as one line of str(dict)
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except:
            traceback.print_exc()
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
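Each record is written as str() of a dict, one per line, via f.write(str(infoDict) + '\n'). If the data needs to be reused later, ast.literal_eval from the standard library can turn such a line back into a dict. A sketch with a hypothetical record:

```python
import ast

# A hypothetical line in the style written by the crawler
line = "{'股票名稱': '浦發銀行', '今開': '31.02'}"

record = ast.literal_eval(line.strip())
print(record['今開'])  # -> 31.02
```

ast.literal_eval only evaluates Python literals, so it is a safe way to parse these lines, unlike eval().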
Optimizing the Example
How can the user experience be improved?
Speed: optimize encoding detection
r.apparent_encoding has to statistically analyze the response text, which is slow; determine the encoding manually once and hard-code it instead
Experience: add a dynamic progress display
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url, code="utf-8"):
    # Taking the encoding as a parameter skips the slow
    # apparent_encoding analysis on every request
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    # The Eastmoney list page's encoding was determined manually
    # once in advance: GB2312
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
            count = count + 1
            # '\r' moves the cursor back to the line start so each
            # update overwrites the previous one
            print("\r當前進度: {:.2f}%".format(count * 100 / len(lst)), end="")
        except:
            count = count + 1
            print("\r當前進度: {:.2f}%".format(count * 100 / len(lst)), end="")
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
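The progress display in the code above relies on the carriage return '\r': it moves the cursor back to the start of the line, and end="" stops print from emitting a newline, so each update overwrites the previous one instead of scrolling. Isolated as a small sketch:

```python
def show_progress(count, total):
    # '\r' returns the cursor to the start of the line; end="" keeps
    # print from adding a newline, so updates overwrite in place.
    percent = count * 100 / total
    print("\r當前進度: {:.2f}%".format(percent), end="")
    return percent

for i in range(1, 5):
    show_progress(i, 4)
print()  # final newline once the loop is done
```

On a terminal this renders as a single line counting up to 100.00%; when output is redirected to a file, the '\r' characters are written literally, which is why this trick is only a console nicety.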
Summary
Implemented stock information crawling and storage using the requests-bs4-re route
Implemented a dynamic progress display for the crawl