
Python Web Crawling and Information Extraction - Day 14 - (Example) A Directed Stock Data Crawler

Functionality

Goal: obtain the names and trading information of all stocks listed on the Shanghai and Shenzhen stock exchanges.

Stock data is foundational data for quantitative trading; this crawler therefore also provides a way to gather basic data for quantitative trading.

Output: saved to a file.

Technical stack: requests-bs4-re.

Choosing candidate data sites

Baidu Stock: https://gupiao.baidu.com/stock/

Selection criteria: the stock information should exist statically in the HTML page, not generated by JavaScript code,

and the site should impose no robots.txt restrictions.

How to check: the browser's F12 developer tools, viewing the page source, and so on (see the sketch after this list).

Mindset: don't fixate on one particular site; look for and try multiple information sources.
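As a minimal sketch of that check (the specific stock code sh600000 is an illustrative assumption, not from the original lesson): fetch the page with requests and test whether a value you can see in the browser actually appears in the raw HTML.

import requests

# If data visible in the browser is present in the raw HTML, the page is
# static enough for a requests+bs4 crawler; if not, the content is probably
# rendered by JavaScript.
r = requests.get('http://quote.eastmoney.com/stocklist.html')
r.raise_for_status()
print('sh600000' in r.text)  # True hints at static content; False hints at JS rendering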

Settling on the data sites

The stock codes visible on Sina Stocks pages do not appear in the page source, which strongly suggests they are generated by JavaScript. On Baidu Stock, by contrast, each individual stock's information is written directly into the HTML.

Of these two sites, Baidu Stock is therefore the better data source for a directed crawler.

Getting the stock list:

East Money: http://quote.eastmoney.com/stocklist.html

Getting individual stock information:

Baidu Stock: https://gupiao.baidu.com/stock/

Program structure

Step 1: fetch the stock list from East Money.

Step 2: for each stock in the list, fetch its details from Baidu Stock.

Step 3: save the results to a file.

 

How individual stock information is organized in the Baidu Stock page source

The information comes as key-value pairs, so a dictionary is the natural type to hold each record (a hypothetical example follows).
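A hypothetical example of one record and the line it produces in the output file (the keys beyond '股票名稱' and all values here are illustrative, not real data from the site):

# Hypothetical record built from one detail page; the name comes from the
# 'bets-name' element, the remaining keys from <dt> tags and values from
# the matching <dd> tags (all values below are made up):
infoDict = {'股票名稱': 'XX股份', '今開': '10.00', '成交量': '10.5萬手'}
# Each record is written to the file as its str() form, one per line:
# {'股票名稱': 'XX股份', '今開': '10.00', '成交量': '10.5萬手'}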

Writing the example

For easier debugging, the traceback module is used, as the short demo below shows.
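Inside a bare except, traceback.print_exc() prints the full stack trace without stopping the program, so a scraping loop can log what went wrong and keep going:

import traceback

try:
    1 / 0                     # stand-in for any scraping step that may fail
except:
    traceback.print_exc()     # full stack trace, but execution continues
print("still running")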

import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url):
    """Fetch a URL and return its text, or "" on any failure."""
    try:
        r = requests.get(url)
        r.raise_for_status()                  # raise on HTTP error codes
        r.encoding = r.apparent_encoding      # guess encoding from the body
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    """Collect stock codes such as sh600000 from the list page into lst."""
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')                    # codes live in <a href=...> links
    for i in a:
        try:
            href = i.attrs['href']
            # sh or sz followed by exactly six digits, e.g. sh600000
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue                          # skip links without a stock code

def getStockInfo(lst, stockURL, fpath):
    """Fetch each stock's detail page and append its fields to fpath."""
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})

            # the remaining fields come as <dt>key</dt><dd>value</dd> pairs
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except:
            traceback.print_exc()             # report the error, keep going
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
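A quick sanity check of the regular expression used in getStockList; the hrefs are hypothetical examples in the style of the East Money list page:

import re

hrefs = ['http://quote.eastmoney.com/sh600000.html',   # hypothetical links
         'http://quote.eastmoney.com/sz000001.html',
         'http://quote.eastmoney.com/help.html']
for href in hrefs:
    # sh or sz followed by exactly six digits
    print(re.findall(r"[s][hz]\d{6}", href))
# -> ['sh600000'], ['sz000001'], []  (links without a code are skipped)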
 


Optimizing the example

How can we improve the user experience?

Faster execution: optimizing encoding detection.

r.apparent_encoding has to analyze the whole response body to guess the encoding, which is slow when repeated for every page; instead, determine the encoding manually once and pass it in directly, as in the sketch below.
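A one-off way to discover the right encoding during development, so it can be hard-coded afterwards (the optimized code below uses GB2312 for the East Money list page):

import requests

# Run once while developing; do NOT leave this in the crawl loop.
r = requests.get('http://quote.eastmoney.com/stocklist.html')
print(r.apparent_encoding)   # e.g. 'GB2312'; hard-code this in getHTMLText()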

Better experience: add a dynamic progress display (a minimal standalone demo follows).
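A minimal sketch of the technique, independent of the crawler: '\r' moves the cursor back to the start of the line and end="" suppresses the newline, so each print overwrites the previous one in place.

import time

for count in range(1, 101):
    print("\r當前進度: {:.2f}%".format(count), end="")  # overwrite in place
    time.sleep(0.02)                                    # stand-in for real work
print()                                                 # final newline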

import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url, code="utf-8"):
    """Fetch a URL with a known encoding, skipping apparent_encoding."""
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code                     # hard-coded: no body analysis
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    html = getHTMLText(stockURL, "GB2312")    # the list page is GB2312-encoded
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    count = 0                                 # pages processed so far
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})

            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})

            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
                count = count + 1
                # \r rewinds to the line start so the percentage updates in place
                print("\r當前進度: {:.2f}%".format(count * 100 / len(lst)), end="")
        except:
            count = count + 1
            print("\r當前進度: {:.2f}%".format(count * 100 / len(lst)), end="")
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()


Summary

Implemented stock-information crawling and storage using the requests-bs4-re stack.

Implemented a dynamic progress display that shows the crawl's completion percentage.