1. 程式人生 > >Python網路爬蟲之製作股票資料定向爬蟲 以及爬取的優化 可以顯示進度條!

Python網路爬蟲之製作股票資料定向爬蟲 以及爬取的優化 可以顯示進度條!

候選網站:
新浪股票:http://finance.sina.com.cn/stock/
百度股票:https://gupiao.baidu.com/stock/

選取原則:

  1. 無robots協議
  2. 非js網頁
  3. 資料在HTMLK頁面中的

F12,檢視原始碼,即可檢視。

新浪股票,使用JS製作。指令碼生成的資料。

百度股票可以在HTML中查詢到!

http://quote.eastmoney.com/stocklist.html

這個地址可以查詢股票詳細列表!
在這裡插入圖片描述

程式思路:

1. 獲取股票列表
2. 根據列表資訊到百度獲取個股資訊2,根據列表資訊到百度獲取個股資訊
3. 將結果儲存

考慮用字典作為資料容器進行儲存!

火狐瀏覽器可以檢視原始碼,藍色的IE瀏覽器就會出現亂碼:
火狐的:
在這裡插入圖片描述
因為a標籤,太多所以正則表示式匹配比較困難。
可用try except來解決!
在這裡插入圖片描述

[s]:表示s。[hz]:表示h z。後面是隨意6個數。
SH:
在這裡插入圖片描述
SZ:
在這裡插入圖片描述

優化:
在這裡插入圖片描述

  1. r.encoding:僅從頭部獲得
  2. r.apparent_encoding:是從全文獲得的。r.apparent_encoding:是從全文獲得的。

優化就是將編碼直接給程式碼,另外一個就是顯示進度。

下面就是程式碼部分啦:

最初的程式碼:(真長)

import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url):
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}",href)[0])
        except:
            continue
    
def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class':'stock-bets'})
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名稱':name.text.split()[0]})
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valurList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except:
            traceback.print_exc()
            continue
def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:\234.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()

程式碼執行結果;
在這裡插入圖片描述

優化後的程式碼:

import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url,code='utf-8'):#預設的是utf-8
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()
        r.encoding = code#直接賦值
        return r.text
    except:
        return ""
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL,'GB2312')#已經查詢過啦!
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}",href)[0])
        except:
            continue
    
def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class':'stock-bets'})
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名稱':name.text.split()[0]})
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valurList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
                count = count +1
                print('\r當前速度:{:.2f}%'.format(count*100/len(lst)),end=' ')
        except:
            count = count +1
            print('\r當前速度:{:.2f}%'.format(count*100/len(lst)),end=' ')
            traceback.print_exc()
            continue
def main():
    stock_list_url = 'https://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:\234.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()

提前給出了編碼方式以及可以顯示進度條的程式碼
給出編碼方式的程式碼:

def getHTMLText(url,code='utf-8'):#預設的是utf-8
    try:
        r = requests.get(url, timeout = 30)
        r.raise_for_status()
        r.encoding = code#直接賦值
        return r.text
    except:
        return ""
def getStockList(lst, stockURL):
    html = getHTMLText(stockURL,'GB2312')#已經查詢過啦!
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}",href)[0])
        except:
            continue

照片:
(如果不是utf-8,就要提前給替換掉!)
在這裡插入圖片描述
可以顯示進度條的程式碼

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class':'stock-bets'})
            name = stockInfo.find_all(attrs={'class':'bets-name'})[0]
            infoDict.update({'股票名稱':name.text.split()[0]})
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valurList[i].text
                infoDict[key] = val

            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
                count = count +1
                print('\r當前速度:{:.2f}%'.format(count*100/len(lst)),end=' ')
        except:
            count = count +1
            print('\r當前速度:{:.2f}%'.format(count*100/len(lst)),end=' ')
            traceback.print_exc()
            continue

照片:
在這裡插入圖片描述不過,顯示進度在IDLE那裡不可以顯示。
但是最後我也沒成功有檔案生成以及顯示進度條,算啦。先去吃飯啦~