Python Web Crawling and Information Extraction - Day 14 - (Example) A Targeted Stock Data Crawler
阿新 • Published: 2019-01-03
Functional Description
Goal: obtain the names and trading information of every stock listed on the Shanghai and Shenzhen stock exchanges
Stock data is a fundamental input for quantitative trading, so this crawler also demonstrates a way to obtain such base data
Output: saved to a file
Technical route: requests-bs4-re
Choosing a Candidate Data Site
Baidu Stock: https://gupiao.baidu.com/stock/
Selection principle: the stock information must exist statically in the HTML page, not be generated by JavaScript
No restrictions imposed by the site's Robots protocol (robots.txt)
Selection method: browser F12 developer tools, inspecting the page source, etc.
Selection mindset: don't fixate on any single site; try several information sources
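A quick way to apply the selection principle above is to fetch the raw page source (requests does not execute JavaScript) and check whether the data you need is already present. A minimal sketch; the URL and keyword are placeholders for whatever site and data you happen to be evaluating:

```python
import requests

def contains_data(html, keyword):
    # Core check: is the target data present in the raw source?
    return keyword in html

def is_static(url, keyword):
    """Fetch the raw HTML and test whether 'keyword' (a fragment of the
    data you want) appears without any JavaScript being executed."""
    try:
        r = requests.get(url, timeout=10)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return contains_data(r.text, keyword)
    except Exception:
        return False
```

If the keyword is missing from the raw source, the page is probably rendered by JavaScript and a plain requests-bs4 crawler will not see the data.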
Settling on the Data Sites
The stock codes visible on Sina's stock pages are absent from the page source, which strongly suggests they are generated by a JavaScript script; on Baidu Stock, each individual stock's information is written directly into the HTML
Of the two sites, therefore, Baidu Stock is the better data source for a targeted crawler
Getting the stock list:
Eastmoney: http://quote.eastmoney.com/stocklist.html
Getting individual stock information:
Baidu Stock: https://gupiao.baidu.com/stock/
Program Structure
Step 1: obtain the list of stocks from Eastmoney
Step 2: for each stock in the list, fetch its details from Baidu Stock
Step 3: save the results to a file
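Step 1 depends on recognizing stock codes inside the href attributes of links on the list page: sh or sz followed by six digits, which the regex r"[s][hz]\d{6}" captures. A small sketch with made-up hrefs:

```python
import re

# Made-up href values in the style of the Eastmoney list page
hrefs = [
    "http://quote.eastmoney.com/sh600000.html",
    "http://quote.eastmoney.com/sz000001.html",
    "http://quote.eastmoney.com/center/list.html",  # not a stock link
]

codes = []
for href in hrefs:
    found = re.findall(r"[s][hz]\d{6}", href)
    if found:
        codes.append(found[0])

print(codes)  # -> ['sh600000', 'sz000001']
```

Links that carry no code simply produce an empty match list and are skipped, which is why the real crawler can loop over every <a> tag indiscriminately.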
How individual stock information is organized in the Baidu Stock page source
It comes as key-value pairs, so a dictionary is the natural type to hold it
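On a Baidu Stock detail page each field label sits in a <dt> tag with its value in the paired <dd> tag, so the fields map naturally onto dict keys and values. A minimal sketch over a made-up fragment of that structure:

```python
from bs4 import BeautifulSoup

# Made-up fragment imitating the stock-bets block on a detail page
html = """
<div class="stock-bets">
  <dl><dt>今開</dt><dd>31.02</dd></dl>
  <dl><dt>成交量</dt><dd>10.44萬手</dd></dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
stock_info = soup.find("div", attrs={"class": "stock-bets"})
keys = stock_info.find_all("dt")
vals = stock_info.find_all("dd")
info_dict = {k.text: v.text for k, v in zip(keys, vals)}
print(info_dict)  # -> {'今開': '31.02', '成交量': '10.44萬手'}
```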
Writing the Example
The traceback library is used to make debugging easier
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url):
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    # Extract stock codes such as sh600000 / sz000001 from the
    # href attributes of the <a> tags on the list page
    html = getHTMLText(stockURL)
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})
            # Field labels live in <dt> tags, values in the paired <dd> tags
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            # Append each record as one line of str(dict)
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
        except:
            traceback.print_exc()
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
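Each record is written as str() of a dict, one per line, via f.write(str(infoDict) + '\n'). If the data needs to be reused later, ast.literal_eval from the standard library can turn such a line back into a dict. A sketch with a hypothetical record:

```python
import ast

# A hypothetical line in the style written by the crawler
line = "{'股票名稱': '浦發銀行', '今開': '31.02'}"

record = ast.literal_eval(line.strip())
print(record['今開'])  # -> 31.02
```

ast.literal_eval only evaluates Python literals, so it is a safe way to parse these lines, unlike eval().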
Optimizing the Example
How can the user experience be improved?
Speed: optimize encoding detection
r.apparent_encoding has to statistically analyze the response text, which is slow; determine the encoding manually once and hard-code it instead
Experience: add a dynamic progress display
import requests
from bs4 import BeautifulSoup
import traceback
import re

def getHTMLText(url, code="utf-8"):
    # Taking the encoding as a parameter skips the slow
    # apparent_encoding analysis on every request
    try:
        r = requests.get(url)
        r.raise_for_status()
        r.encoding = code
        return r.text
    except:
        return ""

def getStockList(lst, stockURL):
    # The Eastmoney list page's encoding was determined manually
    # once in advance: GB2312
    html = getHTMLText(stockURL, "GB2312")
    soup = BeautifulSoup(html, 'html.parser')
    a = soup.find_all('a')
    for i in a:
        try:
            href = i.attrs['href']
            lst.append(re.findall(r"[s][hz]\d{6}", href)[0])
        except:
            continue

def getStockInfo(lst, stockURL, fpath):
    count = 0
    for stock in lst:
        url = stockURL + stock + ".html"
        html = getHTMLText(url)
        try:
            if html == "":
                continue
            infoDict = {}
            soup = BeautifulSoup(html, 'html.parser')
            stockInfo = soup.find('div', attrs={'class': 'stock-bets'})
            name = stockInfo.find_all(attrs={'class': 'bets-name'})[0]
            infoDict.update({'股票名稱': name.text.split()[0]})
            keyList = stockInfo.find_all('dt')
            valueList = stockInfo.find_all('dd')
            for i in range(len(keyList)):
                key = keyList[i].text
                val = valueList[i].text
                infoDict[key] = val
            with open(fpath, 'a', encoding='utf-8') as f:
                f.write(str(infoDict) + '\n')
            count = count + 1
            # '\r' moves the cursor back to the line start so each
            # update overwrites the previous one
            print("\r當前進度: {:.2f}%".format(count * 100 / len(lst)), end="")
        except:
            count = count + 1
            print("\r當前進度: {:.2f}%".format(count * 100 / len(lst)), end="")
            continue

def main():
    stock_list_url = 'http://quote.eastmoney.com/stocklist.html'
    stock_info_url = 'https://gupiao.baidu.com/stock/'
    output_file = 'D:/BaiduStockInfo.txt'
    slist = []
    getStockList(slist, stock_list_url)
    getStockInfo(slist, stock_info_url, output_file)

main()
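The progress display in the code above relies on the carriage return '\r': it moves the cursor back to the start of the line, and end="" stops print from emitting a newline, so each update overwrites the previous one instead of scrolling. Isolated as a small sketch:

```python
def show_progress(count, total):
    # '\r' returns the cursor to the start of the line; end="" keeps
    # print from adding a newline, so updates overwrite in place.
    percent = count * 100 / total
    print("\r當前進度: {:.2f}%".format(percent), end="")
    return percent

for i in range(1, 5):
    show_progress(i, 4)
print()  # final newline once the loop is done
```

On a terminal this renders as a single line counting up to 100.00%; when output is redirected to a file, the '\r' characters are written literally, which is why this trick is only a console nicety.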
Summary
Implemented stock information crawling and storage using the requests-bs4-re route
Implemented a dynamic progress display for the crawl