Hands-on example: Extracting Yahoo Finance data with Python for analysis and visualization

Step 1: Getting the data

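Stock market data is available from Yahoo! Finance, Google Finance, and, within China, Sina Finance, among other sources. The pandas package also provided an easy way to fetch data from these sites.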

import pandas as pd            # `as` gives the package or module an alias
import pandas.io.data as web   # import the module; its location may change between versions
import datetime

ImportError: The pandas.io.data module is moved to a separate package (pandas-datareader). 
After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader),
 you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.

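The pandas.io.data module has been moved into the separate pandas-datareader package, which is why the code above raises this error.

To fix it, install the pandas-datareader package by running the following in a command prompt (cmd):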

pip install pandas_datareader

The updated code is:

import pandas as pd
import pandas_datareader.data as web   
import datetime
start = datetime.datetime(2016,1,1)
end = datetime.date.today()
apple = web.DataReader("AAPL", "yahoo", start, end)

# web.DataReader("AAPL", "yahoo", start, end) raises the error shown below
# Let's get Apple stock data; Apple's ticker symbol is AAPL
# First argument is the series we want, second is the source ("yahoo" for Yahoo! Finance), third is the start date, fourth is the end date

This produces the following output:
ImmediateDeprecationError: 
Yahoo Daily has been immediately deprecated due to large breaks in the API without the
introduction of a stable replacement. Pull Requests to re-enable these data
connectors are welcome.

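As the message itself states, the error occurs because pandas-datareader deprecated its Yahoo daily connector after breaking changes in Yahoo's API. To work around it, we change the code once more and bring in another module, fix_yahoo_finance (since renamed yfinance), again installed with pip from the command prompt.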

pip install fix_yahoo_finance

The code then becomes:


import pandas_datareader.data as web
import datetime
import fix_yahoo_finance as yf
yf.pdr_override()
 
start=datetime.datetime(2006, 1, 1)
end=datetime.datetime(2012, 1, 1)
apple=web.get_data_yahoo('AAPL',start,end)
apple
	
apple.head()

Or:
start=datetime.datetime(2017, 1, 1)
end=datetime.datetime.today()
apple=web.get_data_yahoo('AAPL',start,end)
apple
Out[21]: 
                  Open        High         Low       Close   Adj Close  \
Date                                                                     
2017-01-03  115.800003  116.330002  114.760002  116.150002  113.013916   
2017-01-04  115.849998  116.510002  115.750000  116.019997  112.887413   
2017-01-05  115.919998  116.860001  115.809998  116.610001  113.461502   
2017-01-06  116.779999  118.160004  116.470001  117.910004  114.726402   
2017-01-09  117.949997  119.430000  117.940002  118.989998  115.777237   
2017-01-10  118.769997  119.379997  118.300003  119.110001  115.893997   

Step 2: Visualizing the extracted data

import matplotlib.pyplot as plt   # Import matplotlib
# This line is necessary for the plot to appear in a Jupyter notebook
%matplotlib inline
 
# Control the default size of figures in this Jupyter notebook
%pylab inline
 
pylab.rcParams['figure.figsize'] = (15, 9)   # Change the size of plots
 
apple["Adj Close"].plot(grid = True) # Plot the adjusted closing price of AAPL

Step 3: Plotting a candlestick chart

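A line chart works, but each day has at least four variables (open, high, low, and close), and we would like a visualization that shows all four without drawing four separate lines. Financial data is usually visualized with a candlestick chart (also called a Japanese candlestick chart), which was first used by Japanese rice traders in the 18th century. Such a chart can be drawn with matplotlib, though it takes some effort.

You can draw candlestick charts more easily with a function I implemented, which accepts a pandas DataFrame as its data source. (The code is based on this example, and you can find documentation for the functions involved here.)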

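import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.dates import DateFormatter, WeekdayLocator,\
    DayLocator, MONDAY, date2num   # date2num is used below when building the candlesticks
try:
    from matplotlib.finance import candlestick_ohlc   # matplotlib < 2.2
except ImportError:
    # matplotlib.finance was removed in matplotlib 2.2; the function now lives in the separate mplfinance package
    from mplfinance.original_flavor import candlestick_ohlc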
 
def pandas_candlestick_ohlc(dat, stick = "day", otherseries = None):
    """
    :param dat: pandas DataFrame object with datetime64 index, and float columns "Open", "High", "Low", and "Close", likely created via DataReader from "yahoo"
    :param stick: A string or number indicating the period of time covered by a single candlestick. Valid string inputs include "day", "week", "month", and "year", ("day" default), and any numeric input indicates the number of trading days included in a period
    :param otherseries: An iterable that will be coerced into a list, containing the columns of dat that hold other series to be plotted as lines
 
    This will show a Japanese candlestick plot for stock data stored in dat, also plotting other series if passed.
    """
    mondays = WeekdayLocator(MONDAY)        # major ticks on the mondays
    alldays = DayLocator()              # minor ticks on the days
    dayFormatter = DateFormatter('%d')      # e.g., 12
 
    # Create a new DataFrame which includes OHLC data for each period specified by stick input
    transdat = dat.loc[:,["Open", "High", "Low", "Close"]]
    if (type(stick) == str):
        if stick == "day":
            plotdat = transdat
            stick = 1 # Used for plotting
        elif stick in ["week", "month", "year"]:
            if stick == "week":
                transdat["week"] = pd.to_datetime(transdat.index).map(lambda x: x.isocalendar()[1]) # Identify weeks
            elif stick == "month":
                transdat["month"] = pd.to_datetime(transdat.index).map(lambda x: x.month) # Identify months
            transdat["year"] = pd.to_datetime(transdat.index).map(lambda x: x.isocalendar()[0]) # Identify years
            group_keys = ["year"] if stick == "year" else ["year", stick] # Year first, so groups come out in chronological order
            grouped = transdat.groupby(group_keys) # Group by year and other appropriate variable
            rows = [] # One single-row DataFrame per period; together they hold what will be plotted
            for name, group in grouped:
                rows.append(pd.DataFrame({"Open": group.iloc[0,0],
                                          "High": max(group.High),
                                          "Low": min(group.Low),
                                          "Close": group.iloc[-1,3]},
                                         index = [group.index[0]]))
            plotdat = pd.concat(rows) # DataFrame.append was removed in pandas 2.0
            if stick == "week": stick = 5
            elif stick == "month": stick = 30
            elif stick == "year": stick = 365
 
    elif (type(stick) == int and stick >= 1):
        transdat["stick"] = [np.floor(i / stick) for i in range(len(transdat.index))]
        grouped = transdat.groupby("stick")
        rows = [] # One single-row DataFrame per group of `stick` trading days
        for name, group in grouped:
            rows.append(pd.DataFrame({"Open": group.iloc[0,0],
                                      "High": max(group.High),
                                      "Low": min(group.Low),
                                      "Close": group.iloc[-1,3]},
                                     index = [group.index[0]]))
        plotdat = pd.concat(rows) # DataFrame.append was removed in pandas 2.0
 
    else:
        raise ValueError('Valid inputs to argument "stick" include the strings "day", "week", "month", "year", or a positive integer')
 
 
    # Set plot parameters, including the axis object ax used for plotting
    fig, ax = plt.subplots()
    fig.subplots_adjust(bottom=0.2)
    if plotdat.index[-1] - plotdat.index[0] < pd.Timedelta('730 days'):
        weekFormatter = DateFormatter('%b %d')  # e.g., Jan 12
        ax.xaxis.set_major_locator(mondays)
        ax.xaxis.set_minor_locator(alldays)
    else:
        weekFormatter = DateFormatter('%b %d, %Y')
    ax.xaxis.set_major_formatter(weekFormatter)
 
    ax.grid(True)
 
    # Create the candlestick chart
    candlestick_ohlc(ax, list(zip(list(date2num(plotdat.index.tolist())), plotdat["Open"].tolist(), plotdat["High"].tolist(),
                      plotdat["Low"].tolist(), plotdat["Close"].tolist())),
                      colorup = "black", colordown = "red", width = stick * .4)
 
    # Plot other series (such as moving averages) as lines
    if otherseries != None:
        if type(otherseries) != list:
            otherseries = [otherseries]
        dat.loc[:,otherseries].plot(ax = ax, lw = 1.3, grid = True)
 
    ax.xaxis_date()
    ax.autoscale_view()
    plt.setp(plt.gca().get_xticklabels(), rotation=45, horizontalalignment='right')
 
    plt.show()
 
pandas_candlestick_ohlc(apple)

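In the candlestick chart, a black candle means the closing price was higher than the opening price that trading day (a gain), and a red candle means the opening price was higher than the close (a loss). The wicks mark the day's high and low, while the body spans the open and close (the color tells you which end of the body is the open and which is the close). Candlestick charts are widely used in finance and technical analysis, where trading decisions draw on the shape, color, and position of the candles.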
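The weekly and monthly candles that pandas_candlestick_ohlc builds with an explicit loop can also be produced with pandas resampling. The following is only a minimal sketch on made-up prices, not the function's own method: each weekly candle takes the first open, the highest high, the lowest low, and the last close, and the hypothetical `up` column flags candles that would be drawn in black.

```python
import pandas as pd

# Synthetic daily OHLC data for two business weeks (values are made up for illustration)
dates = pd.bdate_range("2017-01-02", periods=10)
daily = pd.DataFrame({
    "Open":  [10, 11, 12, 11, 13, 14, 13, 12, 13, 15],
    "High":  [12, 13, 13, 14, 15, 15, 14, 14, 16, 16],
    "Low":   [ 9, 10, 11, 10, 12, 12, 11, 11, 12, 14],
    "Close": [11, 12, 11, 13, 14, 13, 12, 13, 15, 14],
}, index=dates, dtype=float)

# One candle per calendar week: first open, highest high, lowest low, last close
weekly = daily.resample("W").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last"}
)

# A candle is "up" (black in the chart above) when the close is higher than the open
weekly["up"] = weekly["Close"] > weekly["Open"]
print(weekly)
```

Note that resample labels each candle with the date the week ends on, whereas the function above labels it with the first trading day of the period.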

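Step 4: Plotting several companies' stocks on one chart

A candlestick chart cannot show several stocks at once, of course, so to compare them we draw line charts on a single plot.

Fetch Microsoft's stock prices: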

import pandas_datareader.data as web
import datetime
import fix_yahoo_finance as yf
yf.pdr_override()
 
start=datetime.datetime(2017, 1, 1)
end=datetime.datetime.today()
microsoft=web.get_data_yahoo('MSFT',start,end)
microsoft
Out[1]: 
                 Open       High        Low      Close  Adj Close    Volume
Date                                                                       
2017-01-03  62.790001  62.840000  62.130001  62.580002  60.431488  20694100
2017-01-04  62.480000  62.750000  62.119999  62.299999  60.161095  21340000
2017-01-05  62.189999  62.660000  62.029999  62.299999  60.161095  24876000
2017-01-06  62.299999  63.150002  62.040001  62.840000  60.682560  19922900
2017-01-09  62.759998  63.080002  62.540001  62.639999  60.489429  20256600

Fetch Google's stock prices:

import pandas_datareader.data as web
import datetime
import fix_yahoo_finance as yf
yf.pdr_override()
 
start=datetime.datetime(2017, 1, 1)
end=datetime.datetime.today()
google=web.get_data_yahoo('GOOG',start,end)
google.head()


                  Open        High         Low       Close   Adj Close  \
Date                                                                     
2017-01-03  778.809998  789.630005  775.799988  786.140015  786.140015   
2017-01-04  788.359985  791.340027  783.159973  786.900024  786.900024   
2017-01-05  786.080017  794.479980  785.020020  794.020020  794.020020   
2017-01-06  795.260010  807.900024  792.203979  806.150024  806.150024   
2017-01-09  806.400024  809.966003  802.830017  806.650024  806.650024   

             Volume  
Date                 
2017-01-03  1657300  
2017-01-04  1073000  
2017-01-05  1335200  
2017-01-06  1640200  
2017-01-09  1272400  

Combine the Adj Close values of Apple, Microsoft, and Google from January 1, 2017 to the present:

import pandas_datareader.data as web
import datetime
import fix_yahoo_finance as yf
yf.pdr_override()

start=datetime.datetime(2017, 1, 1)
end=datetime.datetime.today()

apple=web.get_data_yahoo('AAPL',start,end)
microsoft=web.get_data_yahoo('MSFT',start,end)
google=web.get_data_yahoo('GOOG',start,end)

import pandas as pd
stocks = pd.DataFrame({"AAPL": apple["Adj Close"],
                      "MSFT": microsoft["Adj Close"],
                      "GOOG": google["Adj Close"]})   
stocks     # "Adj Close" is short for adjusted close

Out[8]: 
                  AAPL         GOOG        MSFT
Date                                           
2017-01-03  113.013916   786.140015   60.431488
2017-01-04  112.887413   786.900024   60.161095
2017-01-05  113.461502   794.020020   60.161095
2017-01-06  114.726402   806.150024   60.682560
2017-01-09  115.777237   806.650024   60.489429

Plot:

stocks.plot(grid = True)

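What is wrong with this chart? Absolute prices do matter (expensive stocks are harder to buy, which affects not only their volatility but also how easily you can trade them), but in trading we care more about how each stock's price changes than about its level. Google's stock is priced far higher than Apple's or Microsoft's, and that gap makes Apple and Microsoft look much less volatile than they really are.

One solution is to plot with two different scales: one for the Apple and Microsoft data, and another for the Google data.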

stocks.plot(secondary_y = ["AAPL", "MSFT"], grid = True)

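A "better" solution is to visualize the information we actually care about: the stocks' returns. That requires transforming the data, and there are many ways to do so. One is to compare each trading day's price with the price at the start of the period we are interested in, that is:

$$\text{return}_{t,0} = \frac{\text{price}_t}{\text{price}_0}$$

This requires transforming the data in the stocks object, as follows: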

# df.apply(arg) will apply the function arg to each column in df, and return a DataFrame with the result
# Recall that lambda x is an anonymous function accepting parameter x; in this case, x will be a pandas Series object
stock_return = stocks.apply(lambda x: x / x.iloc[0])   # .iloc[0] picks the first row by position; x[0] is deprecated for date-indexed Series
stock_return.head()

Out[11]: 
                AAPL      GOOG      MSFT
Date                                    
2017-01-03  1.000000  1.000000  1.000000
2017-01-04  0.998881  1.000967  0.995526
2017-01-05  1.003960  1.010024  0.995526
2017-01-06  1.015153  1.025453  1.004155
2017-01-09  1.024451  1.026090  1.000959
stock_return.plot(grid = True).axhline(y = 1, color = "black", lw = 2)

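This chart is much more useful. We can now see how much each stock has returned since the date we care about. We can also see that the stocks are highly correlated: they mostly move in the same direction, which is hard to see in the other kinds of chart.

We can also plot the daily change in price. One option is to express the change between day $t$ and the following day $t + 1$ as a fraction of the price on day $t$:

$$\text{growth}_t = \frac{\text{price}_{t + 1} - \text{price}_t}{\text{price}_t}$$

We could also compare the price on day $t$ with the previous day's price:

$$\text{increase}_t = \frac{\text{price}_{t} - \text{price}_{t-1}}{\text{price}_t}$$

These two formulas are not the same and may lead us to different conclusions. Instead, we can express the change in price as a log difference:

$$\text{change}_t = \log(\text{price}_{t}) - \log(\text{price}_{t - 1})$$

(Here $\log$ is the natural logarithm, and our definition does not depend heavily on whether we use $\log(\text{price}_{t}) - \log(\text{price}_{t - 1})$ or $\log(\text{price}_{t+1}) - \log(\text{price}_{t})$.) The advantage of log differences is that they can be interpreted as the percentage change in the stock price, yet they do not depend on the choice of denominator.

The code below shows how to compute and plot the log differences of the stocks: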

# Let's use NumPy's log function, though math's log function would work just as well
import numpy as np 
stock_change = stocks.apply(lambda x: np.log(x) - np.log(x.shift(1))) # shift moves dates back by 1.
stock_change.head()
Out[14]: 
                AAPL      GOOG      MSFT
Date                                    
2017-01-03       NaN       NaN       NaN
2017-01-04 -0.001120  0.000966 -0.004484
2017-01-05  0.005073  0.009007  0.000000
2017-01-06  0.011087  0.015161  0.008630
2017-01-09  0.009118  0.000620 -0.003188
stock_change.plot(grid = True).axhline(y = 0, color = "black", lw = 2)
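To make the comparison between the three definitions concrete, here is a small sketch on made-up prices (the series `price` is hypothetical, not data from this article). It computes growth, increase, and the log difference side by side, and then checks one convenient property of log differences: they add up across days to the log of the total return.

```python
import numpy as np
import pandas as pd

# Made-up prices, for illustration only
price = pd.Series([100.0, 101.0, 99.0, 102.0])

growth = (price.shift(-1) - price) / price        # (price_{t+1} - price_t) / price_t
increase = (price - price.shift(1)) / price       # (price_t - price_{t-1}) / price_t
change = np.log(price) - np.log(price.shift(1))   # log(price_t) - log(price_{t-1})

print(pd.DataFrame({"growth": growth, "increase": increase, "change": change}))

# Log differences are additive: their sum equals the log of the total return
total_log_change = change.sum()   # sum() skips the leading NaN
print(np.isclose(total_log_change, np.log(price.iloc[-1] / price.iloc[0])))
```

For small daily moves the three columns are close to one another, which is why the log difference can be read as an approximate percentage change.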

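Which transformation do you prefer? Returns measured from the start of the period make the overall trend of each security clear. Changes between trading days, on the other hand, are what more advanced methods for forecasting the market rely on, so they should not be ignored.

Moving averages

Charts are very useful. In real life, some traders base their decisions almost entirely on charts (these are the "technicians"; finding patterns in charts and building trading strategies on them is called technical analysis, one of the basic doctrines of trading). Let us now look at how to find trends in stock prices.

A q-day moving average (denoted $MA^q_t$) is defined, for a point in time $t$, as the average over the q most recent days up to and including $t$:

$$MA^q_t = \frac{1}{q} \sum_{i = 0}^{q-1} x_{t - i}$$

A moving average smooths a series and helps us identify trends. The larger q is, the less the moving average responds to short-term fluctuations. Its basic purpose is to pick out the trend from the noise. Fast moving averages have a small q and stay close to the stock price; slow moving averages have a large q, which makes them less sensitive to fluctuations and therefore more stable.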
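The definition can be sketched directly in code. The helper `moving_average` below is a hypothetical name introduced only for illustration; it applies the formula literally to a made-up series and compares the result with pandas' rolling mean.

```python
import numpy as np
import pandas as pd

def moving_average(x: pd.Series, q: int) -> pd.Series:
    """MA^q_t = (1/q) * sum_{i=0}^{q-1} x_{t-i}; NaN until q values are available."""
    values = x.to_numpy(dtype=float)
    out = np.full(len(values), np.nan)
    for t in range(q - 1, len(values)):
        out[t] = values[t - q + 1 : t + 1].mean()   # the q values ending at position t
    return pd.Series(out, index=x.index)

# Made-up prices, for illustration only
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

fast = moving_average(x, 2)   # small q: stays close to the data
slow = moving_average(x, 4)   # large q: smoother, but starts later

# pandas' built-in rolling mean gives the same numbers
print(np.allclose(fast, x.rolling(window=2).mean(), equal_nan=True))
print(slow.tolist())
```

The first q - 1 entries are NaN because the formula needs q observations before it can produce a value.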

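pandas provides functionality for computing moving averages. Below I use it to compute the 20-day (one-month) moving average of Apple's stock price and plot it together with the price.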

import pandas as pd  # without this import you get NameError: name 'pd' is not defined
pandas_candlestick_ohlc(apple)
apple["20d"] = np.round(apple["Close"].rolling(window = 20, center = False).mean(), 2)
pandas_candlestick_ohlc(apple.loc['2017-01-01':'2017-08-07',:], otherseries = "20d")

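Notice how late the moving average starts: we have to wait 20 days before it can be computed. The problem is worse for moving averages over longer windows. Because I want to be able to compute a 200-day moving average, I will extend the range of Apple stock data we fetch, though we will still focus mainly on 2016.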

import pandas_datareader.data as web
import datetime
import fix_yahoo_finance as yf
yf.pdr_override()

start = datetime.datetime(2010,1,1)
apple=web.get_data_yahoo('AAPL',start,end)
apple["20d"] = np.round(apple["Close"].rolling(window = 20, center = False).mean(), 2)
 
pandas_candlestick_ohlc(apple.loc['2016-01-04':'2016-08-07',:], otherseries = "20d")

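You will notice that the moving average is much smoother than the actual price data. It is also a stubborn indicator: a stock's price has to move above or below the average before the moving-average line changes direction. Crossings of the average therefore signal a potential change in trend and deserve attention.

Traders are often interested in several moving averages, such as the 20-day, 50-day, and 200-day. Generating several at once is not hard: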

apple["50d"] = np.round(apple["Close"].rolling(window = 50, center = False).mean(), 2)
apple["200d"] = np.round(apple["Close"].rolling(window = 200, center = False).mean(), 2)
 
pandas_candlestick_ohlc(apple.loc['2016-01-04':'2016-08-07',:], otherseries = ["20d", "50d", "200d"])

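The 20-day moving average is the most sensitive to small changes, while the 200-day moving average fluctuates the least. Here the 200-day average indicates an overall bearish trend: on the whole, the stock's value has been falling.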
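Since crossings were mentioned above as possible signs of a trend change, here is one simple way they could be located in code. This is only a sketch on made-up prices with very short windows, assuming that a "crossing" means the fast average moving from one side of the slow average to the other; it is not a trading rule from this article, and the names `regime` and `crossings` are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up closing prices that fall and then rise, for illustration only
close = pd.Series([10, 9, 8, 7, 6, 5, 6, 7, 8, 9, 10, 11], dtype=float)

fast = close.rolling(window=2).mean()   # reacts quickly
slow = close.rolling(window=4).mean()   # reacts slowly

# +1 where the fast average is above the slow one, -1 where below,
# 0 where they are tied or the averages are not yet defined
regime = np.sign(fast - slow).fillna(0)

# A crossing is a day on which the sign flips relative to the previous day.
# Days adjacent to a tie (0) are ignored in this simplified sketch.
prev = regime.shift(1).fillna(0)
crossings = regime[(regime != prev) & (regime != 0) & (prev != 0)]
print(crossings)
```

On this series the fast average crosses above the slow one once, shortly after prices turn upward; with real data the same idea would apply to the "20d", "50d", and "200d" columns computed above.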