
MNIST Machine Learning Dataset


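Introduction

When studying machine learning, one of the first things to do is prepare a commonly used dataset, so that results can be compared with other algorithms. Here I wrote a method for loading the MNIST dataset and wrapped it in a class; its main job is to convert the MNIST files into training data in numpy.array() format. Let's go straight to the code below (the main point is how to read a binary file with Python)!

Original MNIST dataset site: http://yann.lecun.com/exdb/mnist/

GitHub source download: dataset (source files + extracted files + digit images in jpg format), py source files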

File layout

/utils/data_util.py module containing the MNIST loading methods

/utils/test.py test file, a simple KNN test on the MNIST dataset

/data/train-images.idx3-ubyte training set X

/dataset/train-labels.idx1-ubyte training set y

/dataset/data/t10k-images.idx3-ubyte test set X

/dataset/data/t10k-labels.idx1-ubyte test set y

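MNIST dataset explained

After extracting the MNIST files, you will find that they are not in a standard image format. The image data is stored in binary files. Each sample image is 28*28 pixels (width by height).

The MNIST file structure is as follows, taking train-images as the example: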

[code]TRAINING SET IMAGE FILE (train-images-idx3-ubyte):

[offset] [type]          [value]          [description] 
0000     32 bit integer  0x00000803(2051) magic number 
0004     32 bit integer  60000            number of images 
0008     32 bit integer  28               number of rows 
0012     32 bit integer  28               number of columns 
0016     unsigned byte   ??               pixel 
0017     unsigned byte   ??               pixel 
........ 
xxxx     unsigned byte   ??               pixel



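First, the data is stored in binary, so the file must be opened in 'rb' mode. Second, only the [value] column is actually present in the file; [type] and the other columns merely describe it and are not stored in the data file. The [offset] column shows that the real pixels start at 0016, and each 32-bit integer takes 4 bytes, so before reading any pixel we have to read four 32-bit integers: magic number, number of images, number of rows, and number of columns. struct.unpack_from() is convenient for this.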
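
The header layout described above can be tried out without the real files. The following is a minimal sketch that builds a tiny IDX-style image buffer in memory (two all-zero 28x28 images, an assumption made purely for illustration) and reads the four header fields back with struct.unpack_from(). The helper name read_idx3_header is hypothetical and is not part of data_util.py.

```python
import struct

def read_idx3_header(buf):
    """Return (magic, num_images, rows, cols) from the first 16 bytes of buf."""
    # '>' selects big-endian, 'IIII' means four unsigned 32-bit integers.
    return struct.unpack_from('>IIII', buf, 0)

# Synthetic buffer: a 16-byte header followed by 2 * 28 * 28 pixel bytes.
header = struct.pack('>IIII', 2051, 2, 28, 28)
pixels = bytes(2 * 28 * 28)  # all-zero pixels, illustrative only
buf = header + pixels

magic, num_images, rows, cols = read_idx3_header(buf)
pixel_offset = struct.calcsize('>IIII')  # the pixels start right after the header
print(magic, num_images, rows, cols, pixel_offset)
```

With a real file, buf would instead come from reading the whole file opened in 'rb' mode.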

Source code

Notes:

'>IIII' means reading 4 unsigned 32-bit integers in big-endian byte order

'>784B' means reading 784 unsigned bytes in big-endian byte order
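
To make the two format strings more concrete, here is a small sketch using only the standard struct module. It shows how many bytes each format consumes and what happens to the magic number if the byte order is wrong. The variable names are illustrative and do not come from the original code.

```python
import struct

# Size in bytes consumed by each format string.
header_size = struct.calcsize('>IIII')  # four 4-byte unsigned integers
image_size = struct.calcsize('>784B')   # 784 one-byte pixels (28 * 28)

# The magic number 2051 as it is laid out in the file (big-endian).
raw = struct.pack('>I', 2051)

# Reading the same 4 bytes with the correct and the wrong byte order.
magic_big = struct.unpack('>I', raw)[0]
magic_little = struct.unpack('<I', raw)[0]

print(header_size, image_size)
print(magic_big, magic_little)
```

For single bytes ('B') the byte order has no effect on the values, so the '>' prefix in '>784B' mainly keeps the format strings consistent.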

The data_util.py file

[code]# -*- coding: utf-8 -*-
"""
Created on Thu Feb 25 14:40:06 2016
load MNIST dataset
@author: liudiwei
"""
import numpy as np 
import struct
import matplotlib.pyplot as plt 
import os

class DataUtils(object):
    """Loader for the MNIST dataset.
    Output format: numpy.array()

    Usage:
    from data_util import DataUtils
    def main():
        trainfile_X = '../dataset/MNIST/train-images.idx3-ubyte'
        trainfile_y = '../dataset/MNIST/train-labels.idx1-ubyte'
        testfile_X = '../dataset/MNIST/t10k-images.idx3-ubyte'
        testfile_y = '../dataset/MNIST/t10k-labels.idx1-ubyte'

        train_X = DataUtils(filename=trainfile_X).getImage()
        train_y = DataUtils(filename=trainfile_y).getLabel()
        test_X = DataUtils(testfile_X).getImage()
        test_y = DataUtils(testfile_y).getLabel()

        # The following saves the images to local files
        #path_trainset = "../dataset/MNIST/imgs_train"
        #path_testset = "../dataset/MNIST/imgs_test"
        #if not os.path.exists(path_trainset):
        #    os.mkdir(path_trainset)
        #if not os.path.exists(path_testset):
        #    os.mkdir(path_testset)
        #DataUtils(outpath=path_trainset).outImg(train_X, train_y)
        #DataUtils(outpath=path_testset).outImg(test_X, test_y)

        return train_X, train_y, test_X, test_y
    """

    def __init__(self, filename=None, outpath=None):
        self._filename = filename
        self._outpath = outpath

        self._tag = '>'               # big-endian byte order
        self._twoBytes = 'II'         # label file header: two unsigned 32-bit ints
        self._fourBytes = 'IIII'      # image file header: four unsigned 32-bit ints
        self._pictureBytes = '784B'   # one image: 28 * 28 unsigned bytes
        self._labelByte = '1B'        # one label: a single unsigned byte
        self._twoBytes2 = self._tag + self._twoBytes
        self._fourBytes2 = self._tag + self._fourBytes
        self._pictureBytes2 = self._tag + self._pictureBytes
        self._labelByte2 = self._tag + self._labelByte

    def getImage(self):
        """
        Convert an MNIST binary image file into pixel feature data.
        """
        # Open the file in binary mode and read it in one go
        with open(self._filename, 'rb') as binfile:
            buf = binfile.read()
        index = 0
        # Header: magic number, image count, rows, columns (big-endian)
        numMagic, numImgs, numRows, numCols = struct.unpack_from(
            self._fourBytes2, buf, index)
        index += struct.calcsize(self._fourBytes2)
        images = []
        for i in range(numImgs):
            imgVal = struct.unpack_from(self._pictureBytes2, buf, index)
            index += struct.calcsize(self._pictureBytes2)
            imgVal = list(imgVal)
            # Binarize: every pixel value greater than 1 becomes 1
            for j in range(len(imgVal)):
                if imgVal[j] > 1:
                    imgVal[j] = 1
            images.append(imgVal)
        return np.array(images)

    def getLabel(self):
        """
        Convert an MNIST binary label file into the corresponding digit labels.
        """
        with open(self._filename, 'rb') as binFile:
            buf = binFile.read()
        index = 0
        # Header: magic number and item count (big-endian)
        magic, numItems = struct.unpack_from(self._twoBytes2, buf, index)
        index += struct.calcsize(self._twoBytes2)
        labels = []
        for x in range(numItems):
            im = struct.unpack_from(self._labelByte2, buf, index)
            index += struct.calcsize(self._labelByte2)
            labels.append(im[0])
        return np.array(labels)

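    def outImg(self, arrX, arrY):
        """
        Write png images from the generated features and digit labels.
        """
        m, n = np.shape(arrX)
        # Each image is 28 * 28 = 784 bytes
        # Only the first image is written here; use range(m) to export them all
        for i in range(1):
            img = np.array(arrX[i])
            img = img.reshape(28, 28)
            outfile = str(i) + "_" + str(arrY[i]) + ".png"
            plt.figure()
            plt.imshow(img, cmap='binary')  # show the image in black and white
            plt.savefig(self._outpath + "/" + outfile)
            plt.close()  # release the figure once it has been saved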


The test.py file: a simple test of the KNN algorithm. The code is as follows:

[code]# -*- coding: utf-8 -*-
"""
Created on Thu Feb 25 16:09:58 2016
Test MNIST dataset 
@author: liudiwei
"""

from sklearn import neighbors  
from data_util import DataUtils
import datetime  

def main():
    trainfile_X = '../dataset/MNIST/train-images.idx3-ubyte'
    trainfile_y = '../dataset/MNIST/train-labels.idx1-ubyte'
    testfile_X = '../dataset/MNIST/t10k-images.idx3-ubyte'
    testfile_y = '../dataset/MNIST/t10k-labels.idx1-ubyte'
    train_X = DataUtils(filename=trainfile_X).getImage()
    train_y = DataUtils(filename=trainfile_y).getLabel()
    test_X = DataUtils(testfile_X).getImage()
    test_y = DataUtils(testfile_y).getLabel()

    return train_X, train_y, test_X, test_y

def testKNN():
    train_X, train_y, test_X, test_y = main()
    startTime = datetime.datetime.now()
    knn = neighbors.KNeighborsClassifier(n_neighbors=3)
    knn.fit(train_X, train_y)
    match = 0
    for i in range(len(test_y)):
        # predict() expects a 2-D array, so reshape the single sample to (1, 784)
        predictLabel = knn.predict(test_X[i].reshape(1, -1))[0]
        if predictLabel == test_y[i]:
            match += 1

    endTime = datetime.datetime.now()
    print('use time: ' + str(endTime - startTime))
    print('error rate: ' + str(1 - (match * 1.0 / len(test_y))))

if __name__ == "__main__":
    testKNN()



The main method directly returns the data in numpy.array() format: train_X, train_y, test_X, test_y. If you need the data, just call main!

For more machine learning articles, visit http://www.csuldw.com.
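
Before feeding the returned arrays to a classifier, it can help to check that they have the expected shape. Below is a minimal sketch of such a check. The helper check_mnist_arrays is hypothetical (it does not exist in data_util.py or test.py), and it is run here on synthetic arrays rather than on the real dataset.

```python
import numpy as np

def check_mnist_arrays(X, y, num_pixels=784):
    """Validate an image/label pair and return the number of samples."""
    X = np.asarray(X)
    y = np.asarray(y)
    if X.ndim != 2 or X.shape[1] != num_pixels:
        raise ValueError("X must have shape (n, %d)" % num_pixels)
    if y.ndim != 1 or y.shape[0] != X.shape[0]:
        raise ValueError("y must have one label per image")
    if y.size and (y.min() < 0 or y.max() > 9):
        raise ValueError("labels must be digits in the range 0..9")
    return X.shape[0]

# Synthetic stand-ins for train_X and train_y (illustrative only).
fake_X = np.zeros((5, 784), dtype=np.uint8)
fake_y = np.array([0, 1, 2, 3, 4])
num_samples = check_mnist_arrays(fake_X, fake_y)
print(num_samples)
```

With the real data, the same call would be made on the arrays returned by main, for example check_mnist_arrays(train_X, train_y).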
