Machine Learning Datasets: The MNIST Dataset

阿新 • Published: 2018-11-22

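The MNIST dataset is a large database of handwritten digits. It is commonly used to train image processing systems and is widely used for training and testing in machine learning. Its images are drawn from two databases of NIST (National Institute of Standards and Technology): Special Database 1 and Special Database 3. The digits were written by roughly 250 people, half of them high school students and half of them employees of the United States Census Bureau.
MNIST is split into a training set of 60,000 images and a test set of 10,000 images. Each image is 28x28 pixels and is grayscale with a bit depth of 8 (pixel values range from 0 to 255).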

I. Downloading MNIST

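1. Manual download
Download page: http://yann.lecun.com/exdb/mnist/

The MNIST dataset consists of four files. Download the four archives and decompress them. The decompressed files are not in a standard image format; the image data is stored in binary files. Files whose names start with train belong to the training set and files whose names start with t10k belong to the test set. The images files hold the images, and the labels files hold the corresponding labels.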
train-images-idx3-ubyte.gz: training set images (9912422 bytes)
train-labels-idx1-ubyte.gz: training set labels (28881 bytes)
t10k-images-idx3-ubyte.gz: test set images (1648877 bytes)
t10k-labels-idx1-ubyte.gz: test set labels (4542 bytes)
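
The decompression step can also be scripted. The following is a minimal sketch using only the Python standard library, assuming the four archives have already been downloaded into one directory; `gunzip` and `gunzip_all` are hypothetical helper names, not part of any MNIST tooling.

```python
import gzip
import shutil
from pathlib import Path

# The four archive names listed on the MNIST download page.
MNIST_ARCHIVES = [
    "train-images-idx3-ubyte.gz",
    "train-labels-idx1-ubyte.gz",
    "t10k-images-idx3-ubyte.gz",
    "t10k-labels-idx1-ubyte.gz",
]


def gunzip(archive_path):
    """Decompress one .gz archive next to itself and return the output path."""
    src = Path(archive_path)
    dst = src.with_suffix("")  # drops the trailing ".gz"
    with gzip.open(src, "rb") as f_in, open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst


def gunzip_all(data_dir="."):
    """Decompress every MNIST archive that exists in data_dir."""
    data_dir = Path(data_dir)
    return [gunzip(data_dir / name)
            for name in MNIST_ARCHIVES
            if (data_dir / name).exists()]
```

Calling `gunzip_all('.')` writes files with hyphenated names such as `train-images-idx3-ubyte`. The reading code later in this post opens `train-images.idx3-ubyte`, so rename the files or adjust the file names in that code to match.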

2. Downloading with TensorFlow

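from tensorflow.examples.tutorials.mnist import input_data  # available in TensorFlow 1.x only
# Download the MNIST dataset into /tmp/
mnist = input_data.read_data_sets('/tmp/', one_hot=True)
# Labels can only be the digits 0-9, so a neural network can encode them with 10 output nodes;
# one_hot=True returns each label as a 10-element vector with a single 1.
# /tmp is the temporary directory on both macOS and Linux; its contents may be lost on reboot.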
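
To make the `one_hot=True` option concrete, here is a small NumPy sketch of one-hot encoding. The `one_hot` function is a hypothetical helper written for illustration; it is not TensorFlow's implementation.

```python
import numpy as np


def one_hot(labels, num_classes=10):
    """Return a (len(labels), num_classes) array with a single 1.0 per row."""
    labels = np.asarray(labels, dtype=np.int64)
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    # Row i gets a 1.0 in the column given by labels[i].
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


# Three example digits; each row has its 1.0 at the digit's index.
print(one_hot([5, 0, 4]))
```

Taking `argmax` along each row recovers the original digit, which is how a prediction from 10 output nodes is usually turned back into a label.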

II. Reading the Data

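1. Reading a single image
MNIST images are 28x28 pixels. We start by reading the first image in the train-images file.
Note: the images file begins with a header of four 32-bit integers (magic number, number of images, rows, columns), which must be skipped.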

import numpy as np
import struct
import matplotlib.pyplot as plt

# Decompressed training images; use 'train-images-idx3-ubyte' instead if your
# extraction tool kept the hyphenated name of the archive
with open('train-images.idx3-ubyte', 'rb') as binfile:
    buf = binfile.read()

# Header: four big-endian unsigned 32-bit integers
index = 0
magic, numImages, numRows, numColumns = struct.unpack_from('>IIII', buf, index)
index += struct.calcsize('>IIII')

# The first image: 28x28 = 784 unsigned bytes
im = struct.unpack_from('>784B', buf, index)
index += struct.calcsize('>784B')
im = np.array(im, dtype=np.uint8).reshape(28, 28)

# Show the first image
plt.imshow(im, cmap='gray')
plt.show()
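
As an alternative to unpacking one image at a time, the whole pixel block can be viewed as a NumPy array in a single call. This is a sketch under the assumption that the file follows the IDX layout described above, with a magic number of 2051 for image files; `parse_idx_images` is a hypothetical helper name, and the buffer below is synthetic so the example runs without the real file.

```python
import struct

import numpy as np

IMAGE_MAGIC = 2051  # 0x00000803: unsigned-byte data with 3 dimensions


def parse_idx_images(buf):
    """Parse the raw bytes of an IDX image file into an (n, rows, cols) uint8 array."""
    magic, num_images, rows, cols = struct.unpack_from(">IIII", buf, 0)
    if magic != IMAGE_MAGIC:
        raise ValueError("unexpected magic number: %d" % magic)
    header_size = struct.calcsize(">IIII")  # 16 bytes
    pixels = np.frombuffer(buf, dtype=np.uint8,
                           count=num_images * rows * cols,
                           offset=header_size)
    return pixels.reshape(num_images, rows, cols)


# Synthetic buffer holding two 28x28 images (all 7s, then all 9s).
fake = (struct.pack(">IIII", IMAGE_MAGIC, 2, 28, 28)
        + bytes([7]) * 784 + bytes([9]) * 784)
images = parse_idx_images(fake)
print(images.shape, images.dtype)
```

With the real file, `parse_idx_images(buf)[0]` should give the same 28x28 array as the `struct`-based code above. Note that `np.frombuffer` returns a read-only view, so call `.copy()` before modifying pixels.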

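2. Reading multiple images
The code below reads the first 100 images and labels of the t10k test set, displays them, and saves the images to a folder.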

import os
import numpy as np
import struct
import matplotlib.pyplot as plt
import cv2

def readfile():
    # Read the raw bytes of the test images and the test labels
    with open('t10k-images.idx3-ubyte', 'rb') as binfile1:
        buf1 = binfile1.read()
    with open('t10k-labels.idx1-ubyte', 'rb') as binfile2:
        buf2 = binfile2.read()
    return buf1, buf2

def get_image(buf1):
    # Skip the 16-byte header (magic, image count, rows, columns)
    magic, numImages, imgRows, imgCols = struct.unpack_from('>IIII', buf1, 0)
    image_index = struct.calcsize('>IIII')
    im = []
    for i in range(100):
        temp = struct.unpack_from('>784B', buf1, image_index)
        # uint8 keeps the 8-bit depth and is a dtype that cv2.imwrite accepts
        im.append(np.array(temp, dtype=np.uint8).reshape(28, 28))
        image_index += struct.calcsize('>784B')
    return im

def get_label(buf2):
    # Skip the 8-byte header (magic, label count), then read 100 labels
    label_index = struct.calcsize('>II')
    return struct.unpack_from('>100B', buf2, label_index)

if __name__ == "__main__":
    image_data, label_data = readfile()
    im1 = get_image(image_data)
    label = get_label(label_data)

    out_dir = 'testIM'  # output folder for the saved images (created if missing)
    os.makedirs(out_dir, exist_ok=True)
    for i in range(100):
        plt.subplot(10, 10, i + 1)
        plt.title(str(label[i]))
        plt.imshow(im1[i], cmap='gray')
        cv2.imwrite(os.path.join(out_dir, 'testIM' + str(i) + '.jpg'), im1[i])
    plt.show()

Result for multiple images (figure not preserved): a 10x10 grid of the 100 test digits, each titled with its label.
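
The label reader above hardcodes 100 labels. A variant that takes the count from the file header could look like the following sketch; it assumes the IDX label layout (a magic number of 2049 followed by the item count), and `parse_idx_labels` is a hypothetical helper name. The buffer is synthetic so the example runs without the real file.

```python
import struct

LABEL_MAGIC = 2049  # 0x00000801: unsigned-byte data with 1 dimension


def parse_idx_labels(buf, limit=None):
    """Return the labels in an IDX label file as a list of ints.

    If limit is given, at most that many labels are returned.
    """
    magic, num_items = struct.unpack_from(">II", buf, 0)
    if magic != LABEL_MAGIC:
        raise ValueError("unexpected magic number: %d" % magic)
    count = num_items if limit is None else min(limit, num_items)
    offset = struct.calcsize(">II")  # 8-byte header
    # Slicing bytes and converting to a list yields one int per label.
    return list(buf[offset:offset + count])


# Synthetic label file with five labels.
fake_labels = struct.pack(">II", LABEL_MAGIC, 5) + bytes([7, 2, 1, 0, 4])
print(parse_idx_labels(fake_labels, limit=3))
```

Reading the count from the header avoids an out-of-range read when a file holds fewer labels than expected, and `limit=100` reproduces the behavior of `get_label`.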

III. Using MNIST in TensorFlow

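MNIST is widely used in machine learning as a benchmark, for example for training Softmax regression or for training a CNN with visualization. In TensorFlow, MNIST can be loaded directly: just import input_data.py, with no need to convert the binary files into image files first. The data is loaded with read_data_sets (provided by tensorflow.contrib.learn and exposed through input_data), where FLAGS.data_dir is the directory in which MNIST is stored. The code is as follows: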

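import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data  # available in TensorFlow 1.x only
# FLAGS.data_dir is the MNIST directory; the '/tmp/' default here is only an example
tf.app.flags.DEFINE_string('data_dir', '/tmp/', 'Directory for the MNIST data')
FLAGS = tf.app.flags.FLAGS
mnist = input_data.read_data_sets(FLAGS.data_dir, one_hot=True)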
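
To my understanding, with its default settings `read_data_sets` hands back images flattened to 784 values and scaled to the range [0, 1], rather than as 28x28 arrays of bytes. The NumPy sketch below mimics that preprocessing on raw images read from the binary files; `preprocess` is a hypothetical helper, not TensorFlow code.

```python
import numpy as np


def preprocess(images_uint8):
    """Flatten (n, 28, 28) uint8 images to (n, 784) float32 values in [0, 1]."""
    images = np.asarray(images_uint8, dtype=np.uint8)
    flat = images.reshape(images.shape[0], -1).astype(np.float32)
    return flat / 255.0


# Three synthetic all-white images stand in for real MNIST data.
demo = np.full((3, 28, 28), 255, dtype=np.uint8)
features = preprocess(demo)
print(features.shape, features.dtype)
```

This is the form that models such as Softmax regression typically consume: one row of 784 features per image.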

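That concludes this introduction to MNIST. It is an entry-level dataset for machine learning, and models reach very good results on it for tasks such as classification. More dataset introductions will follow, so stay tuned. Thanks, everyone!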
