
The MNIST dataset: file format and how to read it

MNIST website
http://yann.lecun.com/exdb/mnist/

The site provides four files:

train-images-idx3-ubyte.gz:  training set images (9912422 bytes) 
train-labels-idx1-ubyte.gz:  training set labels (28881 bytes) 
t10k-images-idx3-ubyte.gz:   test set images (1648877 bytes) 
t10k-labels-idx1-ubyte.gz:   test set labels (4542 bytes)

After downloading, decompress them:

$ gunzip *.gz

t10k-images-idx3-ubyte
train-images-idx3-ubyte
t10k-labels-idx1-ubyte
train-labels-idx1-ubyte

Decompression produces the four files listed above.
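
Decompressing on disk is not strictly required: Python's standard `gzip` module can read the `.gz` files directly. The sketch below is one possible approach, not the method used in the rest of this post. `read_gz_header` is a hypothetical helper, and the demo writes a tiny synthetic file to a temporary directory so that it runs without the real dataset.

```python
import gzip
import os
import struct
import tempfile

def read_gz_header(path, n_fields):
    """Read n_fields big-endian uint32 values from the start of a gzip file."""
    with gzip.open(path, "rb") as f:
        raw = f.read(4 * n_fields)
    return struct.unpack(">%dI" % n_fields, raw)

# Build a small stand-in for a label file: an 8-byte header plus three labels.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo-labels-idx1-ubyte.gz")
with gzip.open(demo_path, "wb") as f:
    f.write(struct.pack(">2I", 2049, 3) + bytes(bytearray([5, 0, 4])))

demo_header = read_gz_header(demo_path, 2)
print(demo_header)
```

With the real data, the same helper could be pointed at `train-labels-idx1-ubyte.gz` (two fields) or `train-images-idx3-ubyte.gz` (four fields).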

File format

There are 4 files:

train-images-idx3-ubyte: training set images 
train-labels-idx1-ubyte: training set labels 
t10k-images-idx3-ubyte:  test set images 
t10k-labels-idx1-ubyte:  test set labels

The training set contains 60000 examples, and the test set 10000 examples.

The first 5000 examples of the test set are taken from the original NIST training set. The last 5000 are taken from the original NIST test set. The first 5000 are cleaner and easier than the last 5000.

TRAINING SET LABEL FILE (train-labels-idx1-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  60000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label

The labels values are 0 to 9.

TRAINING SET IMAGE FILE (train-images-idx3-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  60000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).

TEST SET LABEL FILE (t10k-labels-idx1-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000801(2049) magic number (MSB first)
0004     32 bit integer  10000            number of items
0008     unsigned byte   ??               label
0009     unsigned byte   ??               label
........
xxxx     unsigned byte   ??               label

The labels values are 0 to 9.

TEST SET IMAGE FILE (t10k-images-idx3-ubyte):

[offset] [type]          [value]          [description]
0000     32 bit integer  0x00000803(2051) magic number
0004     32 bit integer  10000            number of images
0008     32 bit integer  28               number of rows
0012     32 bit integer  28               number of columns
0016     unsigned byte   ??               pixel
0017     unsigned byte   ??               pixel
........
xxxx     unsigned byte   ??               pixel

Pixels are organized row-wise. Pixel values are 0 to 255. 0 means background (white), 255 means foreground (black).
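
The layout quoted above implies an exact size for each decompressed file: the header length plus one byte per label, or 28*28 bytes per image. A quick sketch of that arithmetic follows; `expected_idx_size` is a hypothetical helper, and the figures assume the files follow the documented layout with no trailing data.

```python
def expected_idx_size(header_bytes, count, item_bytes):
    """Size in bytes of an IDX file: header plus count items of item_bytes each."""
    return header_bytes + count * item_bytes

sizes = {
    "train-images-idx3-ubyte": expected_idx_size(16, 60000, 28 * 28),
    "train-labels-idx1-ubyte": expected_idx_size(8, 60000, 1),
    "t10k-images-idx3-ubyte": expected_idx_size(16, 10000, 28 * 28),
    "t10k-labels-idx1-ubyte": expected_idx_size(8, 10000, 1),
}

for name in sorted(sizes):
    print("%s: %d bytes" % (name, sizes[name]))
```

Comparing these numbers with `os.path.getsize` on the decompressed files is a cheap way to catch a truncated download.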

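The first 16 bytes of an image file are the header: a 4-byte magic number, 4 bytes for the number of images, 4 bytes for the number of rows per image, and 4 bytes for the number of columns per image.
The first 8 bytes of a label file are the header: a 4-byte magic number and 4 bytes for the number of labels.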
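
As a minimal sketch of the header layout just described, the snippet below packs a synthetic 16-byte image header and an 8-byte label header in memory and decodes them with `struct.unpack`. The helper name `parse_idx_header` is hypothetical and not part of the original post; the format strings `>4I` and `>2I` (big-endian unsigned 32-bit integers) are the ones this post uses for the real files.

```python
import struct

def parse_idx_header(raw, n_fields):
    """Decode n_fields big-endian 32-bit unsigned integers from raw bytes."""
    fmt = ">%dI" % n_fields
    size = struct.calcsize(fmt)
    if len(raw) < size:
        raise ValueError("header too short: %d < %d bytes" % (len(raw), size))
    return struct.unpack(fmt, raw[:size])

# Synthetic headers built in memory so the sketch runs without the data files.
image_header = struct.pack(">4I", 2051, 60000, 28, 28)
label_header = struct.pack(">2I", 2049, 60000)

print(parse_idx_header(image_header, 4))
print(parse_idx_header(label_header, 2))
```

The `>` prefix matters: the integers are stored most significant byte first, so decoding them with the native byte order of a little-endian machine would give wrong values.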

Now let's read the files:

from __future__ import division                                                 
from __future__ import print_function                                           
                                                                                
#gunzip *.gz                                                                    
#http://yann.lecun.com/exdb/mnist/                                              
                                                                                
import os                                                                       
import sys                                                                      
import struct                                                                   
                                                                                
file_list = [                                                                   
            "train-images-idx3-ubyte",                                          
            "train-labels-idx1-ubyte",                                          
            "t10k-images-idx3-ubyte",                                           
            "t10k-labels-idx1-ubyte",                                           
            ]                                                                   
                                                                                
def create_path(path):                                                          
    if not os.path.isdir(path):                                                 
        os.makedirs(path)                                                       
                                                                                
def get_file_full_name(path, name):                                             
    create_path(path)                                                           
    if path[-1] == "/":                                                         
        full_name = path +  name                                                
    else:                                                                       
        full_name = path + "/" +  name                                          
    return full_name                                                            
                                                                                
def read_mnist(file_name):
    file_path = "/home/your/data/path"
    full_path = get_file_full_name(file_path, file_name)
    # Binary mode is required on Python 3; on Python 2 plain 'r' also works.
    file_object = open(full_path, 'rb')
    return file_object

def get_file_header_data(file_name, header_len, unpack_str):
    f = read_mnist(file_name)
    try:
        raw_header = f.read(header_len)
    finally:
        f.close()
    # ">4I" / ">2I": big-endian unsigned 32-bit integers
    header_data = struct.unpack(unpack_str, raw_header)
    return header_data
                                                                                
def show_images_file_header(file_name):                                         
    show_file_header(file_name, 16, ">4I")                                      
                                                                                
def show_labels_file_header(file_name):                                         
    show_file_header(file_name, 8, ">2I")                                       
                                                                                
def show_file_header(file_name, header_len, unpack_str):                        
    header_data = get_file_header_data(file_name, header_len, unpack_str)       
    print("%s header data:%s" % (file_name, header_data))                       
                                                                                
def show_mnist_file_header():                                                   
    train_images_file_name = file_list[0]                                       
    show_images_file_header(train_images_file_name)                             
                                                                                
    test_images_file_name = file_list[2]                                        
    show_images_file_header(test_images_file_name)                              
                                                                                
    train_labels_file_name = file_list[1]                                       
    show_labels_file_header(train_labels_file_name)                             
                                                                                
    test_labels_file_name = file_list[3]                                        
    show_labels_file_header(test_labels_file_name)                              
                                                                                
def run():                                                                      
    show_mnist_file_header()                                                    
                                                                                
run()              

Output

train-images-idx3-ubyte header data:(2051, 60000, 28, 28)
t10k-images-idx3-ubyte header data:(2051, 10000, 28, 28)
train-labels-idx1-ubyte header data:(2049, 60000)
t10k-labels-idx1-ubyte header data:(2049, 10000)
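
The magic numbers 2051 and 2049 in this output are not arbitrary. In the IDX format the first two bytes are zero, the third byte encodes the element type (0x08 means unsigned byte), and the fourth byte gives the number of dimensions, which is also where the `idx3` and `idx1` in the file names come from. A small sketch that decodes this, using a hypothetical helper name `decode_magic`:

```python
def decode_magic(magic):
    """Split an IDX magic number into (type_code, number_of_dimensions)."""
    if magic >> 16 != 0:
        raise ValueError("not an IDX magic number: first two bytes must be 0")
    type_code = (magic >> 8) & 0xFF   # 0x08 = unsigned byte
    n_dims = magic & 0xFF             # 3 for image files, 1 for label files
    return type_code, n_dims

images_magic = decode_magic(2051)   # 0x00000803
labels_magic = decode_magic(2049)   # 0x00000801
print(images_magic, labels_magic)
```

Checking the decoded values before reading any data is a simple guard against opening the wrong file or a file that is still compressed.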

Next we read a single image and display it together with its label:

from __future__ import division                                                 
from __future__ import print_function                                           
                                                                                
#gunzip *.gz                                                                    
#http://yann.lecun.com/exdb/mnist/                                              
                                                                                
import os                                                                       
import sys                                                                      
import struct                                                                   
import numpy as np                                                              
import matplotlib.pyplot as plt                                                 
from PIL import Image                                                           
                                                                                
file_list = [                                                                   
            "train-images-idx3-ubyte",                                          
            "train-labels-idx1-ubyte",                                          
            "t10k-images-idx3-ubyte",                                           
            "t10k-labels-idx1-ubyte",                                           
            ]                                                                   
                                                                                
def create_path(path):                                                          
    if not os.path.isdir(path):                                                 
        os.makedirs(path)                                                       
                                                                                
def get_file_full_name(path, name):                                             
    create_path(path)                                                           
    if path[-1] == "/":                                                         
        full_name = path +  name                                                
    else:                                                                       
        full_name = path + "/" +  name                                          
    return full_name                                                            
                                                                                
def read_mnist(file_name):                                                             
    file_path = "/home/your/data/path"                                         
    full_path = get_file_full_name(file_path, file_name)                        
    file_object = open(full_path, 'rb')  #python3 need rb  python2 r is ok         
    return file_object                                                          
                                                                                
def get_file_header_data(file_obj, header_len, unpack_str):                     
    raw_header = file_obj.read(header_len)                                      
    header_data = struct.unpack(unpack_str, raw_header)                         
    return header_data

def show_images_file_header(file_name):
    show_file_header(file_name, 16, ">4I")                                      
                                                                                
def show_labels_file_header(file_name):                                         
    show_file_header(file_name, 8, ">2I")                                       
                                                                                
def show_file_header(file_name, header_len, unpack_str):                        
    file_obj = read_mnist(file_name)                                            
    header_data = get_file_header_data(file_obj, header_len, unpack_str)        
    show_file_header_data(file_name, header_data)                               
    file_obj.close()                                                            
                                                                                
def show_mnist_file_header():                                                   
    train_images_file_name = file_list[0]                                       
    show_images_file_header(train_images_file_name)                             
                                                                                
    test_images_file_name = file_list[2]                                        
    show_images_file_header(test_images_file_name)                              
                                                                                
    train_labels_file_name = file_list[1]                                       
    show_labels_file_header(train_labels_file_name)                             
                                                                                
    test_labels_file_name = file_list[3]                                        
    show_labels_file_header(test_labels_file_name)                              
                                                                                
def read_a_image(file_object):
    # Each image is 28*28 = 784 unsigned bytes, stored row by row.
    img = file_object.read(28*28)
    tp = struct.unpack(">784B",img)
    image = np.asarray(tp)
    image = image.reshape((28,28))
    #image = image.astype(np.float64)
    plt.imshow(image,cmap = plt.cm.gray)
    plt.show()

def read_a_label(file_object):
    # Each label is a single unsigned byte.
    raw_label = file_object.read(1)
    tp = struct.unpack(">B",raw_label)
    print("the label is :%s" % tp[0])
                                                                                
def show_file_header_data(file_name,header_data):                               
    print("%s header data:%s" % (file_name, header_data))
    
def show_a_image():                                                             
    images_file_name = file_list[0]                                             
    labels_file_name = file_list[1]                                             
    images_file = read_mnist(images_file_name)                                  
    header_data = get_file_header_data(images_file, 16, ">4I")                  
    show_file_header_data(images_file_name, header_data)                        
                                                                                
    labels_file = read_mnist(labels_file_name)                                  
    header_data = get_file_header_data(labels_file, 8, ">2I")                   
    show_file_header_data(labels_file_name, header_data)                        
                                                                                
    read_a_image(images_file)
    read_a_label(labels_file)

    images_file.close()
    labels_file.close()
                                                                                                                                                                       
def run():                                                                      
    #show_mnist_file_header()                                                   
    show_a_image()       
                                                                                                                                                                                          
run()

Output

train-images-idx3-ubyte header data:(2051, 60000, 28, 28)
train-labels-idx1-ubyte header data:(2049, 60000)
the label is :5

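Then the image is displayed:
[Image: the first training image, a handwritten 5]

Good, the image and the label agree: both are 5.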
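
Unpacking 784 separate values with `struct` works, but NumPy can also interpret the raw bytes directly. The sketch below uses `np.frombuffer` on a synthetic 28x28 image held in memory. `bytes_to_image` is a hypothetical helper name and the pixel data is made up for the demo; it is an alternative to the post's approach, not a replacement for it.

```python
import numpy as np

ROWS, COLS = 28, 28

def bytes_to_image(raw, rows=ROWS, cols=COLS):
    """Interpret rows*cols raw bytes as a row-major uint8 image."""
    if len(raw) != rows * cols:
        raise ValueError("expected %d bytes, got %d" % (rows * cols, len(raw)))
    return np.frombuffer(raw, dtype=np.uint8).reshape(rows, cols)

# Synthetic pixel data: byte values cycling through 0..255, not a real digit.
raw_pixels = bytes(bytearray(i % 256 for i in range(ROWS * COLS)))
image = bytes_to_image(raw_pixels)
print(image.shape, image.dtype, image[0, :4])
```

Because the pixels are single bytes, byte order is not a concern here; it only matters for the 32-bit header fields.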

Next we modify the code so that it generates batches of data automatically:

from __future__ import division    
from __future__ import print_function    
    
#gunzip *.gz    
#http://yann.lecun.com/exdb/mnist/    
    
import os    
import sys    
import struct    
import numpy as np    
import matplotlib.pyplot as plt    
from PIL import Image    
    
file_list = [    
            "train-images-idx3-ubyte",    
            "train-labels-idx1-ubyte",    
            "t10k-images-idx3-ubyte",    
            "t10k-labels-idx1-ubyte",    
            ]    
    
def show_images_file_header(file_name):    
    show_file_header(file_name, 16, ">4I")    
    
def show_labels_file_header(file_name):    
    show_file_header(file_name, 8, ">2I")    
    
def show_file_header(file_name, header_len, unpack_str):    
    file_obj = read_mnist(file_name)    
    header_data = get_file_header_data(file_obj, header_len, unpack_str)        
    show_file_header_data(file_name, header_data)    
    file_obj.close()  
    
def show_mnist_file_header():    
    train_images_file_name = file_list[0]    
    show_images_file_header(train_images_file_name)    

    test_images_file_name = file_list[2]
    show_images_file_header(test_images_file_name)

    train_labels_file_name = file_list[1]
    show_labels_file_header(train_labels_file_name)

    test_labels_file_name = file_list[3]
    show_labels_file_header(test_labels_file_name)

def show_a_image(file_object):
    tp = read_a_image(file_object)
    image = np.asarray(tp)
    image = image.reshape((28,28))
    plt.imshow(image,cmap = plt.cm.gray)
    plt.show()

def show_a_label(file_object):
    tp = read_a_label(file_object)
    print("the label is :%s" % tp)

def show_file_header_data(file_name,header_data):
    print("%s header data:%s" % (file_name, header_data))

# Open the training files, print both headers, then show the first
# image and its label.
def show_first_image_and_label():
    images_file_name = file_list[0]
    labels_file_name = file_list[1]
    images_file = read_mnist(images_file_name)
    header_data = get_file_header_data(images_file, 16, ">4I")
    show_file_header_data(images_file_name, header_data)

    labels_file = read_mnist(labels_file_name)
    header_data = get_file_header_data(labels_file, 8, ">2I")
    show_file_header_data(labels_file_name, header_data)

    show_a_image(images_file)
    show_a_label(labels_file)

    images_file.close()
    labels_file.close()

def create_path(path):
    if not os.path.isdir(path):
        os.makedirs(path)

def get_file_full_name(path, name):
    create_path(path)
    if path[-1] == "/":
        full_name = path +  name
    else:
        full_name = path + "/" +  name
    return full_name

def read_mnist(file_name):         
    file_path = "/home/your/data/path"
    full_path = get_file_full_name(file_path, file_name)
    file_object = open(full_path, 'rb')  #python3 need rb  python2 r is ok      
    return file_object

def get_file_header_data(file_obj, header_len, unpack_str):
    raw_header = file_obj.read(header_len)
    header_data = struct.unpack(unpack_str, raw_header)
    return header_data

def read_a_image(file_object):
    raw_img = file_object.read(28*28)
    img = struct.unpack(">784B",raw_img)
    return img

def read_a_label(file_object):
    raw_label = file_object.read(1)
    # struct.unpack always returns a tuple; take the single value
    label = struct.unpack(">B",raw_label)
    return label[0]

def generate_a_batch(images_file_name,labels_file_name,batch_size=8):
    images_file = read_mnist(images_file_name)
    header_data = get_file_header_data(images_file, 16, ">4I")
    #show_file_header_data(images_file_name, header_data)

    labels_file = read_mnist(labels_file_name)
    header_data = get_file_header_data(labels_file, 8, ">2I")
    #show_file_header_data(labels_file_name, header_data)

    while True:
        images = []
        labels = []
        for i in range(batch_size):
            try:
                image = read_a_image(images_file)
                label = read_a_label(labels_file)
                images.append(image)
                labels.append(label)
            except struct.error:
                # A short read at end of file makes struct.unpack fail
                break
        # An empty batch signals that the files are exhausted
        yield images,labels

def get_train_data_generator():
    images_file_name = file_list[0]
    labels_file_name = file_list[1]
    generator = generate_a_batch(images_file_name,labels_file_name)
    return generator

def get_test_data_generator():
    images_file_name = file_list[2]
    labels_file_name = file_list[3]
    generator = generate_a_batch(images_file_name,labels_file_name)
    return generator

def get_a_batch(data_generator):
    # The built-in next() works on both Python 2.6+ and Python 3
    batch_img, batch_labels = next(data_generator)
    return batch_img,batch_labels

def generate_test_batch():
    data_generator = get_test_data_generator()
    count = 1
    while count:
        batch_img,batch_labels = get_a_batch(data_generator)
        if not batch_img and not batch_labels:
            break
        batch_img = np.array(batch_img)
        batch_labels = np.array(batch_labels)
        print("img shape:%s label shape:%s count:%s" %(batch_img.shape,batch_labels.shape,count))
        count +=1
        
def generate_train_batch():
    epoch 
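
The listing stops partway through `generate_train_batch`. As a hedged illustration only, and not a reconstruction of the missing code, one way to pass over a dataset for several epochs is sketched below. It works on in-memory NumPy arrays filled with made-up data; `iterate_epochs`, `num_epochs` and `batch_size` are hypothetical names.

```python
import numpy as np

def iterate_epochs(images, labels, batch_size=8, num_epochs=2):
    """Yield (epoch, image_batch, label_batch), covering the data num_epochs times."""
    n = len(images)
    for epoch in range(num_epochs):
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            yield epoch, images[start:end], labels[start:end]

# Made-up stand-in data: 20 blank "images" with labels 0..9 repeated.
fake_images = np.zeros((20, 28, 28), dtype=np.uint8)
fake_labels = np.arange(20, dtype=np.uint8) % 10

batches = list(iterate_epochs(fake_images, fake_labels, batch_size=8, num_epochs=2))
for epoch, img_batch, label_batch in batches:
    print(epoch, img_batch.shape, label_batch.shape)
```

With the file-based generator from this post, the equivalent would be to create a fresh generator at the start of each epoch, since a generator that has reached the end of the file only yields empty batches.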
            
           
