Parsing the MS-Celeb-1M Dataset with Python

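I downloaded the MS-Celeb-1M face recognition dataset from Microsoft. The download comes as a .tsv file. The dataset's official site describes the file format as follows: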

File format: text files, each line is an image record containing 7 columns, delimited by TAB.
Column1: Freebase MID
Column2: ImageSearchRank
Column3: ImageURL
Column4: PageURL
Column5: FaceID
Column6: FaceRectangle_Base64Encoded (four floats, relative coordinates of UpperLeft and BottomRight corner)
Column7: FaceData_Base64Encoded

I decided to write a Python script to parse this .tsv file and extract each image together with its face information. Here is the parsing code:

import base64
import struct
import os

def readline(line):
    MID,ImageSearchRank,ImageURL,PageURL,FaceID,FaceRectangle,FaceData=line.split("\t")
    rect=struct.unpack("ffff",base64.b64decode(FaceRectangle))
    return MID,ImageSearchRank,ImageURL,PageURL,FaceID,rect,base64.b64decode(FaceData)

def writeImage(filename,data):
    with open(filename,"wb") as f:
        f.write(data)

def unpack(filename,target="img"):
    i=0
    with open(filename,"r",encoding="utf-8") as f:
        for line in f:
            MID,ImageSearchRank,ImageURL,PageURL,FaceID,FaceRectangle,FaceData=readline(line)
            # One sub-directory per person (Freebase MID).
            img_dir=os.path.join(target,MID)
            # makedirs also creates the parent target directory if it is missing.
            os.makedirs(img_dir,exist_ok=True)
            img_name="%s-%s"%(ImageSearchRank,FaceID)+".jpg"
            writeImage(os.path.join(img_dir,img_name),FaceData)
            i+=1
            if i%1000==0:
                print(i,"imgs finished")
    print("all finished")

filename="MsCelebV1-Faces-Aligned.tsv"
unpack(filename)

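A tsv file is similar to a csv file: tsv fields are separated by tabs, while csv fields are separated by commas. Following the format description, each line is split into its seven columns: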

MID,ImageSearchRank,ImageURL,PageURL,FaceID,FaceRectangle,FaceData=line.split("\t")
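
The one-line split above raises a fairly opaque unpacking error if a line does not have exactly seven fields. As a minimal sketch, the split can be wrapped in a helper that checks the column count first. The names `COLUMNS` and `split_record` are hypothetical, and the sample line is made up for illustration; it is not taken from the dataset.

```python
# Column names follow the format description quoted above.
COLUMNS = ("MID", "ImageSearchRank", "ImageURL", "PageURL",
           "FaceID", "FaceRectangle", "FaceData")

def split_record(line):
    """Split one TSV line into a dict keyed by column name."""
    # Drop the trailing line break so it does not end up in the last field.
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError("expected %d columns, got %d" % (len(COLUMNS), len(fields)))
    return dict(zip(COLUMNS, fields))

# Made-up sample line, for illustration only.
sample = "m.0abc\t3\thttp://example.com/a.jpg\thttp://example.com/\tFaceId-0\tAAAA\tBBBB\n"
record = split_record(sample)
print(record["MID"], record["FaceID"])
```

Returning a dict keeps the later steps readable (`record["FaceData"]` rather than a positional index), at the cost of a little overhead per line.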

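The face rectangle is stored as (upper-left corner, bottom-right corner) in relative coordinates. It is Base64-encoded, and the decoded bytes hold four floats: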

rect=struct.unpack("ffff",base64.b64decode(FaceRectangle))
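
Because the rectangle is relative, it has to be scaled by the image size before it can be used for cropping. The sketch below shows one way to do that, under assumptions the format description does not confirm: the floats are taken to be 32-bit little-endian (`"<4f"`), and their order is taken to be left, top, right, bottom. The helpers `decode_rect` and `to_pixels` are hypothetical, and the encoded value is synthetic.

```python
import base64
import struct

def decode_rect(rect_b64):
    """Decode the Base64 rectangle into four relative coordinates."""
    # "<4f" = four little-endian 32-bit floats (byte order is an assumption).
    return struct.unpack("<4f", base64.b64decode(rect_b64))

def to_pixels(rect, width, height):
    """Scale relative coordinates to integer pixel positions."""
    # Assumed order: left, top, right, bottom.
    left, top, right, bottom = rect
    return (round(left * width), round(top * height),
            round(right * width), round(bottom * height))

# Synthetic rectangle covering the central half of a 200x100 image.
encoded = base64.b64encode(struct.pack("<4f", 0.25, 0.25, 0.75, 0.75))
rect = decode_rect(encoded)
box = to_pixels(rect, 200, 100)
print(rect, box)
```

If the decoded values fall far outside the range 0 to 1 on real data, the byte-order or ordering assumption is probably wrong and should be revisited.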

The face data is also Base64-encoded, so it has to be decoded before it is saved as an image:

data=base64.b64decode(FaceData)
with open(filename,"wb") as f:
    f.write(data)
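
The script saves every image with a .jpg extension without inspecting the bytes. As a hedged sketch, the decoded data can be checked for the JPEG start-of-image marker (`FF D8`) while it is written. Whether every record in the dataset is actually a JPEG is an assumption here, `save_face` is a hypothetical helper, and the payload below is synthetic filler rather than a real image.

```python
import base64
import os
import tempfile

# Every JPEG file starts with the two-byte start-of-image marker.
JPEG_MAGIC = b"\xff\xd8"

def save_face(face_b64, path):
    """Decode the Base64 face data, write it to path, report if it looks like a JPEG."""
    data = base64.b64decode(face_b64)
    with open(path, "wb") as f:
        f.write(data)
    return data.startswith(JPEG_MAGIC)

# Synthetic payload: the JPEG marker followed by filler bytes, not a real image.
fake = base64.b64encode(JPEG_MAGIC + b"\x00" * 8).decode("ascii")
out_path = os.path.join(tempfile.mkdtemp(), "face.jpg")
looks_like_jpeg = save_face(fake, out_path)
print(looks_like_jpeg, os.path.getsize(out_path))
```

A `False` result does not necessarily mean the record is corrupt; it only suggests that the .jpg extension may not match the actual contents.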