Creating the train, val, and test files when samples are not randomly distributed in machine learning
阿新 • Published: 2018-11-01
My previous blog post walked through randomly assigning training samples to the sets according to a specified ratio; see:
https://blog.csdn.net/lingyunxianhe/article/details/81837978
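That approach assumes the classes are randomly, or more precisely uniformly, distributed across the samples, rather than one class being concentrated in a contiguous stretch of the data. If they are not, a random split can leave train, val, and test poorly balanced.
Because I labeled the data by hand, and keeping similar images together makes labeling easier, images of the same class can end up clustered. In my dataset, one class accounts for more than 80% of the samples across a run of over 500 consecutive images. To keep the train/val/test split reasonable, whenever one class dominates a run of consecutive images I record those images in a txt file (a small script that writes the image names to the file is enough). I then split that small set on its own by the train/val ratio (my test set is fixed; if yours is not, split the small set by the train/val/test ratio instead). Finally, I merge the small sets into one large set. The code is as follows: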
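One way to check whether a split came out skewed is to measure what share of each split comes from the concentrated run. The sketch below is not from the original post: `share_in_range` is a hypothetical helper, the ID lists are made up, and the range 7480 to 8594 is only a guess at what the file name `7480_8594ManyBlis.txt` in the script encodes.

```python
def share_in_range(ids, lo, hi):
    """Return the fraction of ids that fall inside [lo, hi] (0.0 for an empty list)."""
    ids = list(ids)
    if not ids:
        return 0.0
    inside = sum(1 for i in ids if lo <= i <= hi)
    return inside / float(len(ids))


# Hypothetical splits, made up for illustration only.
train_ids = [12, 305, 7481, 7500, 7999, 8100]
val_ids = [44, 910, 2000, 8593]

# Assumed ID range of the run in which one class dominates.
print("train share:", share_in_range(train_ids, 7480, 8594))
print("val share:", share_in_range(val_ids, 7480, 8594))
```

If the two shares differ a lot, the run is over-represented in one of the splits, which is the situation that splitting each recorded run separately is meant to avoid.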
#!/usr/bin/python
# -*- coding: UTF-8 -*-
# 2018/08/11 by DQ
from __future__ import print_function

import os
import random

MidFolder='py-faster-rcnn'
MainFolder=os.path.join('/home/KingMe/project',MidFolder,'data/FABdevkit2017/FAB2017/ImageSets/Main')
AnotFolder=os.path.join('/home/KingMe/project',MidFolder,'data/FABdevkit2017/FAB2017/Annotations')
fileIdLen=6  # image IDs are zero-padded to this width
# Total number of images; the script treats the image IDs as 1..CurImNum
CurImNum=len(os.listdir(AnotFolder))

######################last start#############################
# Write the image IDs to a txt file, one zero-padded ID per line
def CreateImIdTxt(ImIdS,FilePath):
    with open(FilePath,'w') as FId:  # mode 'w' overwrites any existing file
        for ImId in ImIdS:
            FId.write(str(ImId).zfill(fileIdLen)+'\n')

# Read the image IDs recorded in the given txt file
def GetPointTxtImIdSet(FilePath):
    ImIdSet=[]
    if os.path.exists(FilePath):
        with open(FilePath) as FId:
            for TxtStr in FId:
                ImId=TxtStr.split()
                if ImId:  # skip blank lines
                    ImIdSet.append(int(ImId[0]))
    return ImIdSet

# Shuffle the IDs and split them into train and val according to TrainR
def AssignImIdSetAsRatio(ImIdSet,TrainR):
    ImIdSet=list(ImIdSet)  # work on a copy so the caller's list keeps its order
    random.shuffle(ImIdSet)
    TrainNum=int(TrainR*len(ImIdSet))
    TrainImId=ImIdSet[:TrainNum]  # the first TrainNum shuffled IDs go to train
    ValImId=ImIdSet[TrainNum:]    # the rest go to val
    return TrainImId,ValImId

# Write train.txt, val.txt and trainval.txt into MainFolder
def WriteImIdSet2TrainValTxt(TrainImId,ValImId,TrainValImId):
    TrainValTestIds={'train':sorted(TrainImId),'val':sorted(ValImId),'trainval':sorted(TrainValImId)}
    TrainValTestFiles={'train':'train.txt','val':'val.txt','trainval':'trainval.txt'}
    for Key,FileName in TrainValTestFiles.items():
        print('start create '+Key+' ImSet')
        FilePath=os.path.join(MainFolder,FileName)
        CreateImIdTxt(TrainValTestIds[Key],FilePath)

def FixTestDeassignTrainVal():
    TrainR=0.7
    SubFolder='TestSetOrOtherBackup'

    FileName='test.txt'  # the test set is fixed; I have two classes here
    FilePath=os.path.join(MainFolder,SubFolder,FileName)
    TestImIdSet=GetPointTxtImIdSet(FilePath)

    # Each txt file below records a run of consecutive images in which one class dominates
    FileName='7480_8594ManyBlis.txt'
    FilePath=os.path.join(MainFolder,SubFolder,FileName)
    ManyBlisImIdSet=GetPointTxtImIdSet(FilePath)

    FileName='8594-8879ManyBreak.txt'
    FilePath=os.path.join(MainFolder,SubFolder,FileName)
    ManyBreakImIdSet=GetPointTxtImIdSet(FilePath)

    ImIdSet0=range(1,CurImNum+1)
    ImIdSet1=list(set(ImIdSet0).difference(set(TestImIdSet)))  # remove the test set from the full set
    ImIdSet2=list(set(ImIdSet1).difference(set(ManyBlisImIdSet)))
    ImIdSet=list(set(ImIdSet2).difference(set(ManyBreakImIdSet)))

    TrainImId,ValImId=AssignImIdSetAsRatio(ImIdSet,TrainR)  # split the images not recorded in any txt file
    MBlistTrainImId,MBlistValImId=AssignImIdSetAsRatio(ManyBlisImIdSet,TrainR)  # split each recorded small set on its own
    MBreakTrainImId,MBreakValImId=AssignImIdSetAsRatio(ManyBreakImIdSet,TrainR)

    # Merge the small sets into the large sets
    TrainImId=TrainImId+MBlistTrainImId+MBreakTrainImId
    ValImId=ValImId+MBlistValImId+MBreakValImId
    TrainValImId=ImIdSet1  # trainval is everything except the test set

    WriteImIdSet2TrainValTxt(TrainImId,ValImId,TrainValImId)
######################last end#############################

FixTestDeassignTrainVal()
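The post mentions that when the test set is not fixed, each small set has to be split by the train/val/test ratio instead. A minimal sketch of that variant might look like the following. It is not the author's code: `split_by_ratio`, `split_groups`, the ratios 0.7 and 0.15, the seed, and the ID ranges are all assumptions chosen for illustration, and it does no file I/O.

```python
import random


def split_by_ratio(ids, train_r, val_r, rng):
    """Shuffle a copy of ids and cut it into train / val / test parts."""
    ids = list(ids)
    rng.shuffle(ids)
    n_train = int(train_r * len(ids))  # int() truncates toward zero
    n_val = int(val_r * len(ids))
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]


def split_groups(groups, train_r=0.7, val_r=0.15, seed=0):
    """Split every group on its own, then merge the parts into three lists."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, val, test = [], [], []
    for ids in groups:
        tr, va, te = split_by_ratio(ids, train_r, val_r, rng)
        train += tr
        val += va
        test += te
    return sorted(train), sorted(val), sorted(test)


# Hypothetical data: IDs 1-100 are ordinary images,
# IDs 101-200 stand for a run dominated by one class.
groups = [range(1, 101), range(101, 201)]
train, val, test = split_groups(groups)
print(len(train), len(val), len(test))
```

Because each group is cut with the same ratios before merging, the concentrated run contributes the same proportion of its images to train, val, and test as the rest of the data does.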