
Converting .mat, .txt, and .csv data to Weka's ARFF format, and converting between MATLAB and Weka formats

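function r = CSVtoARFF(data, relation, type)
% Write a numeric matrix (e.g. loaded from a CSV file) to <type>.arff.
% The last column of data is the label; -1 marks a missing feature value.

% data is already in memory; get its dimensions
[rows, cols] = size(data);

% open the arff file for writing
farff = fopen(strcat(type, '.arff'), 'w');

% print the relation part of the header
fprintf(farff, '@relation %s\n', relation);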

% Copy the rest of the header from ARFFheader.txt. That file is expected to
% supply the @attribute lines and the @data line, since nothing else writes them.
fid = fopen('ARFFheader.txt', 'r');
tline = fgets(fid);
while ischar(tline)
    fprintf(farff, '%s', tline);   % write the current line before reading the next
    tline = fgets(fid);
end
fclose(fid);

% Converting the data
for i = 1 : rows
    % print the attribute values for the data point
    for j = 1 : cols - 1
        if data(i,j) ~= -1 % check if it is a missing value
            fprintf(farff, '%.16g,', data(i,j));
        else
            fprintf(farff, '?,');
        end
    end
    % print the label for the data point
    fprintf(farff, '%.16g\n', data(i,end));
end

% close the file
fclose(farff);
r = 0;

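The drawback of this method is that ARFFheader.txt has to be supplied separately. With only a few attributes the header is usually written by hand, but with many attributes that gets tedious, and it is easier to generate the header with a loop in code.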
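As a rough illustration of generating the header in a loop rather than maintaining ARFFheader.txt by hand, here is a minimal sketch. The helper name writeArffHeader and the att_1, att_2, ... attribute names are made up for this example, and it assumes every feature is numeric and the label is a set of numeric class values.

```matlab
function writeArffHeader(fid, relation, numFeatures, labels)
% Hypothetical helper (not part of the original code): write an ARFF header
% to an already-open file. Assumes numeric features and numeric class labels.
fprintf(fid, '@relation %s\n\n', relation);

% one numeric attribute per feature column
for k = 1:numFeatures
    fprintf(fid, '@attribute att_%d numeric\n', k);
end

% nominal label attribute listing every distinct class value
labelStrs = arrayfun(@(v) sprintf('%.16g', v), labels(:)', 'UniformOutput', false);
fprintf(fid, '@attribute label {%s}\n\n', strjoin(labelStrs, ','));

fprintf(fid, '@data\n');
end
```

Inside CSVtoARFF this could replace the block that copies ARFFheader.txt, for example with writeArffHeader(farff, relation, cols - 1, unique(data(:, end))), in which case the separate @relation line is no longer needed.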

function Mat2Arff(input_filename, arff_filename)
%
% Convert the input data to the '.arff' file format used by Weka.
%
% Parameters:
% input_filename -- input file name; only '.mat', '.txt'
% or '.csv' files can be converted
% arff_filename -- the output '.arff' file
%
% NOTES:
% The input data must be an M*N matrix in the following format:
% M: number of samples;
% N: sample features plus label; columns 1:N-1 are features, column N is the label


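% Read the data from the input file
if ~isempty(strfind(input_filename, '.mat'))
    matdata = importdata(input_filename);   % assumes the .mat file holds a single numeric matrix
elseif ~isempty(strfind(input_filename, '.txt'))
    matdata = textread(input_filename);
elseif ~isempty(strfind(input_filename, '.csv'))
    matdata = csvread(input_filename);
else
    error('Unsupported input file type: %s', input_filename);
end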

[row, col] = size(matdata);
f = fopen(arff_filename, 'wt');
if f < 0
    error('Unable to open the file %s', arff_filename);
end
fprintf(f,'%s\n',['@relation ',arff_filename]);
for i = 1 : col - 1
st = ['@attribute att_',num2str(i),' numeric'];
fprintf(f,'%s\n',st);
end;
% Write the last header line: the class label attribute
floatformat = '%.16g';
Y = matdata(:, col);
uY = unique(Y); % the distinct label values
st = '@attribute label {';
for j = 1 : numel(uY) - 1
    st = [st sprintf([floatformat ','], uY(j))];
end
st = [st sprintf([floatformat '}'], uY(end))];
fprintf(f, '%s\n\n', st);
% Write the data section
fprintf(f, '@data\n');
for i = 1 : row
    Xi = matdata(i, 1:col-1);
    % comma-separated feature values followed by the label, as in the ARFF data section
    s = [sprintf([floatformat ','], Xi) sprintf(floatformat, Y(i))];
    fprintf(f, '%s\n', s);
end
fclose(f);
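A short usage sketch for the function above. The matrix values and the file names demo.csv and demo.arff are invented for illustration, and the call assumes Mat2Arff takes the input and output file names as its two arguments.

```matlab
% Four samples, two features each; the last column is the class label.
M = [5.1 3.5 0;
     4.9 3.0 0;
     6.3 3.3 1;
     5.8 2.7 1];

csvwrite('demo.csv', M);            % hypothetical input file
Mat2Arff('demo.csv', 'demo.arff');  % hypothetical output file
type('demo.arff');                  % print the generated ARFF file
```

The generated header should declare two numeric attributes (att_1, att_2) and one label attribute listing the two class values.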

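Finally, here is a brief introduction to data handling in Weka.
Overview of data mining and introduction to Weka: data mining study and Weka usage (1)
Input data and ARFF files: data mining study and Weka usage (2)
A quick summary:
An ARFF file in Weka consists of two parts: the header and the data section.
The header contains the relation name, a list of attributes, and the type of each attribute, for example: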

   @RELATION iris

   @ATTRIBUTE sepallength  NUMERIC 
   @ATTRIBUTE sepalwidth   NUMERIC 
   @ATTRIBUTE petallength  NUMERIC 
   @ATTRIBUTE petalwidth   NUMERIC 
   @ATTRIBUTE class        {Iris-setosa,Iris-versicolor,Iris-virginica}

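NUMERIC means the attribute is numeric. The class attribute is nominal: its value is restricted to one of Iris-setosa, Iris-versicolor, or Iris-virginica. An attribute type can also be string or date. The data section begins with @DATA, for example: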

@DATA 

   5.1,3.5,1.4,0.2,Iris-setosa 
   4.9,3.0,1.4,0.2,Iris-setosa 
   4.7,3.2,1.3,0.2,Iris-setosa 
   4.6,3.1,1.5,0.2,Iris-setosa 
   5.0,3.6,1.4,0.2,Iris-setosa 
   5.4,3.9,1.7,0.4,Iris-setosa 
   4.6,3.4,1.4,0.3,Iris-setosa 
   5.0,3.4,1.5,0.2,Iris-setosa 
   4.4,2.9,1.4,0.2,Iris-setosa 
   4.9,3.1,1.5,0.1,Iris-setosa

So a complete ARFF file looks like this:

@RELATION iris

@ATTRIBUTE sepallength  NUMERIC 
@ATTRIBUTE sepalwidth   NUMERIC 
@ATTRIBUTE petallength  NUMERIC 
@ATTRIBUTE petalwidth   NUMERIC 
@ATTRIBUTE class        {Iris-setosa,Iris-versicolor,Iris-virginica}

@DATA 
5.1,3.5,1.4,0.2,Iris-setosa 
4.9,3.0,1.4,0.2,Iris-setosa 
4.7,3.2,1.3,0.2,Iris-setosa 
4.6,3.1,1.5,0.2,Iris-setosa 
5.0,3.6,1.4,0.2,Iris-setosa 
5.4,3.9,1.7,0.4,Iris-setosa 
4.6,3.4,1.4,0.3,Iris-setosa 
5.0,3.4,1.5,0.2,Iris-setosa 
4.4,2.9,1.4,0.2,Iris-setosa 
4.9,3.1,1.5,0.1,Iris-setosa
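For simple files like the one above, going the other way (ARFF back into MATLAB) can be done with plain text parsing. The following is a minimal sketch, not a full ARFF parser: the helper name readSimpleArff is made up, and it assumes every attribute is numeric except the last one, which is nominal, with no quoted values and no sparse rows.

```matlab
function [X, labels, attrNames] = readSimpleArff(filename)
% Hypothetical helper: read an ARFF file whose attributes are all numeric
% except the last (nominal) one, as in the iris example.
fid = fopen(filename, 'r');
if fid < 0
    error('Unable to open the file %s', filename);
end
attrNames = {};
X = [];
labels = {};
inData = false;
tline = fgetl(fid);
while ischar(tline)
    tline = strtrim(tline);
    if isempty(tline) || tline(1) == '%'
        % skip blank lines and ARFF comment lines
    elseif ~inData && strncmpi(tline, '@attribute', 10)
        parts = strsplit(tline);           % split on whitespace
        attrNames{end+1} = parts{2};       %#ok<AGROW>
    elseif ~inData && strncmpi(tline, '@data', 5)
        inData = true;
    elseif inData
        vals = strtrim(strsplit(tline, ','));
        X(end+1, :) = str2double(vals(1:end-1));  %#ok<AGROW> '?' becomes NaN
        labels{end+1, 1} = vals{end};             %#ok<AGROW>
    end
    tline = fgetl(fid);
end
fclose(fid);
end
```

Calling [X, labels, names] = readSimpleArff('iris.arff') on the file above (the file name is assumed) would give a numeric matrix with four columns, a cell array of class names, and the attribute names from the header.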

Weka uses its own file format, called ARFF. If you want to convert between MATLAB and Weka, a ready-made package is available.

Don't assume it works right after downloading, though. It fails at the following lines:

if(~wekaPathCheck),wekaOBJ = []; return,end

import weka.core.converters.ArffLoader;

import java.io.File;

The tricky part is that weka.jar has to be added to MATLAB's classpath.txt. Where is classpath.txt? Type this in the MATLAB command window:

which classpath.txt
D:\CMWang\MATLABR2014b\toolbox\local\classpath.txt

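Then add a line to classpath.txt containing the absolute path to weka.jar, and restart MATLAB so the static class path is reloaded. For example: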

C:\Program Files\Weka-3-8\weka.jar
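If editing classpath.txt is not an option, MATLAB can also add a jar to the dynamic class path for the current session with javaaddpath. The sketch below shows that, plus a direct load of an ARFF file through Weka's ArffLoader. The jar location and the file name iris.arff are assumptions, and whether the package's own wekaPathCheck accepts a dynamically added jar has not been verified here.

```matlab
% Assumed install location; adjust to wherever weka.jar actually lives.
wekaJar = 'C:\Program Files\Weka-3-8\weka.jar';

% Add weka.jar to the dynamic class path if it is not on any class path yet.
if ~any(strcmp(javaclasspath('-all'), wekaJar))
    javaaddpath(wekaJar);
end

% Load an ARFF file with Weka's own loader (file name is hypothetical).
loader = weka.core.converters.ArffLoader();
loader.setFile(java.io.File('iris.arff'));
instances = loader.getDataSet();

% Copy the data into a MATLAB matrix, one attribute (column) at a time.
nInst = instances.numInstances();
nAttr = instances.numAttributes();
data = zeros(nInst, nAttr);
for k = 1:nAttr
    % Weka indexes attributes from 0; nominal values come back as value indices.
    data(:, k) = instances.attributeToDoubleArray(k - 1);
end
```

Fully qualified class names are used instead of import statements so the snippet does not depend on where the import lines are placed.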