
SAS DM Data Preparation Reading Notes 9 (Data Sampling and Partitioning)

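When analyzing a large dataset, you usually cannot analyze all of the data at once, so a portion is often sampled for testing. In addition, when training a model, the dataset is commonly split into three parts: a training set, a validation set, and a test set.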

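Sampling is therefore a basic skill that must be mastered. Speaking of sampling, I once helped the Shenzhen Power Supply Bureau design a business-audit sampling decision system. It used many quite complex sampling methods to ensure that the scores of the samples drawn for each district bureau approximated the true situation of that whole district bureau. This was probably the first commercial statistical analysis project I was in charge of. I miss those days of fighting side by side with my teammates; unfortunately, everyone on that project team has since left and gone their separate ways.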

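The most commonly used sampling methods are random sampling and stratified sampling. SAS has a number of sampling-related PROCs; PROC SURVEYSELECT draws the samples, and the others analyze data collected from sample surveys: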

PROC SURVEYSELECT

PROC SURVEYMEANS

PROC SURVEYFREQ

PROC SURVEYREG

PROC SURVEYLOGISTIC

PROC SURVEYPHREG

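(1) Simple random sampling. The dataset to sample from is PopDS, the result is stored in SampleDS, the sample size is SampleSize, and the sampling method is SRS (simple random sampling).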

%MACRO RandomSample(PopDS, SampleDS, SampleSize);
/* This macro performs simple random sampling */
PROC SURVEYSELECT
	DATA=&PopDs 
	METHOD=srs 
	N=&SampleSize 
	NOPRINT 
	OUT=&SampleDS;
RUN;
%MEND;
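As a usage sketch, the macro above could be called as follows. The input `sashelp.cars` (a sample dataset shipped with SAS), the output names, and the sample size of 50 are illustrative assumptions, not part of the original notes. The second step shows the same draw written out directly with the `SEED=` option, which the macro does not set, so that the sample can be reproduced.

```sas
/* Hypothetical call: draw 50 records from SASHELP.CARS */
%RandomSample(sashelp.cars, work.cars_sample, 50);

/* Same kind of draw written out directly; SEED= fixes the random
   number stream so that reruns return the same records.
   The seed value 12345 is arbitrary. */
proc surveyselect
	data=sashelp.cars
	method=srs
	n=50
	seed=12345
	noprint
	out=work.cars_sample_seeded;
run;
```

Without `SEED=`, PROC SURVEYSELECT derives a seed from the system clock, so two runs of the macro generally return different samples.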

(2) Non-overlapping sampling

If a training set and a validation set are both drawn from the same dataset, the two must not overlap.

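Algorithm design: first draw the first dataset S1 from the population S and create a variable selected=1 in it. Then merge S1 with S, so that the selected field in S marks which records belong to S1. Finally, sample from the records whose selected is not equal to 1 to produce dataset S2.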
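The idea can be sketched without the merge step by letting PROC SURVEYSELECT create the flag itself. This is a swapped-in shortcut rather than the approach described above: with the `OUTALL` option the output keeps every input record and adds a variable `Selected` (1 for sampled records, 0 otherwise). The dataset `work.pop`, the sample sizes, and the seeds are hypothetical.

```sas
/* Sketch only: work.pop and the sizes 1000 / 500 are assumptions */

/* Step 1: draw S1 and flag it. OUTALL keeps all records of the
   population and adds Selected = 1 for the sampled ones. */
proc surveyselect data=work.pop method=srs n=1000 seed=1
                  outall noprint out=work.pop_flagged;
run;

/* Split the flagged population into S1 and the remaining records */
data work.s1 work.rest;
	set work.pop_flagged;
	if Selected = 1 then output work.s1;
	else output work.rest;
	drop Selected;
run;

/* Step 2: draw S2 only from the records that are not in S1 */
proc surveyselect data=work.rest method=srs n=500 seed=2
                  noprint out=work.s2;
run;
```

Because S2 is drawn only from `work.rest`, the two samples cannot share a record.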

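Sampling must also take balance into account. For example, in a credit card default analysis, the default rate in the population might be 0.1%, but to obtain statistically significant model results, the two samples may each be required to have a default rate of 10%.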
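Oversampling the rare class limits how large the samples can be. Writing N and P for the population size and its proportion of defaults, and N1, P1, N2, P2 for the sizes and default proportions of the two samples, the requested samples are feasible only if the following conditions hold (these are the same checks the macro below performs):

```latex
\begin{aligned}
N_1 + N_2 &\le N \\
N_1 P_1 + N_2 P_2 &\le N P \\
N_1 (1 - P_1) + N_2 (1 - P_2) &\le N (1 - P)
\end{aligned}
```

As a worked example with assumed numbers: if N = 1,000,000 and P = 0.001, the population holds NP = 1,000 defaults. Requesting P1 = P2 = 0.10 turns the second condition into 0.10 (N1 + N2) <= 1,000, so the two samples together can contain at most 10,000 records.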

The program is implemented as follows:

/*** Data Preparation for Data Mining Using SAS
     by Mamdouh Refaat
     Morgan Kaufmann, 2006
****/



%macro B2samples(S,IDVar,DV,S1,N1,P1,S2,N2,P2,M_St);
/* 
This macro attempts to draw two balanced samples S1, S2 of sizes
N1, N2 and proportions P1, P2, from a population (Dataset S). 
The balancing is based on the values of the DV, which are 
restricted to "1","0". Missing values of DV are ignored.

Before trying to do the sampling, the macro checks the 
consistency conditions. If any of them is not satisfied, 
an error message is written to the macro variable named by M_St. 
If the data passes the checks, that variable is set to 
"OK" and the samples are drawn and output to the 
datasets S1, S2.

All sampling work is based on using the IDVar variable in the 
population dataset S. It is recommended that the dataset S 
contains only the ID and DV variables for good performance. 

The macro guarantees that the two datasets are disjoint. 
*/

/*Calculate N, P of the population*/
proc sql noprint;
 select count(*) into :N from &S; /* Size of population */
 select count(*) into :NP 
             from &S where &DV=1; /* count of "1" */
 quit;
%let NPc=%eval(&N - &NP);	/* Count of "0" (complement) */

/* Check the consistency conditions */

%let Nx=%eval(&N1 + &N2);
%if &Nx > &N %then %do;
	%let &M_st = Not enough records in population to generate samples. Sampling canceled.;
	%goto Exit;
%end;


/* N1 P1 + N2 P2 <= N P */

%let Nx = %sysevalf((&N1*&P1+ &N2 * &P2), integer);
%if &Nx > &NP %then %do;
	%let &M_st = Count of DV=1 in requested samples exceeds total count in population. Sampling canceled.;
	%goto Exit;
%end;

/* N1(1-P1) + N2(1-P2) <= N(1-P)*/
%let Nx = %sysevalf( (&N1*(1-&P1) + &N2*(1-&P2) ), integer);
%if &Nx > &NPc %then %do;
	%let &M_st = Count of DV=0 in requested samples exceeds total count in population. Sampling canceled.;
	%goto Exit;
%end;
/* Otherwise, OK */
%let &M_St=OK;

/* Sort the population using the DV in ascending order*/
proc sort data=&S;
	by &DV;
run;

/* Draw the sample S1 with size N1 and number 
   of records N1P1, N1(1-P1) in the strata 1,0 of the DV */
%let Nx1=%Sysevalf( (&N1*&P1),integer);
%let Nx0=%eval(&N1 - &Nx1);

proc surveyselect noprint
		data =&S
        method = srs
	    n=( &Nx0 &Nx1)
		out=&S1;
		strata &DV;
run;

/* Add a new field to S1 call it (Selected) 
   and give it a value of 1. */
data &S1; 
 set &S1;
  selected =1;
  keep &IDVar &DV Selected;
 run;

/* Merge S1 with the population S to flag the 
   records that were already selected. */
proc sort data=&S;
	by &IDVar;
run;

proc sort data=&S1;
	by &IDVar;
run;
Data temp;
 merge &S &S1;
 by &IDvar;
  keep &IDVar &DV Selected;
run;

/* Draw the sample S2 with size N2 and number 
   of records N2P2, N2(1-P2) in the strata 1,0 of the 
   DV under the condition that Selected is NOT 1 */
   
proc sort data=temp;
	by &DV;
run;
%let Nx1=%Sysevalf( (&N2*&P2),integer);
%let Nx0=%eval(&N2 - &Nx1);
proc surveyselect noprint
	    data =temp
        method = srs
	    n=( &Nx0 &Nx1)
		out=&S2;
		strata &DV;
		where Selected NE 1;
run;

/* clean S1, S2  and workspace*/
Data &S1;
 set &S1;
 keep &IDvar &DV;
run;

Data &S2;
 set &S2;
 keep &IDVar &DV;
run;

proc datasets library=work nodetails;
 delete temp;
run;
quit;

%exit: ;
%mend;
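A call to the macro might look like the sketch below. The dataset `work.pop`, its key `id`, its 0/1 target `dv` (assumed to have no missing values), the sample sizes, and the proportions are all hypothetical. Because the macro writes its status with `%let &M_St = ...`, the last argument is the name of a macro variable, which should exist before the call so that the value is still visible afterwards.

```sas
/* Hypothetical input: work.pop with unique key id and 0/1 target dv */
%let Status=;   /* create the status variable before the call */

%B2samples(work.pop, id, dv,
           work.train, 2000, 0.10,
           work.valid, 1000, 0.10,
           Status);

%put NOTE: B2samples status = &Status;

/* Disjointness check: the number of shared IDs should be 0 */
proc sql;
	select count(*) as overlap
	from work.train as a
	     inner join work.valid as b
	     on a.id = b.id;
quit;
```

Note that the macro sorts the population dataset in place and uses a work dataset named `temp`, so any existing `work.temp` would be overwritten and then deleted.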