
A boneheaded mistake when building matrices in a data mining competition


scipy.sparse.hstack(blocks, format=None, dtype=None)

Stack sparse matrices horizontally (column wise).

Parameters:

blocks : sequence of sparse matrices with compatible shapes

format : str
    sparse format of the result (e.g. "csr"); by default an appropriate sparse matrix format is returned. This choice is subject to change.

dtype : dtype, optional
    The data-type of the output matrix. If not given, the dtype is determined from that of blocks.
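Before getting to the buggy code, here is a minimal, self-contained sketch (toy matrices, not the competition data) of what hstack does with mixed dtypes when you leave dtype=None: the result is promoted to the widest block dtype, so nothing is lost.

import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.array([[1.5, 2.0],
                                [0.0, 3.5]]))             # float64, shape (2, 2)
B = sparse.csr_matrix(np.array([[1], [0]], dtype=bool))   # bool, shape (2, 1)

# dtype=None (the default): the output dtype is promoted to the widest
# block dtype, float64 here, so the fractional values survive.
C = sparse.hstack((A, B), format='csr')
print(C.toarray())   # [[1.5 2.  1. ]
                     #  [0.  3.5 0. ]]
print(C.dtype)       # float64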

The function above is where the bug lives.

///////////////////////////////////////////////////////////////////////////////////////////////////

In the competition, I converted the features into a sparse matrix, adapting the code from an open-source solution:

import numpy as np
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

# Numeric features: cast straight to float64.
base_train_csr = np.float64(train_x[num_feature])
base_predict_csr = np.float64(predict_x[num_feature])

# Low-cardinality categorical features: one-hot encode and stack on.
enc = OneHotEncoder()
for feature in short_cate_feature:
    enc.fit(data[feature].values.reshape(-1, 1))
    # BUG: the third argument is dtype='bool', which casts the whole
    # result (including the float64 numeric block) down to bool.
    base_train_csr = sparse.hstack((base_train_csr, enc.transform(
        train_x[feature].values.reshape(-1, 1))), 'csr', 'bool')
    base_predict_csr = sparse.hstack((base_predict_csr, enc.transform(
        predict_x[feature].values.reshape(-1, 1))), 'csr', 'bool')
print('one-hot prepared !')

# High-cardinality categorical features: bag-of-words counts.
cv = CountVectorizer(min_df=20)
for feature in long_cate_feature:
    cv.fit(data[feature])
    base_train_csr = sparse.hstack((base_train_csr, cv.transform(train_x[feature])), 'csr', 'int')
    base_predict_csr = sparse.hstack((base_predict_csr, cv.transform(predict_x[feature])), 'csr', 'int')
print('cv prepared !')

Feeding the features into LightGBM, the loss dropped shockingly fast. I spent a whole night without finding the cause; today I redid a simple experiment from scratch and tracked it down.

Above, I converted the numeric features directly with np, and one-hot encoded the low-cardinality categorical features. The problem sits exactly here: sparse.hstack(..., 'csr', 'bool').

I stacked the float64 matrix directly against a bool matrix and let the result be cast to bool. Boneheaded: every numeric feature in front was wiped out.
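A minimal repro of that mistake with toy values (not the competition features): passing 'bool' as hstack's third (dtype) argument casts every block, including the float64 one, down to bool.

import numpy as np
from scipy import sparse

num = sparse.csr_matrix(np.array([[0.73, 12.0],
                                  [0.02,  5.0]]))            # float64 numeric features
onehot = sparse.csr_matrix(np.array([[1, 0],
                                     [0, 1]], dtype=bool))   # one-hot columns

bad = sparse.hstack((num, onehot), 'csr', 'bool')  # dtype='bool' clobbers the floats
print(bad.toarray())
# [[ True  True  True False]
#  [ True  True False  True]]   <- 0.73 and 12.0 are now both just True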

Takeaway: when chaining hstack calls, add blocks from the coarsest dtype toward the finest, e.g. bool -> int32 -> float32 -> float64. Put differently, the dtype you pass to hstack must be at least as wide as the widest block being stacked; otherwise the finer-grained features get squashed and most of their information is lost.
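To make the takeaway concrete, a hedged fix sketch with the same toy matrices as above: if you do pass a dtype, make it the widest dtype among the blocks (or simply leave dtype=None and let hstack promote it for you).

import numpy as np
from scipy import sparse

num = sparse.csr_matrix(np.array([[0.73, 12.0],
                                  [0.02,  5.0]]))            # float64
onehot = sparse.csr_matrix(np.array([[1, 0],
                                     [0, 1]], dtype=bool))   # bool

good = sparse.hstack((num, onehot), 'csr', 'float64')  # widest dtype wins
print(good.toarray())
# [[ 0.73 12.    1.    0.  ]
#  [ 0.02  5.    0.    1.  ]]
print(good.dtype)   # float64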
