
Text Classification with scikit-learn

1. Data Source

The data is already labeled; a detailed description is given in SMS Spam Collection v. 1, and it can be downloaded from GitHub (the data is in chapter 4). Each row contains two columns separated by a comma: the first column is the class label, the second is the text.

sms = pd.read_csv(filename, sep=',', header=0, names=['label','text'])

sms.head(11)
Out[5]: 
     label                                               text
0      ham  Go until jurong point, crazy.. Available only ...
1      ham                      Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3      ham  U dun say so early hor... U c already then say...
4      ham  Nah I don't think he goes to usf, he lives aro...
5     spam  FreeMsg Hey there darling it's been 3 week's n...
6      ham  Even my brother is not like to speak with me. ...
7      ham  As per your request 'Melle Melle (Oru Minnamin...
8     spam  WINNER!! As a valued network customer you have...
9     spam  Had your mobile 11 months or more? U R entitle...
10     ham  I'm gonna be home soon and i don't want to tal...
2. Data Preparation

There are 5,574 rows in total. 500 rows are drawn at random as the test set and the rest are used as the training set; the function below was defined for this. Running it revealed a small problem: it marks slightly fewer than 500 rows, because the generated random indices can repeat. Here n is the number of rows to draw and size is the number of rows in the whole dataset (a possible fix is sketched after the function).

def randomSequence(n, size):
    result = [0 for i in range(size)]  # 0 = training row, 1 = test row
    for i in range(n):
        x = random.randrange(0, size-1, 1)  # repeats are possible, so fewer than n rows may be marked
        result[x] = 1
    return result
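
A minimal fix, as a sketch: random.sample draws n distinct indices, so exactly n rows are marked. Sampling from range(size) also removes a second subtle flaw of the original, namely that randrange(0, size-1) can never return the last index.

import random

def randomSequence(n, size):
    result = [0] * size
    for x in random.sample(range(size), n):  # n distinct indices covering 0..size-1
        result[x] = 1
    return result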
3. Feature Extraction

Before invoking a classification algorithm, the text has to be converted into features. scikit-learn's CountVectorizer and TfidfTransformer classes handle the feature extraction. The test set shares the vocabulary produced from the training set.
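
A minimal sketch of the two-step pipeline (the two-document corpus here is made up for illustration, not the SMS data): CountVectorizer learns a vocabulary and produces term counts, TfidfTransformer converts those counts into TF-IDF weights, and passing vocabulary= to a second vectorizer makes the test matrix share the training columns.

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

train_docs = ["free prize win win", "meet me for lunch"]  # toy corpus
test_docs = ["win a free lunch"]

cv = CountVectorizer()
counts_train = cv.fit_transform(train_docs)    # learns the vocabulary from the training text
tfidf = TfidfTransformer().fit(counts_train)   # learns IDF weights from the training counts
tfidf_train = tfidf.transform(counts_train)

cv_test = CountVectorizer(vocabulary=cv.vocabulary_)  # reuse the training vocabulary
counts_test = cv_test.fit_transform(test_docs)
tfidf_test = tfidf.transform(counts_test)      # reuse the training IDF weights; do not refit
print(tfidf_train.shape, tfidf_test.shape)     # both matrices have the same number of columns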

4. Complete Code

# -*- coding: utf-8 -*-

import random
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB 

# Generate a random 0/1 sequence for selecting training and test rows
def randomSequence(n, size):
    result = [0 for i in range(size)]
    for i in range(n):
        x = random.randrange(0, size-1, 1)
        result[x] = 1
    return result
    
if __name__ == '__main__':
    # Read the data
    filename = 'data/sms_spam.csv'  
    sms = pd.read_csv(filename, sep=',', header=0, names=['label','text'])
    
    # Split into training and test sets
    size = len(sms)
    sequence = randomSequence(500, size)
    sms_train_mask = [sequence[i]==0 for i in range(size)]
    sms_train = sms[sms_train_mask]  
    sms_test_mask = [sequence[i]==1 for i in range(size)]   
    sms_test = sms[sms_test_mask]
    
    # Convert the text to TF-IDF vectors
    train_labels = sms_train['label'].values
    train_features = sms_train['text'].values
    count_v1 = CountVectorizer(stop_words='english', max_df=0.5, decode_error='ignore')
    counts_train = count_v1.fit_transform(train_features)
    #print(count_v1.get_feature_names())
    #repr(counts_train.shape)
    tfidftransformer = TfidfTransformer()
    tfidf_train = tfidftransformer.fit_transform(counts_train)
    
    test_labels = sms_test['label'].values
    test_features = sms_test['text'].values
    count_v2 = CountVectorizer(vocabulary=count_v1.vocabulary_, stop_words='english', max_df=0.5, decode_error='ignore')
    counts_test = count_v2.fit_transform(test_features)
    tfidf_test = tfidftransformer.transform(counts_test)  # reuse the IDF weights fitted on the training counts; refitting on the test set would skew the features
    
    # Train the classifier
    clf = MultinomialNB(alpha = 0.01)
    clf.fit(tfidf_train, train_labels)
    
    # Predict on the test set
    predict_result = clf.predict(tfidf_test)
    #print(predict_result)
    
    # Accuracy
    correct = [test_labels[i]==predict_result[i] for i in range(len(predict_result))]
    r = len(predict_result)
    t = correct.count(True)
    f = correct.count(False)
    print(r, t, f, t/float(r) )
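
The accuracy could also be computed with scikit-learn's accuracy_score, which is equivalent to t/float(r) above (a sketch):

from sklearn.metrics import accuracy_score
print(accuracy_score(test_labels, predict_result))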
The above uses the naive Bayes algorithm; other classifiers can be swapped in.
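
For example, a linear SVM is a common alternative for text classification; as a sketch, only these lines change and the rest of the script stays the same:

from sklearn.svm import LinearSVC

clf = LinearSVC()  # replaces MultinomialNB(alpha=0.01)
clf.fit(tfidf_train, train_labels)
predict_result = clf.predict(tfidf_test)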

Execution Result

runfile('E:/MyProject/_python/ScikitLearn/NaiveBayes.py', wdir='E:/MyProject/_python/ScikitLearn')
(476, 468, 8, 0.9831932773109243)

Only 476 rows ended up in the test set rather than 500, which matches the duplicate-index issue in randomSequence noted in section 2.