
Assignment 12


Applying naive Bayes: spam email classification

Code:

import csv
# Read the data
file_path = r'EmailData.txt'
EmailData = open(file_path, 'r', encoding='utf-8')
Email_data = []
Email_target = []
csv_reader = csv.reader(EmailData, delimiter='\t')
# Store the message text and the target label in separate lists
for line in csv_reader:
    Email_data.append(line[1])
    Email_target.append(line[0])
EmailData.close()
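As a minimal sketch of the input format this loop expects (the sample lines below are hypothetical, apart from the "Go until jurong point" message quoted later in the post), each line of EmailData.txt is a tab-separated pair: label in column 0, message text in column 1.

```python
import csv
import io

# Two hypothetical tab-separated lines in the expected file format
sample = ("ham\tGo until jurong point, crazy..\n"
          "spam\tWINNER!! Claim your prize now\n")

# csv.reader with delimiter='\t' splits each line into [label, text]
rows = list(csv.reader(io.StringIO(sample), delimiter='\t'))
labels = [row[0] for row in rows]
texts = [row[1] for row in rows]
print(labels)  # ['ham', 'spam']
```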

# Replace every meaningless (non-alphabetic) symbol with a space, then tokenize
Email_data_clear = []
for line in Email_data:
    # line: 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet...'
    # strip the meaningless symbols from each line, then split on spaces
    for char in line:
        if char.isalpha() is False:  # not a letter: replace it with a space
            line = line.replace(char, " ")
    tempList = line.split(" ")
    # append the cleaned line to the list of clean data
    Email_data_clear.append(tempList)

# Drop words of length <= 3 and words that carry no meaning
Email_data_clear2 = []
for line in Email_data_clear:
    tempList = []
    for word in line:
        if word != '' and len(word) > 3 and word.isalpha():
            tempList.append(word)
    tempString = ' '.join(tempList)
    Email_data_clear2.append(tempString)
Email_data_clear = Email_data_clear2

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(Email_data_clear2, Email_target, test_size=0.3, random_state=0, stratify=Email_target)

# Build TF-IDF feature vectors from the text
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(x_train)
X_test = tfidf.transform(x_test)

# Inspect the vectors (GaussianNB needs dense arrays)
import numpy as np
X_train = X_train.toarray()
X_test = X_test.toarray()
print(X_train.shape)

# Print the non-zero entries
for i in range(X_train.shape[0]):
    for j in range(X_train.shape[1]):
        if X_train[i][j] != 0:
            print(i, j, X_train[i][j])

# Build the model
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
module = gnb.fit(X_train, y_train)
y_predict = module.predict(X_test)

# Print the per-class metrics (classification_report takes y_true first, then y_pred)
from sklearn.metrics import classification_report
cr = classification_report(y_test, y_predict)
print(cr)
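A side note on the model choice: GaussianNB forces the dense `.toarray()` conversion above, which is memory-heavy for large vocabularies. A common alternative for TF-IDF text features is MultinomialNB, which works on the sparse matrix directly. A minimal sketch on toy data (not the assignment's dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus standing in for the cleaned email text
texts = ["free prize winner claim now",
         "meeting tomorrow at noon",
         "claim your free prize",
         "lunch at noon tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)      # stays sparse; no .toarray() needed
model = MultinomialNB().fit(X, labels)
print(model.predict(tfidf.transform(["free prize now"])))
```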

Screenshots:

Cleaned data: [screenshot]

Feature vectors: [screenshot]

Model metrics: [screenshot]
