Assignment 12
阿新 • Published: 2018-11-30
Naive Bayes Application: Spam Email Classification
Code:
import csv

# Read the data
file_path = r'EmailData.txt'
EmailData = open(file_path, 'r', encoding='utf-8')
Email_data = []
Email_target = []
csv_reader = csv.reader(EmailData, delimiter='\t')
# Store the message text and the target label in separate lists
for line in csv_reader:
    Email_data.append(line[1])
    Email_target.append(line[0])
EmailData.close()

# Replace every meaningless (non-letter) character with a space,
# then split each line into words
Email_data_clear = []
for line in Email_data:
    # line: 'Go until jurong point, crazy.. Available only in bugis n great world la e buffet...'
    newString = line
    for char in line:
        if not char.isalpha():
            newString = newString.replace(char, ' ')
    # Append the cleaned, tokenized line to the list of clean data
    Email_data_clear.append(newString.split(' '))

# Drop empty strings, words no longer than 3 characters, and non-alphabetic tokens
Email_data_clear2 = []
for line in Email_data_clear:
    tempList = []
    for word in line:
        if word != '' and len(word) > 3 and word.isalpha():
            tempList.append(word)
    Email_data_clear2.append(' '.join(tempList))
Email_data_clear = Email_data_clear2

# Split the data into training and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    Email_data_clear2, Email_target, test_size=0.3, random_state=0, stratify=Email_target)

# Build the feature vectors
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(x_train)
X_test = tfidf.transform(x_test)

# Inspect the vectors
import numpy as np
X_train = X_train.toarray()
X_test = X_test.toarray()
print(X_train.shape)
# Print the non-zero entries
for i in range(X_train.shape[0]):
    for j in range(X_train.shape[1]):
        if X_train[i][j] != 0:
            print(i, j, X_train[i][j])

# Build the model
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
module = gnb.fit(X_train, y_train)
y_predict = module.predict(X_test)

# Print the model's classification metrics
from sklearn.metrics import classification_report
cr = classification_report(y_test, y_predict)  # argument order is (y_true, y_pred)
print(cr)
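A side note on the model choice: GaussianNB forces the sparse TF-IDF matrix to be converted to a dense array, which is memory-hungry for large vocabularies. For text features, scikit-learn's MultinomialNB works directly on the sparse matrix. The sketch below shows that variant with a few made-up toy messages (the texts and labels are illustrative, not from EmailData.txt):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: short messages with 'spam'/'ham' labels
texts = [
    "free prize winner claim cash now",
    "free cash offer click claim prize",
    "meeting moved to tomorrow morning",
    "lunch tomorrow with the project team",
]
labels = ["spam", "spam", "ham", "ham"]

# Pipeline: TF-IDF features -> multinomial naive Bayes,
# fitted on the sparse matrix with no .toarray() step
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["claim your free prize now"])[0])        # expected: spam
print(model.predict(["see you at the meeting tomorrow"])[0])  # expected: ham
```

The pipeline also removes the need to call `fit_transform` and `transform` separately, since it applies the fitted vectorizer to new text automatically at predict time.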
Screenshots:
Cleaned data:
Feature vectors:
Model metrics: