
Naive Bayes classification of the 20-news-group corpus and result comparison (Python 3)

I read many CSDN articles before writing this, and most of them were copied straight from Stack Overflow or other English-language sites, so after a whole day of reading I had learned almost nothing. This assignment also drew on several English-language sites for help, but it is written based on my own understanding; consider it a study note. The environment is Python 3 (the assignment is written in English because I study overseas; please excuse my poor English). The full code is attached at the end.
The MNB multinomial naive bayes function was taken from a website; I can no longer find the exact reference. I personally found it very helpful for this assignment.

This task builds a text classifier for the 20 Newsgroups corpus. It is implemented with Python 3.6.
As usual, the first step is collecting data.
By visiting the corpus's homepage (http://qwone.com/~jason/20Newsgroups/), the file 20news-bydate-matlab.tgz can be downloaded. It contains three types of files, .data, .label and .map, for training and testing respectively. The data records several kinds of information: docIdx (document index), wordIdx (word index), count (word count), label ID and label name.
1. Describe this 20 Newsgroups data set.
As its name suggests, this data set consists of news documents from 20 classes. By listing the proportion of each class in the data set and the word distribution, the data set can be described clearly.
Firstly, the working directory is set and train.label is opened to check how many documents are in the data set. Each line of the file is the label of one document, so the number of lines is the number of documents. So, we use total = len(lines) and print it:
total: 11269
Hence there are 11269 documents stored in this data set.
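For reference, here is a minimal sketch of this step (assuming train.label sits in the working directory, as in the full code at the end of this report):

#Each line of train.label is the class ID of one document
with open('train.label') as f:
    lines = f.readlines()
total = len(lines)
print("total:", total)   #-> total: 11269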
By checking train.map we know there are 20 different classes. Here I calculate the occurrence and proportion of each class to describe the data set. All proportions are kept to five significant digits.
Occurrence of each class:
1: 480. 2: 581. 3: 572. 4: 587. 5: 575. 6: 592. 7: 582. 8: 592. 9: 596. 10: 594. 11: 598. 12: 594. 13: 591. 14: 594. 15: 593. 16: 599. 17: 545. 18: 564. 19: 464. 20: 376
Probability of each class:
1: 0.04259, 2: 0.05156, 3: 0.05076, 4: 0.05209, 5: 0.05102, 6: 0.05253, 7: 0.05165, 8: 0.05253, 9: 0.05289, 10: 0.05271,11: 0.05307, 12: 0.05271, 13: 0.05244, 14: 0.05271, 15: 0.05262,16: 0.05315, 17: 0.04836, 18: 0.05005, 19: 0.04117, 20: 0.03337
As we can see, the occurrences of the 20 classes all lie in the range (376, 599). The largest difference between proportions is 0.05315 - 0.03337 = 0.01978, a very small range compared with the whole data set (1.00000). Thus, we can say that the documents are distributed almost uniformly over the 20 classes.
train.map is a file that maps each label name to its label ID. By opening train.map, all the class names can be listed as below:
alt.atheism 1
comp.graphics 2
comp.os.ms-windows.misc 3
comp.sys.ibm.pc.hardware 4
comp.sys.mac.hardware 5
comp.windows.x 6
misc.forsale 7
rec.autos 8
rec.motorcycles 9
rec.sport.baseball 10
rec.sport.hockey 11
sci.crypt 12
sci.electronics 13
sci.med 14
sci.space 15
soc.religion.christian 16
talk.politics.guns 17
talk.politics.mideast 18
talk.politics.misc 19
talk.religion.misc 20
These are all the label names with their corresponding IDs, which lets us match the occurrence list above and read off the proportion of each class.
In terms of words, we can open train.data and check how many words appear in the documents:
there are 11269 documents in the data set
there are 53975 words in the data set
Doing the same for the test data gives:
7505 documents in the test data set
61188 unique words in the test data set
In conclusion, the training data set contains 11269 documents, which include 53975 different words. The documents belong to 20 different classes, each assigned a label name and an ID. Because the variance in class sizes is small, the distribution of documents over classes can be treated as uniform. The test data set contains 7505 documents consisting of 61188 unique words.
In addition, the test set contains more words than the training set, which means unseen words may occur during testing. To prevent their probabilities from being exactly zero, Laplace smoothing shall be used in the Bayes function.
2. Describe how each document is represented in your implementation.

Every document can be seen as a bag of words: a set of distinct words, each paired with its occurrence count.
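As a minimal sketch of this representation (with hypothetical toy indices; the real ones come from train.data), a document maps each wordIdx to its count, and the corpus maps each docIdx to such a bag of words. This is exactly the nested-dictionary structure that the MNB function later in this report builds:

#Bag-of-words sketch: a document is {wordIdx: count}
doc_7 = {12: 3, 845: 1, 53001: 2}   #word 12 occurs 3 times in document 7, etc.
corpus = {7: doc_7}                 #corpus: {docIdx: {wordIdx: count}}
print(corpus[7][12])                #-> 3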
TF-IDF is a numerical statistic which reflects the importance of a word to a document in a collection or corpus. The importance of a word increases with the number of times it appears in the document, and decreases with the frequency at which it appears across the corpus.
TF, which means term frequency, represents how many times a word occurs in a document:
TF(w, d) = count of word w in document d
A TF-IDF matrix can be used to represent a document and can also serve as input to the naïve Bayes function.
Here I use term frequency to generate a matrix indexed by 'docIdx' and 'wordIdx':
Image 1-1: the term frequency matrix
This matrix, with 11269 rows × 53975 columns, shows each word's term frequency in each document. Each row represents one document containing different words. As there are 11269 documents and 53975 different words, the matrix is extremely sparse. If the smoothing technique were applied to this matrix directly, every NaN value would be replaced with an extremely small value, which would produce an unclear and untidy matrix.
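A minimal sketch of building this matrix with pandas, mirroring how train.data is loaded in the code at the end of this report (note that the dense result is very large):

import pandas as pd
#Load train.data: one (docIdx, wordIdx, count) triple per line
df = pd.read_csv('train.data', delimiter=' ', names=['docIdx', 'wordIdx', 'count'])
#Document-by-word term-frequency matrix; absent (docIdx, wordIdx) pairs stay NaN
tf_matrix = df.pivot_table(index='docIdx', columns='wordIdx', values='count')
print(tf_matrix.shape)   #roughly (11269, 53975): one row per document, one column per word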
However, term frequency alone sometimes cannot measure the importance of a word correctly. For example, the word 'the' is too common to be a good keyword. Hence another concept, IDF (inverse document frequency), which measures whether a word is 'too common', can be a factor to distinguish relevant from non-relevant documents: IDF(w) = log(N / n_w), where N is the total number of documents and n_w is the number of documents containing word w.
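A minimal sketch of computing IDF from the same df, mirroring the computation in the appendix code:

import numpy as np
#tot: total number of documents; doc_freq: number of documents containing each word
tot = len(df['docIdx'].unique())
doc_freq = df.groupby('wordIdx')['docIdx'].count()   #each (docIdx, wordIdx) pair occurs once per document
IDF = np.log(tot / doc_freq)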
Hence the raw TF matrix cannot contribute to the naïve Bayes classifier as a factor; only the TF-IDF matrix can. Moreover, the TF-IDF matrix shall be considered along the class dimension, not the document dimension, because the target classifier categorizes documents into classes. A document is represented by its words, so the question can be transformed into: categorize a set of words into classes. The classes can be treated as a replacement for documents; the expected model shall categorize 11269 different sets of words into 20 classes.

3. Describe Naïve Bayes classifier and how you use it to classify the 20 Newsgroups data set.

There are three different Naïve Bayes models in the sklearn library:
GaussianNB: used for classification with continuous features; it assumes that features follow a normal distribution.
MultinomialNB: used for discrete counts; well suited to text analysis.
BernoulliNB: The binomial model is useful if the feature vectors are binary (i.e. zeros and ones). One application would be text classification with ‘bag of words’ model where the 1s & 0s are “word occurs in the document” and “word does not occur in the document” respectively.
Hence multinomial Naïve Bayes is the most suitable classifier for this task. By using the 'MultinomialNB' classifier from the 'sklearn' package, we can estimate the test labels:
clf = MultinomialNB(alpha=0.001, fit_prior=True, class_prior=None)   #alpha: smoothing parameter
clf.fit(x, df.classIdx)          #x: document-term count features; df.classIdx: training labels
predict = clf.predict(test_set)  #predict labels for the test documents
After prediction, the accuracy is shown as below:
Correct rate: 4.516988674217188 %
The correct rate is only 4.51%, which is very poor for a classifier. One likely reason is that the IDF values were not considered. Then, as an optimization, here is another solution:
First, I calculated term frequency and inverse class frequency with smoothing alpha = 0.001, denoted tf and icf. With these two matrices, the model can be trained with MNB, a manually defined function (adapted from an online reference I can no longer locate).
As the data set shows, the test set contains more words than the training set, so some words will never be seen during training. Those words would otherwise get probability zero, which would badly hurt the result. Laplace smoothing, also known as additive smoothing, adds a smoothing parameter alpha so that each word is counted as if it appeared 'alpha' times more often than it really does; the probability then never equals zero, which avoids that bad impact. There are 61188 words in the test set, hence lambda = 61188 and alpha = 0.001, based on the formula:
P(w | c) = (Xi + alpha) / (N + lambda * alpha)

where Xi is the count of word w in class c, N is the total word count of class c, lambda is the vocabulary size, and alpha is the smoothing parameter.
The smoothed probability is therefore P = (Xi + alpha)/(N + lambda*alpha) = (Xi + 0.001)/(N + 61.188). It will be used for the multinomial naïve Bayes.
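As a worked example with hypothetical counts: a word that appears Xi = 3 times in a class containing N = 10000 words in total gets P = (3 + 0.001)/(10000 + 61.188) ≈ 2.98e-4, while an unseen word (Xi = 0) gets P = 0.001/10061.188 ≈ 9.9e-8 rather than zero.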
The MNB function is an implementation of multinomial naïve Bayes. It first creates a matrix relating wordIdx and classIdx, because this task requires training the model on words and classifying 'a set of words' into a class. Just as exam scores cannot be used to train a classifier keyed on student ID, in this case a document ID cannot stand for a class or label. Then I apply the additive smoothing formula to the matrix and calculate the smoothed probability of each word in each class.
Based on the matrix we obtained before, we know each word's occurrence probability in each class. With these probabilities, we can apply the MNB function and train the model on df (the training data) and the training labels. The full code is attached at the end of the report. The error rate on the training set is:
Error: 1.1181116336853314 %
As the error rate shows, this model looks much better than the previous one, which is what we want. Applying this model to the test set gives the classification accuracy.
4. Report the classification accuracy and confusion matrix.
As discussed before, the second MNB function is better than the one in the library. Apply MNB to the test set:
#MNB calculation
roc_s, predict = MNB(test_set)
total = len(test_label)
val = 0
for i, j in zip(predict, test_label):
    if i == j:
        val += 1
print("Correct rate:\t", (val/total) * 100, "%")
The correct rate is :
Correct rate: 79.8800799467022 %
The classification accuracy is nearly 80%, which is acceptable.
In machine learning, the confusion matrix is a visualization tool in which each column represents a predicted class and each row represents the actual class. Based on the actual labels, it makes it easy for a researcher to examine whether the model is confusing two different classes. The result is shown below:
Image1-2: Confusion Matrix (Test_Label, Predict)
In this part, I first use the confusion-matrix function provided by the sklearn package to generate a confusion matrix. Then, following the sklearn website, I implemented a confusion-matrix plot.
All numbers are kept to three significant digits, which makes the matrix look better and clearer. In this matrix, each row represents the true label and each column represents the predicted label. Each cell where the row and column carry the same name gives the probability of a successful prediction; every other cell is a misclassification. The probability is also shown by color, from white to green: white stands for 0%, and the darker the green becomes, the higher the probability.
As this matrix shows, the prediction performance on class comp.os.ms-windows.misc is not good, with a true-positive rate of only 0.55, and it is easily confused with comp.sys.ibm.pc.hardware (15 percent).
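A minimal toy example (hypothetical labels, not this report's data) showing sklearn's row/column convention:

from sklearn.metrics import confusion_matrix
y_true = [1, 1, 2, 2, 3]
y_pred = [1, 2, 2, 2, 3]
print(confusion_matrix(y_true, y_pred))
#[[1 1 0]   true class 1: one correct, one confused with class 2
# [0 2 0]   true class 2: both correct
# [0 0 1]]  true class 3: correct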

5. Plot the ROC and report the AUC.
As the sklearn functions require a y_score, the y_score is added to the MNB function as an extra return value. Every ROC curve starts at (0, 0); sweeping the threshold through the cases one by one, the curve moves up when the true label equals the predicted label and right otherwise. In this assignment, I use sklearn's solution to draw the ROC image as below:
[Image: ROC curves of the 20 classes]

The ROC plot examines the performance of a model in classifying the different classes. An ideal model would reach the point (0, 1.0) in the plot. To quantify the performance, the AUC, which is the area under the ROC curve, shall be used: the bigger the AUC, the better the model performs. In this case, the ROC looks good; every ROC curve lies above the y = x line, which means the model has genuinely learned something. The average AUC of 0.83 represents good performance. The largest and smallest AUC are 0.94 and 0.72, belonging to the classes rec.sport.hockey and sci.med respectively.
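A minimal sketch of computing one ROC curve and its AUC with sklearn (hypothetical toy scores, not this report's data):

from sklearn.metrics import roc_curve, auc
y_true = [0, 0, 1, 1]            #binary indicator: does the document belong to the class?
y_score = [0.1, 0.4, 0.35, 0.8]  #classifier scores for that class
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc(fpr, tpr))             #-> 0.75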
All the code is attached at the end of the report.

#coding: utf-8

#Task 1

#In[2]:


#CSCI946 ASSIGNMENT 2
#STUDENT NAME:BOWEN SUN
#STUDENT NUMBER:5543654
#STUDENT LOGIN: bs361


#In[3]:


import os
import numpy as np
import pandas as pd
import operator
import matplotlib.pyplot as plt
import itertools
from itertools import cycle
from scipy import interp
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_curve, auc

path = "/Users/lenovo/Desktop/bowenA2"
os.chdir(path)


#In[4]:


#Training label
train_label = open('train.label')

#prop is the fraction of each class
prop = {}

#Set a class index for each document as key
for i in range(1,21):
    prop[i] = 0
    
#Extract values from training labels
lines = train_label.readlines()

#Get total number of documents
total = len(lines)

#Count the occurrence of each class
for line in lines:
    val = int(line.split()[0])
    prop[val] += 1

#Divide the count of each class by total documents 
for key in prop:
    prop[key] /= total
        
print("total:",total)
print("Occurency of each class:")
print(".   ".join("{}: {}".format(i, k) for i, k in prop.items()))
print("Probability of each class:")
print(",".join("{}: {}".format(i,k) for i, k in prop.items()))


#In[5]:


#Training data
df = pd.read_csv('train.data', delimiter=' ', names=['docIdx', 'wordIdx', 'count'])

#Training label
label = []
train_label = open('train.label')
lines = train_label.readlines()
for line in lines:
    label.append(int(line.split()[0]))

#Increase label length to match docIdx
docIdx = df['docIdx'].values
i = 0
new_label = []
for index in range(len(docIdx)-1):
    new_label.append(label[i])
    if docIdx[index] != docIdx[index+1]:
        i += 1
new_label.append(label[i]) #for-loop ignores last value

#Add label column
df['classIdx'] = new_label

df


#In[6]:


#Get train label name
trainName = []
train_map = open('train.map')
lines = train_map.readlines()
for line in lines:
    trainName.append(str(line.split()[0]))


#In[7]:


#document numbers in train.data
print(max(df['docIdx']))
#word numbers in train.data
print(max(df['wordIdx']))


#In[8]:


#Get test data
test_data = open('test.data')
test_set = pd.read_csv(test_data, delimiter=' ', names=['docIdx', 'wordIdx', 'count'])
print(max(test_set['docIdx']))
print(max(test_set['wordIdx']))

#Get list of test data labels
test_label = pd.read_csv('test.label', names=['t'])
test_label = test_label['t'].tolist()

#Generate a one-hot test label matrix over the 20 classes: 1 for the true class, 0 elsewhere
y_test = np.zeros(shape=(len(test_label),20))
for i in range(len(test_label)) :
    y_test[i][test_label[i]-1] = 1


#In[9]:


#Alpha value for smoothing
a = 0.001

#Term frequency of each word within each document (the matrix shown in Image 1-1)
pb_ij = df.groupby(['docIdx','wordIdx'])
pb_j = df.groupby(['docIdx'])
Pr = (pb_ij['count'].sum()) / (pb_j['count'].sum())

#Unstack series
Pr = Pr.unstack()

#NaN entries (words absent from a document) are left as-is here;
#smoothing is applied at the class level in the next cell
Pr


#In[78]:


#Class-conditional probability of each word, with additive (Laplace) smoothing
lam = 61188   #vocabulary size, so lam*a = 61188*0.001 = 61.188
pb_ij = df.groupby(['classIdx','wordIdx'])
pb_j = df.groupby(['classIdx'])
Pr = (pb_ij['count'].sum() + a) / (pb_j['count'].sum() + lam*a)

#Unstack series
Pr = Pr.unstack()

#Replace NaN (words never seen in a class) with the smoothed zero-count probability
for c in range(1,21):
    Pr.loc[c,:] = Pr.loc[c,:].fillna(a/(pb_j['count'].sum()[c] + lam*a))

#Convert to dictionary for greater speed
Pr_dict = Pr.to_dict()
#Calculate IDF: log(total documents / number of documents containing the word)
tot = len(df['docIdx'].unique())
pb_ij = df.groupby(['wordIdx'])
IDF = np.log(tot/pb_ij['docIdx'].count())
IDF_dict = IDF.to_dict()


#In[79]:



def MNB(df):
    '''
    Multinomial Naive Bayes classifier
    :param df [Pandas Dataframe]: Dataframe of data
    :return roc_score [list]: shifted per-class scores for each document (used for ROC)
    :return prediction [list]: Predicted class ID for each document
    '''
    #Using dictionaries for greater speed
    df_dict = df.to_dict()
    new_dict = {}
    prediction = []
    roc_score = []

    #new_dict = {docIdx : {wordIdx: count},....}
    for idx in range(len(df_dict['docIdx'])):
        docIdx = df_dict['docIdx'][idx]
        wordIdx = df_dict['wordIdx'][idx]
        count = df_dict['count'][idx]
        try:
            new_dict[docIdx][wordIdx] = count
        except KeyError:
            new_dict[docIdx] = {}
            new_dict[docIdx][wordIdx] = count

    #Calculating the scores for each doc
    for docIdx in range(1, len(new_dict)+1):
        score_dict = {}
        #Creating a score for each class
        for classIdx in range(1,21):
            score_dict[classIdx] = 1
            #For each word, add log(1+f) * log(Pr(word|class) * IDF(word))
            for wordIdx in new_dict[docIdx]:
                try:
                    probability = Pr_dict[wordIdx][classIdx]
                    power = np.log(1 + new_dict[docIdx][wordIdx])
                    score_dict[classIdx] += power * np.log(probability * IDF_dict[wordIdx])
                except KeyError:
                    #Words unseen in training contribute nothing
                    score_dict[classIdx] += 0
            #Add the log prior of the class
            score_dict[classIdx] += np.log(prop[classIdx])

        #Get class with max score for the given docIdx
        max_score = max(score_dict, key=score_dict.get)
        prediction.append(max_score)

        #Shift scores so the minimum is 0; these serve as y_score for the ROC
        minimum = score_dict[min(score_dict, key=score_dict.get)]
        for classIdx in range(1,21):
            score_dict[classIdx] = score_dict[classIdx] - minimum

        roc_score.append(list(score_dict.values()))
    return roc_score, prediction


#In[80]:


#Test the error rate of the trained model on the training set
#Get list of labels
train_label = pd.read_csv('train.label', names=['t'])
train_label = train_label['t'].tolist()
total = len(train_label)
roc_score, predict = MNB(df)
val = 0
for i, j in zip(predict, train_label):
    if i != j:
        val += 1
print("Error:\t\t", val/total * 100, "%")


#In[81]:


#Get test label name
testName = []
test_map = open('test.map')
lines = test_map.readlines()
for line in lines:
    testName.append(str(line.split()[0]))

#MNB calculation on the test set
roc_s, predict = MNB(test_set)

total = len(test_label)
val = 0
for i, j in zip(predict, test_label):
    if i == j:
        val += 1
print("Correct rate:\t", (val/total) * 100, "%")


#In[82]:


#Calculate the complete matrix of row document and features
def complete_matrix(df):
    df_dict = df.to_dict()
    new_dict = {}
    
    #new_dict = {docIdx : {wordIdx: count},....}
    for idx in range(len(df_dict['docIdx'])):
        docIdx = df_dict['docIdx'][idx]
        wordIdx = df_dict['wordIdx'][idx]
        count = df_dict['count'][idx]
        try:
            new_dict[docIdx][wordIdx] = count
        except KeyError:
            new_dict[docIdx] = {}
            new_dict[docIdx][wordIdx] = count
    return new_dict
compMatrix = complete_matrix(test_set)
compMatrix[1]


#In[83]:


def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Greens):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    print(cm)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], '.2f'),
        horizontalalignment="center",
        color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True')
    plt.xlabel('Predict')
    plt.tight_layout()

#Compute confusion matrix
cma = confusion_matrix(test_label, predict)
np.set_printoptions(precision=2)

#Plot the row-normalized confusion matrix
plt.figure(figsize=(12, 10), facecolor='w', edgecolor='b')
plot_confusion_matrix(cma, classes=testName,
                      title='Confusion Matrix By Bowen')
cma
#plt.savefig('confusionMatrix.png')
#plt.show()


#In[85]:


#Compute ROC curve and ROC area for each class
y_score = np.array(roc_s)

fpr = dict()
tpr = dict()
roc_auc = dict()

for i in range(20):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])

#Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

#Compute macro-average ROC curve and ROC area
#First aggregate all false positive rates
all_fpr = np.unique(np.concatenate([fpr[i] for i in range(20)]))

#Then interpolate all ROC curves at these points
mean_tpr = np.zeros_like(all_fpr)
for i in range(20):
    mean_tpr += interp(all_fpr, fpr[i], tpr[i])

#Finally average it and compute AUC
mean_tpr /= 20

fpr["macro"] = all_fpr
tpr["macro"] = mean_tpr
roc_auc["macro"] = auc(fpr["macro"], tpr["macro"])

#Plot of a ROC curve for a specific class
plt.figure(figsize=(14, 12))
lw = 2
plt.plot(fpr["macro"], tpr["macro"],
         label='macro-average ROC curve (area = {0:0.2f})'
               ''.format(roc_auc["macro"]),
         color='navy', linestyle=':', linewidth=4)
colors = cycle(['red', 'orange','yellow','green','blue','darkblue','black',])
for i, color, j in zip(range(20), colors, testName):
    plt.plot(fpr[i], tpr[i], color=color, lw=lw,
             label='ROC curve of class {0} (area = {1:0.2f})'
             ''.format(j, roc_auc[i]))
plt.plot([0, 1], [0, 1], 'k--', lw=lw)
plt.xlim([0.0, 1.01])
plt.ylim([0.0, 1.01])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC&AUC')
plt.legend(loc="lower right")
plt.show()
