（資料探勘-入門-8）基於樸素貝葉斯的文字分類器

阿新 • • 發佈：2019-02-11

主要內容：

1、動機

2、基於樸素貝葉斯的文字分類器

3、python實現

一、動機

之前介紹的樸素貝葉斯分類器所使用的都是結構化的資料集，即每行代表一個樣本，每列代表一個特徵屬性。

但在實際中，尤其是網頁中，爬蟲所採集到的資料都是非結構化的，如新聞、微博、帖子等，如果要對對這一類資料進行分類，應該怎麼辦呢？例如，新聞分類，微博情感分析等。

本文就介紹一種基於樸素貝葉斯的文字分類器。

二、基於樸素貝葉斯的文字分類器

目標：對非結構化的文字進行分類

首先，回顧一下樸素貝葉斯公式：

特徵、特徵處理：

對於結構化資料，公式中的D代表的是樣本或一系列整理或抽象出來的特徵或屬性，

而在非結構化的資料中，只有文件和單詞，文件對應樣本，單詞對應特徵。

如果將單詞作為特徵，未免特徵太多了，一篇文章有那麼多單詞，而且有些單詞並不起什麼作用，因此需要對特徵即單詞進行處理。

停用詞：我們將一些常見的詞，如the, a, I, is, that等，稱為“停用詞”，因為它們在很多文章都會出現，不具稱為特徵的代表性。

模型引數計算：

別忘了，樸素貝葉斯的假設前提：在已知類別下，所有特徵是獨立的。（在文字中，即不考慮單詞之間的次序和相關性）

如何計算模型的引數，即已知類別時某文字的條件概率呢？（這裡只是進行簡單的統計而已，複雜一點的可以考慮TF-IDF作為單詞特徵，下面的公式已經做了平滑處理）

W_k：表示某個單詞

hi: 表示某個類別

nk: 表示單詞wk在類別hi中出現的次數

n：表示類別中的單詞總數

vocabulary：表示類別中的單詞數

分類：

來一篇新文章，如何判斷它是屬於哪一類呢？

如下公式，分別計算屬於每一類的概率，然後取概率最大的作為其類別。

應用：

新聞分類、垃圾郵件分類、微博情感分析等等

三、python實現

資料集：

程式碼：

1、新聞分類

from __future__ import print_function
import os, codecs, math

class BayesText:

    def __init__(self, trainingdir, stopwordlist):
        """This class implements a naive Bayes approach to text
        classification
        trainingdir is the training data. Each subdirectory of
        trainingdir is titled with the name of the classification
        category -- those subdirectories in turn contain the text
        files for that category.
        The stopwordlist is a list of words (one per line) will be
        removed before any counting takes place.
         
"""
        self.vocabulary = {}
        self.prob = {}
        self.totals = {}
        self.stopwords = {}
        f = open(stopwordlist)
        for line in f:
            self.stopwords[line.strip()] = 1
        f.close()
        categories = os.listdir(trainingdir)
        #filter out files that are not directories
        self.categories = [filename for filename in categories
                           if os.path.isdir(trainingdir + filename)]
        print("Counting ...")
        for category in self.categories:
            print('    ' + category)
            (self.prob[category],
             self.totals[category]) = self.train(trainingdir, category)
        # I am going to eliminate any word in the vocabulary
        # that doesn't occur at least 3 times
        toDelete = []
        for word in self.vocabulary:
            if self.vocabulary[word] < 3:
                # mark word for deletion
                # can't delete now because you can't delete
                # from a list you are currently iterating over
                toDelete.append(word)
        # now delete
        for word in toDelete:
            del self.vocabulary[word]
        # now compute probabilities
        vocabLength = len(self.vocabulary)
        print("Computing probabilities:")
        for category in self.categories:
            print('    ' + category)
            denominator = self.totals[category] + vocabLength
            for word in self.vocabulary:
                if word in self.prob[category]:
                    count = self.prob[category][word]
                else:
                    count = 1
                self.prob[category][word] = (float(count + 1)
                                             / denominator)
        print ("DONE TRAINING\n\n")
                    

    def train(self, trainingdir, category):
        """counts word occurrences for a particular category"""
        currentdir = trainingdir + category
        files = os.listdir(currentdir)
        counts = {}
        total = 0
        for file in files:
            #print(currentdir + '/' + file)
            f = codecs.open(currentdir + '/' + file, 'r', 'iso8859-1')
            for line in f:
                tokens = line.split()
                for token in tokens:
                    # get rid of punctuation and lowercase token
                    token = token.strip('\'".,?:-')
                    token = token.lower()
                    if token != '' and not token in self.stopwords:
                        self.vocabulary.setdefault(token, 0)
                        self.vocabulary[token] += 1
                        counts.setdefault(token, 0)
                        counts[token] += 1
                        total += 1
            f.close()
        return(counts, total)
                    
                    
    def classify(self, filename):
        results = {}
        for category in self.categories:
            results[category] = 0
        f = codecs.open(filename, 'r', 'iso8859-1')
        for line in f:
            tokens = line.split()
            for token in tokens:
                #print(token)
                token = token.strip('\'".,?:-').lower()
                if token in self.vocabulary:
                    for category in self.categories:
                        if self.prob[category][token] == 0:
                            print("%s %s" % (category, token))
                        results[category] += math.log(
                            self.prob[category][token])
        f.close()
        results = list(results.items())
        results.sort(key=lambda tuple: tuple[1], reverse = True)
        # for debugging I can change this to give me the entire list
        return results[0][0]

    def testCategory(self, directory, category):
        files = os.listdir(directory)
        total = 0
        correct = 0
        for file in files:
            total += 1
            result = self.classify(directory + file)
            if result == category:
                correct += 1
        return (correct, total)

    def test(self, testdir):
        """Test all files in the test directory--that directory is
        organized into subdirectories--each subdir is a classification
        category"""
        categories = os.listdir(testdir)
        #filter out files that are not directories
        categories = [filename for filename in categories if
                      os.path.isdir(testdir + filename)]
        correct = 0
        total = 0
        for category in categories:
            print(".", end="")
            (catCorrect, catTotal) = self.testCategory(
                testdir + category + '/', category)
            correct += catCorrect
            total += catTotal
        print("\n\nAccuracy is  %f%%  (%i test instances)" %
              ((float(correct) / total) * 100, total))
            
# change these to match your directory structure
baseDirectory = "20news-bydate/"
trainingDir = baseDirectory + "20news-bydate-train/"
testDir = baseDirectory + "20news-bydate-test/"


stoplistfile = "20news-bydate/stopwords0.txt"
print("Reg stoplist 0 ")
bT = BayesText(trainingDir, baseDirectory + "stopwords0.txt")
print("Running Test ...")
bT.test(testDir)

print("\n\nReg stoplist 25 ")
bT = BayesText(trainingDir, baseDirectory + "stopwords25.txt")
print("Running Test ...")
bT.test(testDir)

print("\n\nReg stoplist 174 ")
bT = BayesText(trainingDir, baseDirectory + "stopwords174.txt")
print("Running Test ...")
bT.test(testDir)

2、情感分析

from __future__ import print_function
import os, codecs, math

class BayesText:

    def __init__(self, trainingdir, stopwordlist, ignoreBucket):
        """This class implements a naive Bayes approach to text
        classification
        trainingdir is the training data. Each subdirectory of
        trainingdir is titled with the name of the classification
        category -- those subdirectories in turn contain the text
        files for that category.
        The stopwordlist is a list of words (one per line) will be
        removed before any counting takes place.
        """
        self.vocabulary = {}
        self.prob = {}
        self.totals = {}
        self.stopwords = {}
        f = open(stopwordlist)
        for line in f:
            self.stopwords[line.strip()] = 1
        f.close()
        categories = os.listdir(trainingdir)
        #filter out files that are not directories
        self.categories = [filename for filename in categories
                           if os.path.isdir(trainingdir + filename)]
        print("Counting ...")
        for category in self.categories:
            #print('    ' + category)
            (self.prob[category],
             self.totals[category]) = self.train(trainingdir, category,
                                                 ignoreBucket)
        # I am going to eliminate any word in the vocabulary
        # that doesn't occur at least 3 times
        toDelete = []
        for word in self.vocabulary:
            if self.vocabulary[word] < 3:
                # mark word for deletion
                # can't delete now because you can't delete
                # from a list you are currently iterating over
                toDelete.append(word)
        # now delete
        for word in toDelete:
            del self.vocabulary[word]
        # now compute probabilities
        vocabLength = len(self.vocabulary)
        #print("Computing probabilities:")
        for category in self.categories:
            #print('    ' + category)
            denominator = self.totals[category] + vocabLength
            for word in self.vocabulary:
                if word in self.prob[category]:
                    count = self.prob[category][word]
                else:
                    count = 1
                self.prob[category][word] = (float(count + 1)
                                             / denominator)
        #print ("DONE TRAINING\n\n")
                    

    def train(self, trainingdir, category, bucketNumberToIgnore):
        """counts word occurrences for a particular category"""
        ignore = "%i" % bucketNumberToIgnore
        currentdir = trainingdir + category
        directories = os.listdir(currentdir)
        counts = {}
        total = 0
        for directory in directories:
            if directory != ignore:
                currentBucket = trainingdir + category + "/" + directory
                files = os.listdir(currentBucket)
                #print("   " + currentBucket)
                for file in files:
                    f = codecs.open(currentBucket + '/' + file, 'r', 'iso8859-1')
                    for line in f:
                        tokens = line.split()
                        for token in tokens:
                            # get rid of punctuation and lowercase token
                            token = token.strip('\'".,?:-')
                            token = token.lower()
                            if token != '' and not token in self.stopwords:
                                self.vocabulary.setdefault(token, 0)
                                self.vocabulary[token] += 1
                                counts.setdefault(token, 0)
                                counts[token] += 1
                                total += 1
                    f.close()
        return(counts, total)
                    
                    
    def classify(self, filename):
        results = {}
        for category in self.categories:
            results[category] = 0
        f = codecs.open(filename, 'r', 'iso8859-1')
        for line in f:
            tokens = line.split()
            for token in tokens:
                #print(token)
                token = token.strip('\'".,?:-').lower()
                if token in self.vocabulary:
                    for category in self.categories:
                        if self.prob[category][token] == 0:
                            print("%s %s" % (category, token))
                        results[category] += math.log(
                            self.prob[category][token])
        f.close()
        results = list(results.items())
        results.sort(key=lambda tuple: tuple[1], reverse = True)
        # for debugging I can change this to give me the entire list
        return results[0][0]

    def testCategory(self, direc, category, bucketNumber):
        results = {}
        directory = direc + ("%i/" % bucketNumber)
        #print("Testing " + directory)
        files = os.listdir(directory)
        total = 0
        correct = 0
        for file in files:
            total += 1
            result = self.classify(directory + file)
            results.setdefault(result, 0)
            results[result] += 1
            #if result == category:
            #               correct += 1
        return results

    def test(self, testdir, bucketNumber):
        """Test all files in the test directory--that directory is
        organized into subdirectories--each subdir is a classification
        category"""
        results = {}
        categories = os.listdir(testdir)
        #filter out files that are not directories
        categories = [filename for filename in categories if
                      os.path.isdir(testdir + filename)]
        correct = 0
        total = 0
        for category in categories:
            #print(".", end="")
            results[category] = self.testCategory(
                testdir + category + '/', category, bucketNumber)
        return results

def tenfold(dataPrefix, stoplist):
    results = {}
    for i in range(0,10):
        bT = BayesText(dataPrefix, stoplist, i)
        r = bT.test(theDir, i)
        for (key, value) in r.items():
            results.setdefault(key, {})
            for (ckey, cvalue) in value.items():
                results[key].setdefault(ckey, 0)
                results[key][ckey] += cvalue
                categories = list(results.keys())
    categories.sort()
    print(   "\n       Classified as: ")
    header =    "          "
    subheader = "        +"
    for category in categories:
        header += "% 2s   " % category
        subheader += "-----+"
    print (header)
    print (subheader)
    total = 0.0
    correct = 0.0
    for category in categories:
        row = " %s    |" % category 
        for c2 in categories:
            if c2 in results[category]:
                count = results[category][c2]
            else:
                count = 0
            row += " %3i |" % count
            total += count
            if c2 == category:
                correct += count
        print(row)
    print(subheader)
    print("\n%5.3f percent correct" %((correct * 100) / total))
    print("total of %i instances" % total)

# change these to match your directory structure
prefixPath = "reviewPolarityBuckets/review_polarity_buckets/"
theDir = prefixPath + "/txt_sentoken/"
stoplistfile = prefixPath + "stopwords25.txt"
tenfold(theDir, stoplistfile)

（資料探勘-入門-8）基於樸素貝葉斯的文字分類器

主要內容： 1、動機 2、基於樸素貝葉斯的文字分類器 3、python實現一、動機之前介紹的樸素貝葉斯分類器所使用的都是結構化的資料集，即每行代表一個樣本，每列代表一個特徵屬性。但在實際中，尤其是網頁中，爬蟲所採集到的資料都是非結構化的，如新聞、微博、帖子等，如果要對對這一類資料進行分類，應該怎麼辦

樸素貝葉斯文本分類（詳解）

詞向量列表出現下標 put The 標註問題 else from numpy import zeros,array from math import log def loadDataSet(): #詞條切分後的文檔集合，列表每一行代表一個email p

資料探勘入門系列教程（四）之基於scikit-lean實現決策樹

資料探勘入門系列教程（四）之基於scikit-lean決策樹處理Iris載入資料集資料特徵訓練隨機森林調參工程師結尾資料探勘入門系列教程（四）之基於scikit-lean決策樹處理Iris 在上一篇部落格，我們介紹了決策樹的一些知識。如果對決策樹還不是很瞭解的話，建議先閱讀上一篇部落格，在來學習這

資料探勘入門系列教程（八）之使用神經網路（基於pybrain）識別數字手寫集MNIST

[TOC] ## 資料探勘入門系列教程（八）之使用神經網路（基於pybrain）識別數字手寫集MNIST 在本章節中，並不會對神經網路進行介紹，因此如果不瞭解神經網路的話，強烈推薦先去看《西瓜書》，或者看一下我的上一篇部落格：[資料探勘入門系列教程（七點五）之神經網路介紹](https://www.cnb

資料探勘入門系列教程（九）之基於sklearn的SVM使用

目錄介紹基於SVM對MINIST資料集進行分類使用SVMSVM分析垃圾郵件載入資料集分詞構建詞雲構建資料集進行訓練交叉驗證煉丹術總結參考介紹在上一篇部落格：資料探勘入門系列教程（八點五）之SVM介紹以及從零開始公式推導中，詳細的講述了SVM的原理，並進行了詳細的數學推導。在這篇部落格中，主要是應用SVM，

python資料探勘入門與實踐-----------通過親和力分析推薦電影（Apriori）

嚶~本節程式碼比著書上的原始碼看了一遍並加上了自己的理解註釋，但並沒有執行成功，因為他執行警告，我還不會改錯親和力分析：從頻繁出現的商品中選取共同出現額商品組成頻繁項集，生成關聯規則 import os import pandas as pd import sys #資料讀取 rating

python資料探勘入門與實踐--------轉換器（資料與處理）與流水線

y=MinMaxScaler().fit_transform(x) y與x為同型矩陣，y每列值的值域為0到1 sklearn.preprocessing.Normalizer 每條資料各特徵值的和為1 sklearn.preprocessing.StandardScaler 各特

python資料探勘入門與實踐--------電離層（Ionosphere）, scikit-learn估計器，K近鄰分類器，交叉檢驗，設定引數

ionosphere.data下載地址：http://archive.ics.uci.edu/ml/machine-learning-databases/ionosphere/ 原始碼及相關資料下載 https://github.com/xxg1413/MachineLea

python資料探勘入門與實戰——學習筆記（第3、4章）

chapter 3 決策樹預測獲勝球隊 pandas載入資料集 import pandas as pd dataset = pd.read_csv('filepath+filename') 資料清洗，可在讀入時清洗 dataset = pd.read_csv('filen

資料探勘入門系列教程（一）之親和性分析

資料探勘入門系列教程（一）之親和性分析教程系列簡介系列地址：https://www.cnblogs.com/xiaohuiduan/category/1661541.html 該教程為入門教程，為博主學習資料探勘的學習路徑步驟。教程為入門教程，從最簡單的開始。使用的程式語言為Pytho

資料探勘入門系列教程（二）之分類問題OneR演算法

資料探勘入門系列教程（二）之分類問題OneR演算法資料探勘入門系列部落格：https://www.cnblogs.com/xiaohuiduan/category/1661541.html 專案地址：GitHub 在上一篇部落格中，我們通過分析親和性來尋找資料集中資料與資料之間的相關關係。這篇部落

資料探勘入門系列教程（三）之scikit-learn框架基本使用（以K近鄰演算法為例）

資料探勘入門系列教程（三）之scikit-learn框架基本使用（以K近鄰演算法為例）簡介scikit-learn 估計器載入資料集進行fit訓練設定引數預處理流水線結尾資料探勘入門系列教程（三）之scikit-learn框架基本使用（以K近鄰演算法為例）資料探勘入門系列部落格：https://

資料探勘入門系列教程（三點五）之決策樹

## 資料探勘入門系列教程（三點五）之決策樹本來還是想像以前一樣，繼續學習《 Python資料探勘入門與實踐》的第三章“決策樹”，但是這本書上來就直接給我懟了一大串程式碼，對於`決策樹`基本上沒有什麼介紹，可直接把我給弄懵逼了，主要我只聽過決策樹還沒有認真的瞭解過它。這一章節主要是對決策樹做一個介紹

資料探勘入門系列教程（四點五）之Apriori演算法

[TOC] ## 資料探勘入門系列教程（四點五）之Apriori演算法 Apriori（先驗）演算法**關聯規則**學習的經典演算法之一，用來尋找出資料集中頻繁出現的資料集合。如果看過以前的部落格，是不是想到了這個跟[資料探勘入門系列教程（一）之親和性分析](https://www.cnblogs.com

資料探勘入門系列教程（五）之Apriori演算法Python實現

資料探勘入門系列教程（五）之Apriori演算法Python實現載入資料集獲得訓練集頻繁項的生成生成規則獲得support獲得confidence獲得Lift進行驗證總結參考資料探勘入門系列教程（五）之Apriori演算法Python實現在上一篇部落格中，我們介紹了Apriori演算法的演算法流

資料探勘入門系列教程（六）之資料集特徵選擇

目錄資料探勘入門系列教程（六）之資料集特徵選擇簡介載入資料集Adult 特徵選擇選擇最佳特徵總結資料探勘入門系列教程

資料探勘入門系列教程（七）之樸素貝葉斯進行文字分類

資料探勘入門系列教程（七）之樸素貝葉斯進行文字分類貝葉斯分類演算法是一類分類演算法的總和，均以貝葉斯定理為基礎，故稱之為貝葉斯分類。而樸素貝葉斯分類演算法就是其中最簡單的分類演算法。樸素貝葉斯分類演算法樸素貝葉斯分類演算法很簡單很簡單，就一個公式如下所示： P(B|A) = \frac{P(A|B) P

資料探勘入門系列教程（七點五）之神經網路介紹

[TOC] ## 資料探勘入門系列教程（七點五）之神經網路介紹這篇部落格是是為了下一篇部落格“使用神經網路破解驗證碼”做準備。主要是對神經網路的原理做介紹。同時這篇部落格主要是參考了西瓜書，如果身邊有西瓜書的同學，強烈建議直接去看西瓜書，至於我這篇部落格，你就當個樂子好了（因為你會發現內容與西瓜書很相似

資料探勘入門系列教程（八點五）之SVM介紹以及從零開始推導公式

目錄SVM介紹線性分類間隔最大間隔分類器拉格朗日乘子法（Lagrange multipliers）拉格朗日乘子法推導KKT條件（Karush-Kuhn-Tucker Conditions）拉格朗日乘子法對偶問題Slater 條件最大間隔分類器與拉格朗日乘子法核技巧核函式軟間隔軟間隔支援向量機推導SMO演算法S

資料探勘入門系列教程（十點五）之DNN介紹及公式推導

## 深度神經網路(DNN，Deep Neural Networks)簡介首先讓我們先回想起在之前部落格（[資料探勘入門系列教程（七點五）之神經網路介紹](https://www.cnblogs.com/xiaohuiduan/p/12623925.html)）中介紹的神經網路：為了解決M-P模型中無法處

（資料探勘-入門-8）基於樸素貝葉斯的文字分類器

一、動機

二、基於樸素貝葉斯的文字分類器

三、python實現

相關推薦