Using Python to Count the Distribution of Letters, Words, and Phrases in a Text File

This is a pair programming assignment from MSRA's advanced software design course.

This post discusses the concrete implementation and process, including performance profiling and unit testing.

For how to use the profiling tools, see these two posts:

For the complete code of the project, see the GitHub repository below:

First, the requirements of the project:

User need: How are the frequencies of the 26 English letters distributed in a novel? Which words commonly appear in a given type of article? Which words does a particular author use most? What are the most common phrases in Harry Potter, and so on. We will write some programs to solve these problems and satisfy our curiosity.

Requirements: unit testing, regression testing, and performance testing of the program, plus use of and debugging in a basic language such as C/C++/C#.

Assignment steps:

Step-0: Output the frequencies of the 26 letters in an English text file, sorted from high to low, and show each letter's percentage to two decimal places.

Step-1: Output the top N most frequently occurring English words in a single file.

Step-2: Support stop words: build a stop word file (stop list) and skip those words when counting.

Step-3: Output the frequencies of phrases in an English text file, sorted from high to low, and show each phrase's percentage to two decimal places.

Step-4: Normalize all verb forms before counting.

Step-0: Output the frequencies of the 26 letters in an English text file, sorted from high to low, and show each letter's percentage to two decimal places.

The initial idea was to strip out all the stray symbols, then walk through every letter of the text file and keep the counts in a dictionary: look up each letter's entry and add one to its value. The concrete implementation is as follows:

#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author: Enoch time:2018/10/22 0031

import time
import re
import operator
from string import punctuation           

start = time.clock()
'''function: count the letter frequency of one line
    input:  line : a string holding one line of text (already lower-cased)
            counts: a dictionary carrying the running letter counts
    output: counts: a dictionary whose keys are letters and values are frequencies
    date: 2018/10/22
'''
def ProcessLine(line,counts):
    # Strip everything that is not a lowercase letter, then count letter by letter
    line = re.sub('[^a-z]', '', line)
    for ch in line:
        counts[ch] = counts.get(ch, 0) + 1
    return counts

def main():
    file = open("../Gone With The Wind.txt", 'r')
    wordsCount = 0
    alphabetCounts = {}
    for line in file:
        alphabetCounts = ProcessLine(line.lower(), alphabetCounts)
    wordsCount = sum(alphabetCounts.values())
    alphabetCounts = sorted(alphabetCounts.items(), key=lambda k: k[0])
    alphabetCounts = sorted(alphabetCounts, key=lambda k: k[1], reverse=True)
    for letter, fre in alphabetCounts:
        print("|\t{:15}|{:<11.2%}|".format(letter, fre / wordsCount))

    file.close()


if __name__ == '__main__':
    main()

end = time.clock()
print (end-start)

In theory this code is correct. To verify it, we use three text files as unit tests: an empty file, a small sample file, and a file with a larger sample, and check each one. The unit test code is:

from count import CountLetters
CountLetters("Null.txt")
CountLetters("Test.txt")
CountLetters("gone_with_the_wind.txt")

Here:

  • Null.txt is an empty text file
  • gone_with_the_wind.txt is the text of Gone with the Wind
  • Test.txt is a small file whose content we fixed ourselves, so the counts can be checked against known answers

Our checks show the results are correct. With correctness established we still did not know the code coverage, so we analyzed the code with the coverage tool, running the following command:

coverage run my_program.py arg1 arg2
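
The run itself only records the execution data; a summary table like the one below can then be printed with a second command, for example:

coverage report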

The result is as follows:

Name                      Stmts   Exec  Cover
---------------------------------------------
CountLetters                 56     50    100%
---------------------------------------------
TOTAL                        56     50    100%

As we can see, with code coverage at 100%, the code runs correctly.

But how fast does the program run? To understand its running speed better, we profiled it with cProfile so we could improve performance.
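
One common way to collect such a profile from the command line, sorting by cumulative time, is shown below as a sketch (we assume the script being profiled is the count.py module imported by the tests in this post):

python -m cProfile -s cumulative count.py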

The profile tells us roughly that main, ProcessLine, and ReplacePunctuations cost the most time, ProcessLine most of all, so we need to see which functions ProcessLine() calls and how long they take.

Finally, we used the visualization tool graphviz to draw the detailed time breakdown:

[Figure: call graph of the Step-0 profile]

From the graph we can see that the text has over nine thousand lines; lower() and re.sub were each called 9023 times, and the get used for letter-by-letter counting was called 1,765,982 times. Indexing one letter at a time like this is too slow, so we needed a new approach. We thought of regular expressions: iterate over the alphabet and match each letter with a regex, which gave us the second version of the function:

###################################################################################
#Name:count_letters
#Inputs:file name
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
def CountLetters(file_name,n,stopName,verbName):
    print("File name:" + os.path.abspath(file_name))
    if (stopName != None):
        stopflag = True
    else:
        stopflag = False
    if(verbName != None):
        print("Verb tenses normalizing is not supported in this function!")
    else:
        pass
    totalNum = 0
    dicNum = {}
    t0 = time.clock()
    if (stopflag == True):
        with open(stopName) as f:
            stoplist = f.readlines()
    with open(file_name) as f:
        txt = f.read().lower()
    for letter in letters:   # 'letters' is assumed to be string.ascii_lowercase defined at module level
        dicNum[letter] = len(re.findall(letter,txt))
        totalNum += dicNum[letter]
    if (stopflag == True):
        for word in stoplist:
            word = word.replace('\n','')
            try:
                del dicNum[word]   # most stop words are simply not keys of the letter-count dict
            except KeyError:
                pass
    dicNum = sorted(dicNum.items(), key=lambda k: k[0])
    dicNum = sorted(dicNum, key=lambda k: k[1], reverse=True)
    t1 = time.clock()
    display(dicNum[:n],'character',totalNum,9)
    print("Time Consuming:%4f" % (t1 - t0))

This function cut the running time from 1.14s to 0.2s. Repeating the earlier unit tests and profiling (I won't paste the results again) confirmed that, with 100% code coverage, the code still runs correctly, and showed that the regular expression now takes the longest. So we looked for yet another approach and found the built-in count method of strings. Replacing the regular expression with that method, i.e. replacing the line above:

dicNum[letter] = len(re.findall(letter,txt))

with:

dicNum[letter] = txt.count(letter) #here count is faster than re

brought the time down to 5.83e-5s, an improvement of many orders of magnitude. At this point we had essentially hit the optimization ceiling and could not go further.
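
If you want to reproduce this kind of comparison yourself, a quick timeit check along the following lines should work (a sketch; the absolute numbers depend on your machine and the text):

import re, timeit

with open("Gone With The Wind.txt") as f:
    txt = f.read().lower()
print(timeit.timeit(lambda: len(re.findall('e', txt)), number=10))  # regex version
print(timeit.timeit(lambda: txt.count('e'), number=10))             # str.count version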

Note: later versions of the code added many features, and what is shown here is the code after those additions; to run only the original functionality, pass None for the extra parameters.

Step-1: Output the top N most frequently occurring English words in a single file.

First, we need to be clear about what counts as a word:

Word: a string that starts with an English letter and consists of English letters and alphanumeric characters. Words are separated by delimiters and are case-insensitive. In the output, all words are shown in lowercase.

English letters: A-Z, a-z
Alphanumeric characters: A-Z, a-z, 0-9
Delimiters: spaces and any non-alphanumeric characters
Example: good123 is a word, 123good is not a word; good, Good and GOOD are the same word.
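
This definition maps fairly directly onto the regular expression used later in the post; a small sketch of it in action:

import re

pattern = r"[a-z][a-z0-9]*"   # starts with a letter, then letters and digits
print(re.findall(pattern, "good123 123good Good GOOD".lower()))
# ['good123', 'good', 'good', 'good'] -- note the simple pattern also picks the
# trailing 'good' out of '123good'; the code in this post accepts that approximation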

The initial idea was to strip out the stray symbols, split the text into words on spaces, walk through every word, and keep the counts in a dictionary: look up each word's entry and add one to its value. The concrete implementation is as follows:

#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author: Eron time:2018/10/22 0022
import time
import re
start = time.time()
from string import punctuation           #Temporarily useless
 
'''function:Calculate the word frequency of each line
    input:  line : a list contains a string for a row
            counts: an empty  dictionary 
    ouput:  counts: a dictionary , keys are words and values are frequencies
    data:2018/10/22
'''
def ProcessLine(line,counts):
    #Replace the punctuation mark with a space
    #line=ReplacePunctuations(line)
    line = re.sub('[^a-z0-9]', ' ', line)
    words = line.split()
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return  counts


'''function:Replace the punctuation mark with a space
    input:  line : A list containing a row of original strings
    ouput:  line: a list whose punctuation is all replaced with spaces
    data:2018/10/22
'''
def ReplacePunctuations(line):
    for ch in line :
        #Create our own symbol list
        tags = [',','.','?','"','“','”','—']
        if ch in tags:
            line=line.replace(ch," ")
    return line

'''function:Create a taboo "stopwords.txt"
    input:  line : A list contains all the words in the "Gone With The Wind.txt"
    ouput:  nono
    data:2018/10/23
'''
def CreatStopWordsTxt(list):
    file = open('stopwords.txt', 'w')

    for  str in list:
        file.write(str+'\t')
    file.close()

'''function:Remove any words that do not meet the requirements
    input: dict : A dict whose keys are words and values are frequencies
    ouput: dictProc : A  removed undesirable words dict
    data:2018/10/23
'''
def RemoveUndesirableWords(dict):
    wordsCount = 0  # Number of words
    wordsCount = sum(dict.values())
    dictProc = dict.copy();
    for temp in list(dict):
        if temp[0].isdigit():
            del dictProc[temp]
        else:
            dictProc[temp] = round(dictProc[temp] / wordsCount, 4)
    return dictProc



def CountWords(fileName):
    file = open(fileName,'r')
    count = 10 #Show the top count  words that appear most frequently

    alphabetCountsOrg={}       # Creates an empty dictionary used to calculate word frequency

    for line in file:
        alphabetCountsOrg = ProcessLine(line.lower(), alphabetCountsOrg) #Calculate the word frequency of each line

    alphabetCounts = RemoveUndesirableWords(alphabetCountsOrg) #Remove any words that do not meet the requirements


    pairs = list(alphabetCounts.items())    #Get the key-value pairs from the dictionary
    items = [[x,y]for (y,x)in pairs]        #key-value pairs in the list exchange locations, data pairs sort
    items.sort(reverse=True)

    #Notice we didn't order words of the same frequency

    for i in range(min(count, len(items))):   # guard against files with fewer than `count` words
        print(items[i][1] + "\t" + str(items[i][0]))
    file.close()
    #CreatStopWordsTxt(alphabetCounts.keys())

 
if __name__ == '__main__':
    CountWords("gone_with_the_wind.txt")

end = time.time()
print (end-start)

In theory this code is correct. To verify it, we again use three text files as unit tests: an empty file, a small sample file, and a file with a larger sample. The unit test code is:

from count import CountWords
CountWords("Null.txt")
CountWords("Test.txt")
CountWords("gone_with_the_wind.txt")

Here:

  • Null.txt is an empty text file
  • gone_with_the_wind.txt is the text of Gone with the Wind
  • Test.txt is a small file whose content we fixed ourselves, so the counts can be checked against known answers

Our checks show the results are correct. With correctness established we still did not know the code coverage, so we analyzed the code with the coverage tool, running the following command:

coverage run test.py

The result is as follows:

Name                      Stmts   Exec  Cover
---------------------------------------------
CountWords                   78     92    100%
---------------------------------------------
TOTAL                        78     92    100%

As we can see, with code coverage at 100%, the code runs correctly. Since the code was modified, regression testing is needed, so we wrote the following regression test:

from count import CountLetters
from count import CountWords
CountWords("Null.txt")
CountWords("Test.txt")
CountWords("gone_with_the_wind.txt")

CountLetters("Null.txt")
CountLetters("Test.txt")
CountLetters("gone_with_the_wind.txt")

But how fast does the program run? To understand its running speed better, we again profiled it with cProfile so we could improve performance.

The profile tells us roughly that sub, split, and get cost the most time, sub most of all, so we need to see which functions ProcessLine() calls and how long they take.

Finally, we used the visualization tool graphviz to draw the detailed time breakdown:

From the call graph we can see that the text has over nine thousand lines; lower() and re.sub were each called 9023 times, and the get used for counting was called 1,765,982 times. Indexing one word at a time like this is too slow, so we needed a new approach. This time we let a regular expression do the heavy lifting: use re.findall to pull every word into a list, then count the duplicates with collections.Counter. That gives the following code:

###################################################################################
#Name:count_words
#Inputs:file name,the first n words, stopfile name
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
def CountWords(file_name,n,stopName,verbName):
    print("File name:" + sys.path[0] + "\\" + file_name)
    if (stopName != None):
        stopflag = True
    else:
        stopflag = False
    if(verbName != None):
        verbflag = True
    else:
        verbflag = False
    t0 = time.clock()
    with open(file_name) as f:
        txt = f.read()
    txt = txt.lower()
    if(stopflag == True):
        with open(stopName) as f:
            stoplist = f.readlines()
    pattern = r"[a-z][a-z0-9]*"
    wordList = re.findall(pattern,txt)
    totalNum = len(wordList)
    tempc = Counter(wordList)
    if (stopflag == True):
        for word in stoplist:
            word = word.replace('\n','')
            if word in tempc:      # guard: a stop word may not occur in the text at all
                del tempc[word]
    dicNum = dict(tempc.most_common(n))
    if (verbflag == True):
        totalNum = 0
        verbDic = {}
        verbDicNum = {}
        with open(verbName) as f:
            for line in f.readlines():
                key,value = line.split(' -> ')
                verbDic[key] = value.replace('\n','').split(',')
                verbDicNum[key] = tempc[key]
                for word in verbDic[key]:
                    verbDicNum[key] += tempc[word]
                totalNum += verbDicNum[key]
        verbDicNum = sorted(verbDicNum.items(), key=lambda k: k[0])
        verbDicNum = sorted(verbDicNum, key=lambda k: k[1], reverse=True)
    dicNum = sorted(dicNum.items(), key=lambda k:k[0])
    dicNum = sorted(dicNum, key=lambda k:k[1], reverse=True)
    t1 = time.clock()
    if (verbflag == True):
        display(verbDicNum[:n], 'words',totalNum,3)
    else:
        display(dicNum,'words',totalNum,3)
    print("Time Consuming:%4f" % (t1 - t0))

After this change we still need unit tests and regression tests; to avoid repetition I won't list them again. The time dropped to 0.34s, a very large improvement; at this point we had essentially reached the optimization ceiling and could not go further.

Step-2: Support stop words: build a stop word file (stop list) and skip those words when counting.

The stop list did not need the kind of performance tuning the earlier steps did, because it builds on functions that are already optimized, so only unit tests and regression tests are required. First, the implementation idea: we already have the count of every word, so all we have to do is read the words from the stop word file and delete them from the counts to get the desired result. Since the counting uses a Counter, we simply iterate over the stop words and delete each one from the Counter. The resulting code is:

if(stopflag == True):
    with open(stopName) as f:
        stoplist = f.readlines()    
if (stopflag == True):
    for word in stoplist:
        word = word.replace('\n','')
        del tempc[word]

As before, we use three text files for the unit tests: an empty file, a small sample file, and a file with a larger sample. The unit test code is:

from count import CountWords
CountWords("Null.txt","Stopwords.txt")
CountWords("Test.txt","Stopwords.txt")
CountWords("gone_with_the_wind.txt","Stopwords.txt")

Here:

  • Null.txt is an empty text file
  • gone_with_the_wind.txt is the text of Gone with the Wind
  • Test.txt is a small file whose content we fixed ourselves, so the counts can be checked against known answers

Our checks show the results are correct. With correctness established we still did not know the code coverage, so we analyzed the code with the coverage tool, running the following command:

coverage run test.py

The result is as follows:

Name                      Stmts   Exec  Cover
---------------------------------------------
CountWords                   78     92    100%
---------------------------------------------
TOTAL                        78     92    100%

As we can see, with code coverage at 100%, the code runs correctly. Since the code was modified, regression testing is needed, so we wrote the following regression test:

from count import CountLetters
from count import CountWords
CountWords("Null.txt","Stopwords.txt")
CountWords("Test.txt","Stopwords.txt")
CountWords("gone_with_the_wind.txt","Stopwords.txt")

CountLetters("Null.txt","Stopwords.txt")
CountLetters("Test.txt","Stopwords.txt")
CountLetters("gone_with_the_wind.txt","Stopwords.txt")

We found that the earlier CountLetters did not support stop words, so we went back and modified that function as well. Since it does not use a Counter, supporting stop words means deleting the entry from its dictionary instead, which gives:

if (stopflag == True):
    with open(stopName) as f:
        stoplist = f.readlines()
if (stopflag == True):
    for word in stoplist:
        word = word.replace('\n','')
        try:
            del dicNum[word]   # the letter-count dict; most stop words are simply not keys here
        except KeyError:
            pass

After unit testing and regression testing, the results are correct.

Step-3: Output the frequencies of phrases in an English text file, sorted from high to low, and show each phrase's percentage to two decimal places.

First, we need to be clear about what counts as a phrase:

Phrase: two or more English words separated only by spaces. See the examples below:

  hello world   //this is a phrase

  hello, world //this is not a phrase

This means a single sentence can contain many phrases. For example:

I am not a good boy.

This sentence contains: I am, am not, not a, a good, good boy.
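
For length-2 phrases, the overlapping pairs we need to count are easy to picture with a tiny sketch:

words = "i am not a good boy".split()
bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
print(bigrams)   # ['i am', 'am not', 'not a', 'a good', 'good boy']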

This is where a plain regular expression struggles, because a single pass cannot capture matches that overlap like this. My partner's idea was to first split the text into sentences, then slide a for loop over each sentence to pull out the phrases, which led to the following code:

#!/usr/bin/env python
#-*- coding:utf-8 -*-
#author: albert time:2018/10/23 0023

import time
import re
import string
from collections import Counter

start = time.time()
from string import punctuation  # Temporarily useless

def NumWordFrequency(fileContent,number):
    fileContent = re.sub('\n|\t',' ',fileContent)
    mPunctuation = r',|;|\?|\!|\.|\:|\“|\"|\”'
    sentenceList = re.split(mPunctuation , fileContent)#Divide the text into sentences according to the punctuation marks
    wordsCounts = {}  # Creates an empty dictionary used to calculate word frequency
    for oneSentence in sentenceList:
        wordsCounts = ProcessLine(oneSentence.lower(), wordsCounts,number)  # Calculate the specified length phrase frequency
    return wordsCounts


'''function: count the phrases of a given length in one sentence
    input:  sentence : a string holding one sentence (already lower-cased)
            countsDict: a dictionary carrying the running phrase counts
            number: the phrase length (number of words per phrase)
    output: countsDict: a dictionary whose keys are phrases and values are frequencies
    date: 2018/10/22
'''

def ProcessLine(sentence, countsDict,number):
    # Replace the punctuation mark with a space
    # line=ReplacePunctuations(line)
    sentence = re.sub('[^a-z0-9]', ' ', sentence)
    words = sentence.split()
    if len(words) >= number:
        for i in range(len(words)-number+1):
            countsDict[" ".join(words[i:i+number])] = countsDict.get(" ".join(words[i:i+number]), 0) + 1
    else:
        if sentence.strip()=='':   #Judge if the sentence is empty
            return countsDict
        countsDict[sentence] = countsDict.get(sentence, 0) + 1
    return countsDict


'''function:Replace the punctuation mark with a space
    input:  line : A list containing a row of original strings
    ouput:  line: a list whose punctuation is all replaced with spaces
    data:2018/10/22
'''

def ReplacePunctuations(line):
    for ch in line:
        # Create our own symbol list
        tags = [',', '.', '?', '"', '“', '”', '—']
        if ch in tags:
            line = line.replace(ch, " ")
    return line


'''function:Create a taboo "stopwords.txt"
    input:  line : A list contains all the words in the "Gone With The Wind.txt"
    ouput:  nono
    data:2018/10/23
'''

def CreatStopWordsTxt(list):
    file = open('stopwords.txt', 'w')

    for str in list:
        file.write(str + '\t')
    file.close()

'''function:Remove any words that do not meet the requirements
    input: dict : A dict whose keys are words and values are frequencies
    ouput: dict : A  removed undesirable words dict
    data:2018/10/23
'''
def RemoveUndesirableWords(dict):
    '''
        wordsCount = 0  # Number of words
        wordsCount = sum(dict.values())
    '''
    listKey = list(dict)
    for temp in listKey:
        if temp[0].isdigit():
            del dict[temp]
        #else:
           # dict[temp] = round(dict[temp] , 4)
    return dict

'''function:Remove the words from the "stopwords.txt"
    input: dict : A list transformed by a dict whose keys are words and values are frequencies
    ouput: dictProc : A list after removing stopwords
    data:2018/10/23
'''

def StopWordProcessing(dict):
    fileTabu = open("stopwords1.txt", 'r')
    stopWordlist = fileTabu.read()
    fileTabu.close()

    stopWordlist = re.sub('[^a-z0-9]', ' ', stopWordlist).split(' ')
    dictProc = dict.copy()
    for temp in dict.keys():
        if temp.strip() in stopWordlist:
            del dictProc[temp]
    return dictProc

class WordFinder(object):
    '''A compound structure of dictionary and set to store word mapping'''
    def __init__(self):

        self.mainTable = {}
        for char in string.ascii_lowercase:
            self.mainTable[char] = {}
        self.specialTable = {}
        #print(self.mainTable)
        for headword, related in lemmas.items():
            # Only 3 occurrences of uppercase in lemmas.txt, which include 'I'
            # Trading precision for simplicity
            headword = headword.lower()
            try:
                related = related.lower()
            except AttributeError:
                related = None
            if related:
                for word in related.split():
                    if word[0] != headword[0]:
                        self.specialTable[headword] = set(related.split())
                        break
                    else:
                        self.mainTable[headword[0]][headword] = set(related.split())
            else:
                self.mainTable[headword[0]][headword] = None
        #print(self.specialTable)
        #print(self.mainTable)
    def find_headword(self, word):
        """Search the 'table' and return the original form of a word"""
        word = word.lower()
        alphaTable = self.mainTable[word[0]]
        if word in alphaTable:
            return word

        for headword, related in alphaTable.items():
            if related and (word in related):
                return headword

        for headword, related in self.specialTable.items():
            if word == headword:
                return word
            if word in related:
                return headword
        # This should never happen after the removal of words not in valid_words
        # in Book.__init__()
        return None

    # TODO
    def find_related(self, headword):
        pass


def VerbTableFrequency(fileContent):
    global lemmas
    global  allVerbWords
    lemmas = {}
    allVerbWords = set()
    with open('verbs.txt') as fileVerb:
        # print(fileVerb.read())
        for line in fileVerb:
            # print(line)
            line = re.sub(r'\n|\s|\,', ' ', line)
            headWord = line.split('->')[0].strip()   # strip the spaces around '->' so base forms match words in the text
            try:
                related = line.split('->')[1]
                # print(related)

            except IndexError:
                related = None
            lemmas[headWord] = related

    allVerbWords = set()
    for headWord, related in lemmas.items():
        allVerbWords.add(headWord)
        # print(allVerbWords)
        # print("\t")
        if related:
            allVerbWords.update(set(related.split()))
            # allVerbWords.update(related)

    tempList = re.split(r'\b([a-zA-Z-]+)\b',fileContent)
    tempList = [item for item in tempList if (item in allVerbWords)]
    finder = WordFinder()
    tempList = [finder.find_headword(item) for item in tempList]

    cnt = Counter()
    for word in tempList:
        cnt[word] += 1
    #print(type(cnt))
    return cnt

def main():
    with open("Gone With The Wind.txt") as file :
        content = file.read().lower()

    outCounts = 10  # Show the top count  words that appear most frequently
    number = 1  #Phrase length
    flag = 1


    if flag == 1:
        verbFreCount = VerbTableFrequency(content)
        #print(type(cnt))

        wordsCounts ={}
        for word in sorted(verbFreCount, key=lambda x: verbFreCount[x], reverse=True):
            wordsCounts[word] = verbFreCount[word]
        print(type(wordsCounts))
        freCountNum = sum(wordsCounts.values())

        #print (freCountNum )
        for word, fre in list(wordsCounts.items())[0:outCounts]:
            print("|\t{:15}|{:<11.2f}|".format(word,fre / freCountNum))
        print("--------------------------------")


    else:
        wordsCounts = NumWordFrequency(content,number)
        wordsCounts = RemoveUndesirableWords(wordsCounts)  # Remove any words that do not meet the requirements
        wordsCounts = StopWordProcessing(wordsCounts)  # Remove the words from the "stopwords.txt"

        pairsList = list(wordsCounts.items())  # Get the key-value pairsList from the dictionary
        items = [[x, y] for (y, x) in pairsList]  # key-value pairsList in the list exchange locations, data pairsList sort
        items.sort(reverse=True)
        # Notice we didn't order words of the same frequency
        for i in range(min(outCounts, len(items))):   # guard against texts with fewer phrases than outCounts
            print(items[i][1] + "\t" + str(items[i][0]))


if __name__ == '__main__':
    main()

end = time.time()
print(end - start)

As before, we use three text files for the unit tests: an empty file, a small sample file, and a file with a larger sample. The unit test code is:

from count import CountPhrase
CountPhrase("Null.txt",2)
CountPhrase("Test.txt",2)
CountPhrase("gone_with_the_wind.txt",2)

CountPhrase("Null.txt",2,"Stopwords.txt")
CountPhrase("Test.txt",2,"Stopwords.txt")
CountPhrase("gone_with_the_wind.txt",2,"Stopwords.txt")

Here:

  • Null.txt is an empty text file
  • gone_with_the_wind.txt is the text of Gone with the Wind
  • Test.txt is a small file whose content we fixed ourselves, so the counts can be checked against known answers

Our checks show the results are correct. With correctness established we still did not know the code coverage, so we analyzed the code with the coverage tool, running the following command:

coverage run test.py

The result is as follows:

Name                      Stmts   Exec  Cover
---------------------------------------------
CountPhrase                  78     92    100%
---------------------------------------------
TOTAL                        78     92    100%

As we can see, with code coverage at 100%, the code runs correctly. Since the code was modified, regression testing is needed, so we wrote the following regression test:

from count import CountLetters
from count import CountWords
from count import CountPhrase

CountWords("Null.txt","Stopwords.txt")
CountWords("Test.txt","Stopwords.txt")
CountWords("gone_with_the_wind.txt","Stopwords.txt")

CountLetters("Null.txt","Stopwords.txt")
CountLetters("Test.txt","Stopwords.txt")
CountLetters("gone_with_the_wind.txt","Stopwords.txt")

CountPhrase("Null.txt",2)
CountPhrase("Test.txt",2)
CountPhrase("gone_with_the_wind.txt",2)

CountPhrase("Null.txt",2,"Stopwords.txt")
CountPhrase("Test.txt",2,"Stopwords.txt")
CountPhrase("gone_with_the_wind.txt",2,"Stopwords.txt")

We found that the earlier CountPhrase did not support stop words either, so we modified that function as well, using the same idea as in CountWords.

After unit testing and regression testing, the results are correct.

But how fast does it run? To understand its running speed better, we profiled it with cProfile; the whole run took 2.39s, which we wanted to reduce.

To optimize it we hit on a neat trick: treat the article as one huge sentence, split it into sentences on the sentence-ending punctuation, and match phrases with a regular expression. A single pass misses some of the phrases, but if we delete the first word and match again we recover the ones that were skipped; repeating this until n-1 words have been deleted covers every phrase. That gives the final version of the code:
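
A tiny illustration of the trick for 2-word phrases (a sketch, not the project code): the pattern grabs non-overlapping pairs, and dropping the first word before the second pass recovers the pairs that were skipped.

import re

txt = "a b c d e"
pattern = r"[a-z]+[0-9]*[\s|,][a-z]+[0-9]*"
first = re.findall(pattern, txt)                                            # ['a b', 'c d']
second = re.findall(pattern, re.sub(r"[a-z]+[0-9]*", '', txt, 1).strip())   # ['b c', 'd e']
print(first + second)   # all overlapping pairs: ['a b', 'c d', 'b c', 'd e']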

###################################################################################
#Name:count_words
#Inputs:file name,the first n words, stopfile name
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
def CountPhrases(file_name,n,stopName,verbName,k):
    print("File name:" + sys.path[0] + "\\" + file_name)
    totalNum = 0
    if (stopName != None):
        stopflag = True
    else:
        stopflag = False
    if(verbName != None):
        verbflag = True
    else:
        verbflag = False
    t0 = time.clock()
    with open(file_name) as f:
        txt = f.read()
    txt = txt.lower()
    txt = re.sub(r'[\s|\']+',' ',txt)
    pword = r'(([a-z]+ )+[a-z]+)'  # extract sentence
    pattern = re.compile(pword)
    sentence = pattern.findall(txt)
    txt = ','.join([sentence[m][0] for m in range(len(sentence))])
    if(stopflag == True):
        with open(stopName) as f:
            stoplist = f.readlines()
    pattern = "[a-z]+[0-9]*"
    for i in range(k-1):
        pattern += "[\s|,][a-z]+[0-9]*"
    wordList = []
    for i in range(k):
        if( i == 0 ):
            tempList = re.findall(pattern, txt)
        else:
            wordpattern = "[a-z]+[0-9]*"
            txt = re.sub(wordpattern, '', txt, 1).strip()
            tempList = re.findall(pattern, txt)
        wordList += tempList
    tempc = Counter(wordList)
    if (stopflag == True):
        for word in stoplist:
            word = word.replace('\n','')
            if word in tempc:      # guard: a stop entry may not occur among the extracted phrases
                del tempc[word]
    dicNum = {}
    if (verbflag == True):
        verbDic = {}
        with open(verbName) as f:
            for line in f.readlines():
                key,value = line.split(' -> ')
                for tverb in value.replace('\n', '').split(','):
                    verbDic[tverb] = key
                verbDic[key] = key
        for phrase in tempc.keys():
            if (',' not in phrase):
                totalNum += 1
                verbList = phrase.split(' ')
                normPhrase = verbList[0]
                for verb in verbList[1:]:
                    if verb in verbDic.keys():
                        verb = verbDic[verb]
                    normPhrase += ' ' + verb
                if (normPhrase in dicNum.keys()):
                    dicNum[normPhrase] += tempc[phrase]
                else:
                    dicNum[normPhrase] = tempc[phrase]
    else:
        phrases = tempc.keys()
        for phrase in phrases:
            if (',' not in phrase):
                dicNum[phrase] = tempc[phrase]
                totalNum += tempc[phrase]
    dicNum = sorted(dicNum.items(), key=lambda k: k[0])
    dicNum = sorted(dicNum, key=lambda k: k[1], reverse=True)
    t1 = time.clock()
    display(dicNum[:n], 'Phrases',totalNum,3)
    print("Time Consuming:%4f" % (t1 - t0))

Running the unit tests and regression tests above again showed that the results are unchanged and the time dropped to 1.8s, which met our optimization goal.

Step-4: Normalize all verb forms before counting.

First, let's look at what the verb forms look like in Verbs.txt:

abandon -> abandons,abandoning,abandoned
abase -> abases,abasing,abased
abate -> abates,abating,abated
abbreviate -> abbreviates,abbreviating,abbreviated
abdicate -> abdicates,abdicating,abdicated
abduct -> abducts,abducting,abducted
abet -> abets,abetting,abetted
abhor -> abhors,abhorring,abhorred

The base form of each verb is on the left and its other forms are on the right. Since all the words have already been counted, what remains is to read verbs.txt into a dictionary and use it to add together the counts of the different forms of the same verb, which gives the following code:

if (verbflag == True):
    totalNum = 0
    verbDic = {}
    verbDicNum = {}
    with open(verbName) as f:
        for line in f.readlines():
            key,value = line.split(' -> ')
            verbDic[key] = value.replace('\n','').split(',')
            verbDicNum[key] = tempc[key]
            for word in verbDic[key]:
                verbDicNum[key] += tempc[word]
            totalNum += verbDicNum[key]

As before, we use three text files for the unit tests: an empty file, a small sample file, and a file with a larger sample. The unit test code is:

from count import CountWords,CountPhrases
CountWords("Null.txt","Verbs.txt")
CountWords("Test.txt","Verbs.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt")

CountWords("Null.txt","Verbs.txt","Verbs.txt","stopwords.txt")
CountWords("Test.txt","Verbs.txt","Verbs.txt","stopwords.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt","stopwords.txt")

CountWords("Null.txt","Verbs.txt")
CountWords("Test.txt","Verbs.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt")

CountWords("Null.txt","Verbs.txt","Verbs.txt""stopphrases.txt")
CountWords("Test.txt","Verbs.txt","Verbs.txt""stopphrases.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt","stopphrases.txt")

Here:

  • Null.txt is an empty text file
  • gone_with_the_wind.txt is the text of Gone with the Wind
  • Test.txt is a small file whose content we fixed ourselves, so the counts can be checked against known answers

For words, our checks show the results are correct. But we found that phrases did not yet support verbs.txt, so we modified the phrase counting as well. How do we normalize? We hit on a neat idea: use every verb form as a key and its base form as the value, so looking up any form maps it back to the base form. That gives the following code:

    if (verbflag == True):
        verbDic = {}
        with open(verbName) as f:
            for line in f.readlines():
                key,value = line.split(' -> ')
                for tverb in value.replace('\n', '').split(','):
                    verbDic[tverb] = key
                verbDic[key] = key
        for phrase in tempc.keys():
            if (',' not in phrase):
                totalNum += 1
                verbList = phrase.split(' ')
                normPhrase = verbList[0]
                for verb in verbList[1:]:
                    if verb in verbDic.keys():
                        verb = verbDic[verb]
                    normPhrase += ' ' + verb
                if (normPhrase in dicNum.keys()):
                    dicNum[normPhrase] += tempc[phrase]
                else:
                    dicNum[normPhrase] = tempc[phrase]

After this verification we still did not know the code coverage, so we again analyzed the code with the coverage tool, running the following command:

coverage run test.py

The result is as follows:

Name                      Stmts   Exec  Cover
---------------------------------------------
CountWords                   78     92    100%
---------------------------------------------
TOTAL                        78     92    100%

As we can see, with code coverage at 100%, the code runs correctly. Since the code was modified, regression testing is needed, so we wrote the following regression test:

from count import CountLetters
from count import CountWords
CountWords("Null.txt","Verbs.txt")
CountWords("Test.txt","Verbs.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt")

CountWords("Null.txt","Verbs.txt","stopwords.txt")
CountWords("Test.txt","Verbs.txt","stopwords.txt")
CountWords("gone_with_the_wind.txt","Verbs.txt","stopwords.txt")

CountWords("Null.txt")
CountWords("Test.txt")
CountWords("gone_with_the_wind.txt")

CountLetters("Null.txt","Verbs.txt","Stopwords.txt")
CountLetters("Test.txt","Verbs.txt","Stopwords.txt")
CountLetters("gone_with_the_wind.txt","Verbs.txt","Stopwords.txt")

CountLetters("Null.txt","Stopwords.txt")
CountLetters("Test.txt","Stopwords.txt")
CountLetters("gone_with_the_wind.txt","Stopwords.txt")

CountLetters("Null.txt")
CountLetters("Test.txt")
CountLetters("gone_with_the_wind.txt")

We found that the earlier CountLetters did not support verbs.txt either, so we started modifying that function too, but then decided that normalizing verb forms before counting letter frequencies is meaningless, so we removed that code.

Step-5: Count the occurrences of verb-preposition phrases.

First, let's look at the definition of a verb-preposition phrase:

VerbPhrase := Verb + Spaces + Preposition
Spaces := Space+
Space := ' ' | '\t' | '\r' | '\n'
Preposition := <any one from the list of prepositions>
Verb := <any one in any tense FROM THE DICTIONARY>

At first we did not see how closely Step 5 is related to Step 4, so we wrote this step from scratch: we built an extremely long regular expression, essentially a for loop joining all the verbs with '|'. That run took 56 seconds, which is unusable, so we abandoned the approach outright without even doing unit tests or profiling, since the time alone made it a non-starter. Then we remembered that Step 4 already counts all the phrases, so we could reuse those counts: normalize the verb and then just check whether the second word is a preposition. For the normalization we used the same neat idea as before: every verb form is a key and the base form is its value, so any form maps back to the base form. That gives the final code:

###################################################################################
#Name:count_words
#Inputs:file name,the first n words, stopfile name
#outputs:None
#Author: Thomas
#Date:2018.10.22
###################################################################################
def CountVerbPre(file_name,n,stopName,verbName,preName):
    print("File name:" + sys.path[0] + "\\" + file_name)
    dicNum = {}
    totalNum = 0
    if (stopName != None):
        stopflag = True
    else:
        stopflag = False
    t0 = time.clock()
    with open(file_name) as f:
        txt = f.read()
    txt = txt.lower()
    txt = re.sub(r'[\s|\']+',' ',txt)
    pword = r'(([a-z]+ )+[a-z]+)'  # extract sentence
    pattern = re.compile(pword)
    sentence = pattern.findall(txt)
    txt = ','.join([sentence[m][0] for m in range(len(sentence))])
    if(stopflag == True):
        with open(stopName) as f:
            stoplist = f.readlines()
    pattern = "[a-z]+[0-9]*"
    for i in range(1):
        pattern += "[\s|,][a-z]+[0-9]*"
    wordList = []
    for i in range(2):
        if( i == 0 ):
            tempList = re.findall(pattern, txt)
        else:
            wordpattern = "[a-z]+[0-9]*"
            txt = re.sub(wordpattern, '', txt, 1).strip()
            tempList = re.findall(pattern, txt)
        wordList += tempList

    tempc = Counter(wordList)
    with open(preName) as f:
        preTxt = f.read()
    preList = preTxt.split('\n')
    verbDic = {}
    with open(verbName) as f:
        for line in f.readlines():
            key,value = line.split(' -> ')
            for tverb in value.replace('\n','').split(','):
                verbDic[tverb] = key
            verbDic[key] = key
    for phrase in tempc.keys():
        if(',' not in phrase):
            totalNum += 1
            verb, pre = phrase.split(' ')
            if (verb in verbDic.keys() and pre in preList):
                normPhrase = verbDic[verb] + ' ' + pre
                if (normPhrase in dicNum.keys()):
                    dicNum[normPhrase] += tempc[phrase]
                else:
                    dicNum[normPhrase] = tempc[phrase]
    if (stopflag == True):
        for word in stoplist:
            word = word.replace('\n','')
            if word in dicNum:     # guard: a stop entry may not be among the verb-preposition phrases
                del dicNum[word]
    dicNum = sorted(dicNum.items(), key=lambda k: k[0])
    dicNum = sorted(dicNum, key=lambda k: k[1], reverse=True)
    t1 = time.clock()
    display(dicNum[:n], 'VerbPre',totalNum, 3)
    print("Time Consuming:%4f"%(t1-t0))