Software Engineering Assignment 4: Word Frequency Count (Basic Features)

Posted by 阿新 on 2018-10-21


I. Basic Information

1. Development environment: PyCharm 2018, Python 3.8

2. Authors: 1613072007 周磊

1613072008 俞林森

3. Project repository: https://gitee.com/ntucs/PairProg.git

II. Project Analysis

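1.1. Read the file into a buffer: process_file(dst, f)

def process_file(dst, f):  # Read the file into a buffer
    try:  # Open the file
        doc = open(dst, 'r')
    except IOError as s:
        print(s)
        return None
    try:  # Read the whole file into the buffer
        bvffer = doc.read()
    except (IOError, UnicodeDecodeError):
        print("Read File Error!")
        return None
    finally:
        doc.close()  # Close the file whether or not the read succeeded
    return bvffer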

1.2. Count lines: process_line(dst, f)

def process_line(dst, f):  # Count non-empty lines
    count = 0
    with open(dst, 'r') as doc:
        for line in doc:
            if line != '' and line != '\n':  # Skip empty lines
                count += 1
    print('lines:', count, file=f)

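1.3. Count words: process_buffer(bvffer, f)

import re

def process_buffer(bvffer, f):
    if bvffer:
        word_freq = {}
        # Process the buffer bvffer: count each word's frequency and store it in the dict word_freq
        bvffer = bvffer.lower()  # Convert the text to lowercase
        for ch in "“‘!;,.?”":  # Replace Chinese and English punctuation with spaces
            bvffer = bvffer.replace(ch, " ")
        bvffer = bvffer.strip().split()  # strip() removes surrounding whitespace; split() splits on whitespace
        regex = r"^[a-z]{4}(\w)*"  # A word starts with at least four lowercase letters
        words = []
        for word in bvffer:  # Keep only tokens that satisfy the word rule
            if re.match(regex, word):
                words.append(word)
        print('words:', len(words), file=f)  # Write the total number of words
        for word in words:  # Build the frequency dict
            word_freq[word] = word_freq.get(word, 0) + 1
        for key in list(word_freq):  # Remove common (stop) words
            if key in st1:  # st1 holds the text of stop_word.txt (loaded in Section III)
                del word_freq[key]
        return word_freq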
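
To make the word rule concrete, the sketch below applies the same pattern to a few sample tokens. The sample tokens and the helper name is_word are illustrative assumptions, not part of the project. Note that re.match only anchors at the start, so a token such as wind's is accepted because its first four characters are lowercase letters.

```python
import re

# Same pattern as in process_buffer: four lowercase letters, then any word characters.
WORD_RE = re.compile(r"^[a-z]{4}(\w)*")


def is_word(token):
    """Return True if the token counts as a word under the rule above (hypothetical helper)."""
    return WORD_RE.match(token) is not None


samples = ["file", "file123", "123file", "the", "wind's"]
accepted = [t for t in samples if is_word(t)]
print(accepted)
```

Tokens shorter than four letters (the) and tokens starting with a digit (123file) are rejected.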

1.4. Count two-word phrases: process_higth2(bvffer, f)

def process_higth2(bvffer, f):  # Count two-word phrases
    punct = "’“‘!;,.?”"  # Punctuation that separates or surrounds words
    Phrase = []
    Phrase_freq = {}
    words = bvffer.strip().split()  # Split into words
    for y in range(len(words) - 1):
        if words[y][-1] in punct or words[y + 1][0] in punct:  # Punctuation between the two words
            continue
        elif words[y][0] in punct:  # Punctuation before the first word
            words[y] = words[y][1:]
        elif words[y + 1][-1] in punct:  # Punctuation after the second word
            words[y + 1] = words[y + 1][:-1]
        Phrase.append(words[y] + ' ' + words[y + 1])  # Add the phrase to the list Phrase
    for ph in Phrase:
        Phrase_freq[ph] = Phrase_freq.get(ph, 0) + 1  # Build the phrase dict
    return Phrase_freq

1.5. Count three-word phrases: process_higth3(bvffer, f)

def process_higth3(bvffer, f):  # Count three-word phrases
    punct = "’“‘!;,.?”"  # Punctuation that separates or surrounds words
    Phrase = []
    Phrase_freq1 = {}
    words = bvffer.strip().split()  # Split into words
    for y in range(len(words) - 2):
        if words[y][-1] in punct or words[y + 1][0] in punct:  # Punctuation between words 1 and 2
            continue
        elif words[y + 1][-1] in punct or words[y + 2][0] in punct:  # Punctuation between words 2 and 3
            continue
        elif words[y][0] in punct:  # Punctuation before the first word
            words[y] = words[y][1:]
        elif words[y + 2][-1] in punct:  # Punctuation after the third word (the original tested words[y + 1], which could never be true here)
            words[y + 2] = words[y + 2][:-1]
        Phrase.append(words[y] + ' ' + words[y + 1] + ' ' + words[y + 2])  # Add the phrase to the list Phrase
    for ph in Phrase:
        Phrase_freq1[ph] = Phrase_freq1.get(ph, 0) + 1  # Build the phrase dict
    return Phrase_freq1
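
The two phrase functions differ only in the window size, so the idea can be sketched once for any n. This is a simplified illustration rather than the project's code: the name count_phrases is hypothetical, and it assumes a phrase is dropped whenever punctuation appears between two of its words, while punctuation on the outer edges of the phrase is stripped.

```python
from collections import Counter

PUNCT = "’“‘!;,.?”"  # same punctuation set as the functions above


def count_phrases(text, n):
    """Count n-word phrases, skipping windows with punctuation between words."""
    words = text.strip().split()
    freq = Counter()
    for i in range(len(words) - n + 1):
        window = words[i:i + n]  # a copy, so the word list itself is not modified
        # Punctuation at an inner boundary means the words do not form one phrase.
        broken = any(
            window[k][-1] in PUNCT or window[k + 1][0] in PUNCT
            for k in range(n - 1)
        )
        if broken:
            continue
        # Strip punctuation only from the outer edges of the phrase.
        window[0] = window[0].lstrip(PUNCT)
        window[-1] = window[-1].rstrip(PUNCT)
        if window[0] and window[-1]:
            freq[" ".join(window)] += 1
    return freq


sample = "gone with the wind. gone with the rain, gone with the wind"
print(count_phrases(sample, 2).most_common(2))
print(count_phrases(sample, 3)["gone with the"])
```

Calling count_phrases(text, 2) and count_phrases(text, 3) covers the two cases handled separately above.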

1.6. Output words: output_result(word_freq, f)

def output_result(word_freq, f):  # Output the most frequent entries
    if word_freq:
        sorted_word_freq = sorted(word_freq.items(), key=lambda v: v[1], reverse=True)
        for item in sorted_word_freq[:10]:  # Output the top 10 entries
            print(item[0], ':', item[1], file=f)
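
As an aside, the standard library's collections.Counter offers most_common, which yields the same top-N ordering as the sort above. A minimal sketch, with made-up counts and a hypothetical function name top_items:

```python
from collections import Counter


def top_items(word_freq, n=10):
    """Return the n most frequent (item, count) pairs from a frequency dict."""
    return Counter(word_freq).most_common(n)


example_freq = {"scarlett": 5, "rhett": 3, "tara": 4, "ashley": 1}
for word, count in top_items(example_freq, 3):
    print(word, ":", count)
```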

1.7. Main function: main()

def main():
    dst = "Gone_with_the_wind.txt"  # A_Tale_of_Two_Cities  Gone_with_the_wind
    with open('result.txt', 'w') as f:  # Output file for the results
        process_line(dst, f)
        bvffer = process_file(dst, f)  # Read the file into a buffer
        word_freq = process_buffer(bvffer, f)  # Build the word dict
        Phrase_freq2 = process_higth2(bvffer, f)  # Build the two-word phrase dict
        Phrase_freq3 = process_higth3(bvffer, f)  # Build the three-word phrase dict

        output_result(word_freq, f)  # Output the top 10 words
        print('Top 10 two-word phrases:', file=f)
        output_result(Phrase_freq2, f)  # Output the top 10 two-word phrases
        print('Top 10 three-word phrases:', file=f)
        output_result(Phrase_freq3, f)  # Output the top 10 three-word phrases

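1.8. Performance test

import cProfile
import pstats

if __name__ == "__main__":
    cProfile.run("main()", filename="result")  # Profile main() and save the statistics to the file "result"
    p = pstats.Stats("result")
    p.strip_dirs().sort_stats("calls").print_stats(10)  # Top 10 by call count
    p.strip_dirs().sort_stats("cumulative", "name").print_stats(10)  # Top 10 by cumulative time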
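
For experimenting with the profiler without the novel text files, the same cProfile and pstats calls can be tried on a toy workload. In this sketch busy_work is a hypothetical stand-in for main(), and the report is captured in a string instead of being written to a statistics file.

```python
import cProfile
import io
import pstats


def busy_work(n):
    """Toy workload standing in for main() (hypothetical)."""
    return sum(i * i for i in range(n))


profiler = cProfile.Profile()
profiler.enable()
result = busy_work(10000)
profiler.disable()

# Send the report to an in-memory stream so it can be inspected or saved.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
report = stream.getvalue()
print(report)
```

Sorting by "cumulative" puts the functions that account for the most total time, including their callees, at the top of the report.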

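2.1. Time and space complexity

The time and space complexity are worked out using the following code as an example:

def process_higth2(bvffer, f):  # Count two-word phrases
    punct = "’“‘!;,.?”"  # Punctuation that separates or surrounds words
    Phrase = []
    Phrase_freq = {}
    words = bvffer.strip().split()  # Split into words
    for y in range(len(words) - 1):
        if words[y][-1] in punct or words[y + 1][0] in punct:  # Punctuation between the two words
            continue
        elif words[y][0] in punct:  # Punctuation before the first word
            words[y] = words[y][1:]
        elif words[y + 1][-1] in punct:  # Punctuation after the second word
            words[y + 1] = words[y + 1][:-1]
        Phrase.append(words[y] + ' ' + words[y + 1])  # Add the phrase to the list Phrase
    for ph in Phrase:
        Phrase_freq[ph] = Phrase_freq.get(ph, 0) + 1  # Build the phrase dict
    return Phrase_freq

Because the two for loops are sequential rather than nested, the time complexity is O(n), where n is the number of words. The space complexity is also O(n), since the list words, the list Phrase, and the dict Phrase_freq each hold up to n entries.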

3.1. Screenshot of a sample run

[Image: sample program output]

Stop-word list:

[Image: stop-word list]

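III. Performance Analysis

We spent a little over an hour improving the program's performance. The original run time was about 10 seconds; after the code changes it dropped to 3.341 seconds.

The main code change is as follows: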

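def process_buffer(bvffer, f):
    if bvffer:
        word_freq = {}
        # Process the buffer bvffer: count each word's frequency and store it in the dict word_freq
        for ch in "“‘!;,.?”":  # Lowercase the text and replace Chinese and English punctuation with spaces
            bvffer = bvffer.lower().replace(ch, " ")
            words = bvffer.strip().split()  # strip() removes surrounding whitespace; split() splits on whitespace
        for word in words:  # Check whether each token satisfies the word rule
            y = 1
            s = True
            for x in word:  # Inspect at most the first four characters
                if y > 4:
                    break
                if (x < 'a' or x > 'z') and x not in '’':
                    s = False
                    break
                y = y + 1
            if s == False:
                words.remove(word)  # Removes from the list being iterated: a linear scan that also skips the next token
        print('words:', len(words), file=f)  # Write the total number of words
        for word in words:  # Build the frequency dict
            word_freq[word] = word_freq.get(word, 0) + 1
        for key in list(word_freq):  # Remove common (stop) words
            if key in stop_wred:  # stop_wred: the stop-word list originally defined in the .py file
                del word_freq[key]
        return word_freq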

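Changed to:

import re

with open("stop_word.txt", 'r') as st:
    st1 = st.read()  # Stop-word text; st1 is a string, so "key in st1" below is a substring test


def process_buffer(bvffer, f):
    if bvffer:
        word_freq = {}
        # Process the buffer bvffer: count each word's frequency and store it in the dict word_freq
        bvffer = bvffer.lower()  # Convert the text to lowercase
        for ch in "“‘!;,.?”":  # Replace Chinese and English punctuation with spaces
            bvffer = bvffer.replace(ch, " ")
        bvffer = bvffer.strip().split()  # strip() removes surrounding whitespace; split() splits on whitespace
        regex = r"^[a-z]{4}(\w)*"  # A word starts with at least four lowercase letters
        words = []
        for word in bvffer:  # Keep only tokens that satisfy the word rule
            if re.match(regex, word):
                words.append(word)
        print('words:', len(words), file=f)  # Write the total number of words
        for word in words:  # Build the frequency dict
            word_freq[word] = word_freq.get(word, 0) + 1
        for key in list(word_freq):  # Remove common (stop) words
            if key in st1:
                del word_freq[key]
        return word_freq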
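
One likely reason the second version is faster, offered here as an interpretation rather than a measurement from the project: the first version calls words.remove(word) while iterating over words. Each remove scans the list from the start, and mutating a list during iteration also skips the element that follows every removal. Appending accepted tokens to a new list avoids both problems. The sketch below uses made-up tokens, the regex rule for both variants, and hypothetical helper names.

```python
import re

WORD_RE = re.compile(r"^[a-z]{4}(\w)*")  # same rule as the second version


def filter_in_place(tokens):
    """Mimic the first approach: remove rejected tokens while iterating."""
    tokens = list(tokens)  # work on a copy so the caller's list is untouched
    for tok in tokens:
        if not WORD_RE.match(tok):
            tokens.remove(tok)  # linear scan, and shifts later items left
    return tokens


def filter_to_new_list(tokens):
    """Mimic the second approach: collect accepted tokens in a new list."""
    return [tok for tok in tokens if WORD_RE.match(tok)]


tokens = ["gone", "12", "34", "with", "the", "wind"]
print(filter_in_place(tokens))     # "34" survives because it was skipped
print(filter_to_new_list(tokens))  # only tokens that satisfy the rule
```

With the in-place variant the token after each removed one is never checked, so the reported word count can also be wrong, not just slow.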

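(1) Sorted by call count

[Image: profiler output sorted by call count]

(2) Sorted by execution time

[Image: profiler output sorted by execution time]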

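IV. Other

1. Time spent on pair programming (unit: hours)

This pair programming session took a little over 8 hours. Because we are not proficient in Python and had not encountered much of the material before, we were essentially learning as we wrote.

2. Pair programming photo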

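V. Retrospective and Summary

1. A brief account of how we discussed and decided on one issue during pair programming

At first the stop-word list was written directly in the .py file. For efficiency reasons, we ultimately read the stop-word list from a .txt file and use it to remove unwanted words.

2. Evaluations

(1) 周磊's evaluation of 俞林森: He is highly capable and spared no effort in our joint programming. Discussions with him gave me a lot of inspiration, and he pointed out shortcomings in the code. I hope to work with him again!

(2) 俞林森's evaluation of 周磊: 周磊 had prior experience with Python, whereas I am basically a beginner. He wrote most of the code this time, and I did my best not to hold him back.

3. Suggestions about the pairing process

Pair programming taught us a great deal. It has advantages and disadvantages, but the advantages far outweigh the drawbacks. Two people will inevitably disagree at times; what matters is how they coordinate, communicate, and adopt each other's ideas. If a team cannot handle such disagreements well, efficiency will not improve and the work will actually slow down. If the team coordinates well, two people are definitely stronger than one. One person's ideas are always limited, and when two people or a group work together, they may well spark new ideas.

4. Other:

We hope the next assignment involves a different programming language.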
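
A minimal sketch of that decision, assuming the stop-word file holds whitespace-separated words (the file format is not shown in the post). The function name load_stop_words and the sample data are illustrative. Reading the words into a set makes the membership check an exact-word test, whereas checking against the raw file text would be a substring test.

```python
import os
import tempfile


def load_stop_words(path):
    """Read whitespace-separated stop words from a text file into a set."""
    with open(path, "r", encoding="utf-8") as fh:
        return set(fh.read().split())


# Write a small temporary stop-word file for the demonstration.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("that with have this\n")
    stop_path = tmp.name

stop_words = load_stop_words(stop_path)
os.remove(stop_path)

word_freq = {"that": 9, "with": 7, "scarlett": 5, "hat": 2}
kept = {w: c for w, c in word_freq.items() if w not in stop_words}
print(kept)
```

Here hat is kept even though it is a substring of the stop word that, which a substring test against the raw text would have removed.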
