
Word Frequency Analysis of English Papers with Python

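When reading an English paper, the high-frequency words are often technical terms you may not know. It helps to analyze the paper's word frequencies first and learn the meaning of the most frequent words before reading, which makes the paper easier to follow. The rest of this post shows how to use Python to output how often each word appears in an English paper.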

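First, the paper has to be converted from PDF to txt format. Converting directly usually produces garbled text, so it is better to convert the PDF to Word format first and then copy the content into a plain-text file.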

The code below includes detailed comments.

# Word frequency analysis for a paper
# Convert the paper to plain-text (.txt) format before running this script

__author__ = 'Chen Hong'

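# Read the text file and collect all of its words in a list
def readtxt(filename):
    wordsL = []  # list that stores every word in the file
    # the with-statement closes the file even if reading fails
    with open(filename, 'r') as fr:
        for line in fr:
            # split() with no arguments splits on any whitespace
            # and never returns empty strings
            wordsL.extend(line.split())
    return wordsL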

# Count the frequency of every word and store it in a dictionary,
# then sort the entries by count from largest to smallest
def count(wordsL):
    wordsD = {}
    for x in wordsL:
        # skip the tokens we do not want to count
        if Judge(x):
            continue
        # the first occurrence starts at 1; later occurrences add 1
        if x not in wordsD:
            wordsD[x] = 1
        else:
            wordsD[x] += 1
    # sort the (word, count) pairs by count, largest first
    wordsInorder = sorted(wordsD.items(), key=lambda x: x[1], reverse=True)
    return wordsInorder
    
# Judge whether a token should be removed, such as a standalone
# punctuation mark or a single letter.
# You can modify this function to remove more tokens, such as numbers.
def Judge(word):
    # standalone punctuation and whitespace tokens
    punctList = [' ', '\t', '\n', ',', '.', ':', '?']
    # single letters that usually appear as variable names
    letterList = ['a', 'b', 'c', 'd', 'm', 'n', 'x', 'p', 't']
    # True means the token is skipped by count()
    return word in punctList or word in letterList


# Read the paper, count the words, and write the result to a file
filename = 'F:\\python\\Paper1.txt'
wordsL = readtxt(filename)
words = count(wordsL)
# one "word count" pair per line, most frequent word first
with open('F:\\python\\Words In Order_1.txt', 'w') as fw:
    for item in words:
        fw.write(item[0] + ' ' + str(item[1]) + '\n')
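
The script above splits only on whitespace, so a token such as "model," (with the comma attached) is counted separately from "model", and "The" is counted separately from "the". The following is a minimal sketch of one possible variant, not the original author's method: it assumes that lowercasing the text and extracting runs of letters with a regular expression is acceptable for your paper, and it uses `collections.Counter` from the standard library for the counting. The names `STOP_TOKENS`, `tokenize`, and `word_frequencies` are hypothetical and chosen only for this illustration.

```python
import re
from collections import Counter

# Hypothetical set of tokens to ignore (the single letters from Judge above);
# extend it as needed.
STOP_TOKENS = {'a', 'b', 'c', 'd', 'm', 'n', 'x', 'p', 't'}

def tokenize(text):
    """Lowercase the text and return runs of letters.

    Internal apostrophes and hyphens are kept (e.g. "isn't", "non-convex");
    digits and other punctuation are dropped.
    """
    return re.findall(r"[a-z]+(?:['-][a-z]+)*", text.lower())

def word_frequencies(text, stop_tokens=STOP_TOKENS):
    """Return (word, count) pairs sorted by count, largest first."""
    counts = Counter(w for w in tokenize(text) if w not in stop_tokens)
    return counts.most_common()

# Small in-memory example instead of a file on disk
sample = "The model converges. The model, however, is non-convex; x = 3."
for word, freq in word_frequencies(sample):
    print(word, freq)
```

To apply this to a real paper, read the whole file into a string (for example with `open(filename).read()`) and pass it to `word_frequencies`. Whether hyphenated terms should stay together or be split is a design choice that depends on the paper's vocabulary.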