
Getting Started with NLP (2): Exploring the Principles of TF-IDF


Introduction to TF-IDF

TF-IDF is a statistical method commonly used in NLP to evaluate how important a word is to one document within a collection or corpus; it is often used to extract features, i.e. keywords, from text. A word's importance increases in proportion to the number of times it appears in the document, but decreases in proportion to how often it appears across the corpus.
In NLP, TF-IDF is computed as follows:

\[tfidf = tf*idf.\]

Here, tf is the term frequency and idf is the inverse document frequency.
tf is the term frequency, i.e. how often a word occurs in a document: if a word appears i times in a document containing N words in total, then tf = i/N.
idf is the inverse document frequency: if the corpus contains n articles and a word appears in k of them, then its idf value is

\[idf=\log_{2}(\frac{n}{k}).\]

Of course, the exact idf formula varies slightly from place to place. For example, some implementations add 1 to the denominator k to prevent division by zero, and others add 1 to both the numerator and the denominator; these are smoothing techniques. In this article we stick with the original idf formula above, because it matches the formula used in gensim.
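
For example, the second variant mentioned above would compute (a common smoothed form; the exact convention differs between libraries)

\[idf_{smooth}=\log_{2}\left(\frac{n+1}{k+1}\right).\]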
Suppose the whole corpus contains D articles. Then the tf-idf value of word i in article j is

\[tfidf_{i,j}=tf_{i,j}\times idf_{i}=\frac{n_{i,j}}{\sum_{k}n_{k,j}}\times\log_{2}\left(\frac{D}{d_{i}}\right),\]

where n_{i,j} is the number of times word i occurs in article j (so the first factor is simply the term frequency of word i within article j) and d_i is the number of articles that contain word i.

That is how TF-IDF is computed.
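As a quick sanity check with made-up numbers: if a word occurs 4 times in a 100-word article and appears in 1 of the 3 articles in the corpus, then

\[tfidf = \frac{4}{100}\times\log_{2}\left(\frac{3}{1}\right)\approx 0.04\times 1.585\approx 0.063.\]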

The Texts and Their Preprocessing

We will use the following three sample texts:

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

These three passages introduce football, basketball, and volleyball respectively; together they make up our document collection.
Next comes the text preprocessing.
First we strip the newline characters from the text, then split it into sentences, tokenize, and remove punctuation. The complete Python code is as follows; the input parameter is the article text:

import nltk
import string

# Text preprocessing
# Function: split the text into sentences, tokenize, and drop punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # tokenization
            if word not in string.punctuation: # drop punctuation
                tokens.append(word)
    return tokens
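
If the NLTK tokenizer models and stopword list have not been downloaded yet, the calls above (and the stopword filtering below) will raise a LookupError. A one-time setup, assuming a standard NLTK installation, is:

import nltk

nltk.download('punkt')      # tokenizer models used by sent_tokenize / word_tokenize
nltk.download('stopwords')  # the English stopword list used below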

Next, we remove the common English stopwords and count the number of occurrences of each remaining word. The complete Python code is as follows; the input parameter is the article text:

from nltk.corpus import stopwords     # stopwords
from collections import Counter       # word counting

# Remove stopwords from the raw text and
# build a count dictionary, i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]    # drop stopwords
    count = Counter(filtered)
    return count

Taking text3 as an example, the resulting count dictionary is:

Counter({'ball': 4, 'net': 4, 'teammate': 3, 'returned': 2, 'bat': 2, 'court': 2, 'team': 2, 'across': 2, 'touches': 2, 'back': 2, 'players': 2, 'touch': 1, 'must': 1, 'usually': 1, 'side': 1, 'player': 1, 'area': 1, 'Volleyball': 1, 'hands': 1, 'may': 1, 'toward': 1, 'A': 1, 'third': 1, 'two': 1, 'six': 1, 'opposing': 1, 'within': 1, 'prevent': 1, 'allowed': 1, '’': 1, 'playing': 1, 'played': 1, 'volley': 1, 'surface—that': 1, 'volleys': 1, 'opponents': 1, 'use': 1, 'high': 1, 'teams': 1, 'bats': 1, 'To': 1, 'game': 1, 'make': 1, 'forth': 1, 'three': 1, 'trying': 1})
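
Note that capitalized tokens such as 'A' and 'To' survive the filtering because the NLTK stopword list is all lowercase, and the stray '’' token survives because string.punctuation only covers ASCII punctuation. This does not affect the discussion below, but if you want a cleaner vocabulary, one optional tweak (a minimal sketch, not part of the original pipeline) is to lowercase the tokens before filtering:

from collections import Counter
from nltk.corpus import stopwords

# Variant of make_count that lowercases tokens first (optional, not used below)
def make_count_lower(text):
    tokens = [w.lower() for w in get_tokens(text)]
    filtered = [w for w in tokens if w not in stopwords.words('english')]
    return Counter(filtered)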

TF-IDF in Gensim

After preprocessing, each of the three sample texts yields a count dictionary that records the number of occurrences of every word in that text. Below, we use the TF-IDF model already implemented in gensim to output the top three words by TF-IDF in each article, together with their tfidf values. The complete code is as follows:

from nltk.corpus import stopwords     # stopwords
from gensim import corpora, models, matutils

# training by gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]
    return filtered

# get the token list of each text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]
# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v:k for k,v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d"%(i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)    #type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(new_dict[num], round(score, 5)))

The output is as follows:

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888

The results match our expectations quite well: the football article yields the keywords football and rugby, the basketball article yields play and cm, and the volleyball article yields net and teammate.

Implementing the TF-IDF Model Yourself

With the understanding of the TF-IDF model gained above, we can also implement it ourselves, which is the best way to learn an algorithm!
Here is my implementation of TF-IDF (it continues from the preprocessing code above):

import math

# compute tf
def tf(word, count):
    return count[word] / sum(count.values())
# count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / (n_containing(word, count_list)))    # base-2 logarithm
# compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    # sorted_words = matutils.unitvec(sorted_words)
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

The output is as follows:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.30677
    Word: rugby, TF-IDF: 0.07669
    Word: known, TF-IDF: 0.05113
Top words in document 2
    Word: play, TF-IDF: 0.05283
    Word: inches, TF-IDF: 0.03522
    Word: worth, TF-IDF: 0.03522
Top words in document 3
    Word: net, TF-IDF: 0.10226
    Word: teammate, TF-IDF: 0.07669
    Word: across, TF-IDF: 0.05113

As we can see, the keywords extracted by this hand-rolled TF-IDF model agree with gensim's. (The last two basketball words differ only because several words share the same tfidf value, so which of them end up in the top three is essentially arbitrary.) There is one problem, though: the computed tfidf values themselves are different. Why is that?
Consulting the source code in which gensim computes tf-idf values (https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/models/tfidfmodel.py):

[Screenshots of gensim's tfidfmodel.py; the relevant lines show that, by default, the tf-idf vector of each document is normalized to unit length.]

In other words, gensim normalizes the resulting tf-idf vector, turning it into a unit vector. We therefore need to add this normalization step to the code above, as follows:

import numpy as np

# normalize the tf-idf vector to unit length (L2 norm)
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst)*np.array(lst)))
    unit_vector = [(item[0], item[1]/L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    sorted_words = unitvec(sorted_words)   # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

The output is as follows:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: shooting, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: back, TF-IDF: 0.22888

Now the output matches gensim's result exactly!
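
A related detail, sketched here under gensim's default settings (using corpus2 and models from the earlier snippet): TfidfModel accepts a normalize parameter, and passing normalize=False skips the unit-length step. Note, however, that gensim's default local weight is the raw term count rather than the count divided by the document length, so even the unnormalized gensim weights differ from our tf(word, count) * idf(word, count_list) values by a constant factor per document; the L2 normalization cancels exactly that factor, which is why the normalized results agree.

# Raw (unnormalized) gensim tf-idf weights, roughly count * idf per term
tfidf_raw = models.TfidfModel(corpus2, normalize=False)
print(tfidf_raw[corpus2[0]][:3])   # first three (token_id, weight) pairs of document 1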

Summary

Gensim is a well-known Python module for NLP; when you have the time, it is worth reading more of its source code! In later posts we will continue to look at other applications of TF-IDF. Comments and discussion are welcome.

Note: I have opened a WeChat official account, Python爬蟲與算法 (WeChat ID: easy_web_scrape). You are welcome to follow it.

The complete code for this article is as follows:

import nltk
import math
import string
from nltk.corpus import stopwords     # stopwords
from collections import Counter       # word counting
from gensim import corpora, models, matutils

text1 ="""
Football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal. 
Unqualified, the word football is understood to refer to whichever form of football is the most popular 
in the regional context in which the word appears. Sports commonly called football in certain places 
include association football (known as soccer in some countries); gridiron football (specifically American 
football or Canadian football); Australian rules football; rugby football (either rugby league or rugby union); 
and Gaelic football. These different variations of football are known as football codes.
"""

text2 = """
Basketball is a team sport in which two teams of five players, opposing one another on a rectangular court, 
compete with the primary objective of shooting a basketball (approximately 9.4 inches (24 cm) in diameter) 
through the defender's hoop (a basket 18 inches (46 cm) in diameter mounted 10 feet (3.048 m) high to a backboard 
at each end of the court) while preventing the opposing team from shooting through their own hoop. A field goal is 
worth two points, unless made from behind the three-point line, when it is worth three. After a foul, timed play stops 
and the player fouled or designated to shoot a technical foul is given one or more one-point free throws. The team with 
the most points at the end of the game wins, but if regulation play expires with the score tied, an additional period 
of play (overtime) is mandated.
"""

text3 = """
Volleyball, game played by two teams, usually of six players on a side, in which the players use their hands to bat a 
ball back and forth over a high net, trying to make the ball touch the court within the opponents’ playing area before 
it can be returned. To prevent this a player on the opposing team bats the ball up and toward a teammate before it touches 
the court surface—that teammate may then volley it back across the net or bat it to a third teammate who volleys it across 
the net. A team is allowed only three touches of the ball before it must be returned over the net.
"""

# Text preprocessing
# Function: split the text into sentences, tokenize, and drop punctuation
def get_tokens(text):
    text = text.replace('\n', '')
    sents = nltk.sent_tokenize(text)  # sentence segmentation
    tokens = []
    for sent in sents:
        for word in nltk.word_tokenize(sent):  # tokenization
            if word not in string.punctuation: # drop punctuation
                tokens.append(word)
    return tokens

# Remove stopwords from the raw text and
# build a count dictionary, i.e. the number of occurrences of each word
def make_count(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]    # drop stopwords
    count = Counter(filtered)
    return count

# compute tf
def tf(word, count):
    return count[word] / sum(count.values())
# count how many documents in count_list contain word
def n_containing(word, count_list):
    return sum(1 for count in count_list if word in count)

# compute idf
def idf(word, count_list):
    return math.log2(len(count_list) / (n_containing(word, count_list)))    # base-2 logarithm
# compute tf-idf
def tfidf(word, count, count_list):
    return tf(word, count) * idf(word, count_list)

import numpy as np

# normalize the tf-idf vector to unit length (L2 norm)
def unitvec(sorted_words):
    lst = [item[1] for item in sorted_words]
    L2Norm = math.sqrt(sum(np.array(lst)*np.array(lst)))
    unit_vector = [(item[0], item[1]/L2Norm) for item in sorted_words]
    return unit_vector

# TF-IDF test
count1, count2, count3 = make_count(text1), make_count(text2), make_count(text3)
countlist = [count1, count2, count3]
print("Training by original algorithm......\n")
for i, count in enumerate(countlist):
    print("Top words in document %d"%(i + 1))
    scores = {word: tfidf(word, count, countlist) for word in count}
    sorted_words = sorted(scores.items(), key=lambda x: x[1], reverse=True)    #type=list
    sorted_words = unitvec(sorted_words)   # normalize
    for word, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(word, round(score, 5)))

# training by gensim's TfidfModel
def get_words(text):
    tokens = get_tokens(text)
    filtered = [w for w in tokens if w not in stopwords.words('english')]
    return filtered

# get the token list of each text
count1, count2, count3 = get_words(text1), get_words(text2), get_words(text3)
countlist = [count1, count2, count3]
# training by TfidfModel in gensim
dictionary = corpora.Dictionary(countlist)
new_dict = {v:k for k,v in dictionary.token2id.items()}
corpus2 = [dictionary.doc2bow(count) for count in countlist]
tfidf2 = models.TfidfModel(corpus2)
corpus_tfidf = tfidf2[corpus2]

# output
print("\nTraining by gensim Tfidf Model.......\n")
for i, doc in enumerate(corpus_tfidf):
    print("Top words in document %d"%(i + 1))
    sorted_words = sorted(doc, key=lambda x: x[1], reverse=True)    #type=list
    for num, score in sorted_words[:3]:
        print("    Word: %s, TF-IDF: %s"%(new_dict[num], round(score, 5)))
        
"""
輸出結果:

Training by original algorithm......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: word, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: inches, TF-IDF: 0.19915
    Word: points, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: bat, TF-IDF: 0.22888

Training by gensim Tfidf Model.......

Top words in document 1
    Word: football, TF-IDF: 0.84766
    Word: rugby, TF-IDF: 0.21192
    Word: known, TF-IDF: 0.14128
Top words in document 2
    Word: play, TF-IDF: 0.29872
    Word: cm, TF-IDF: 0.19915
    Word: diameter, TF-IDF: 0.19915
Top words in document 3
    Word: net, TF-IDF: 0.45775
    Word: teammate, TF-IDF: 0.34331
    Word: across, TF-IDF: 0.22888
"""
