Python: extracting words from text and counting word frequency

These text operations come up often, so I am summarizing them here. More will be added over time.

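Operations:

strip_html(cls, text): remove HTML tags
separate_words(cls, text, min_lenth=3): extract words from text (lowercased, keeping only words longer than min_lenth characters)
get_words_frequency(cls, words_list): get the frequency of each word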

Source code:

import re


class DocProcess(object):

    @classmethod
    def strip_html(cls, text):
        """
            Delete html tags in text.
            text is String
        """
        new_text = " "
        is_html = False
        for character in text:
            if character == "<":
                # Entering a tag: skip characters until the closing ">".
                is_html = True
            elif character == ">":
                # Leaving a tag: add a space so adjacent words stay separated.
                is_html = False
                new_text += " "
            elif is_html is False:
                new_text += character
        return new_text

    @classmethod
    def separate_words(cls, text, min_lenth=3):
        """
            Separate text into words in list.
            Only words longer than min_lenth characters are kept.
        """
        splitter = re.compile("\\W+")
        return [s.lower() for s in splitter.split(text) if len(s) > min_lenth]

    @classmethod
    def get_words_frequency(cls, words_list):
        """
            Get frequency of words in words_list.
            return a dict.
        """
        num_words = {}
        for word in words_list:
            num_words[word] = num_words.get(word, 0) + 1
        return num_words