Python3: Extracting Keywords from Document Titles with jieba Segmentation and sklearn TF-IDF Weights

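Purpose: extract keywords from document titles.

jieba's built-in TF-IDF keyword extraction depends on a precomputed inverse document frequency (IDF) corpus such as idf.txt.big, which is loaded through jieba.analyse.set_idf_path. This example instead uses sklearn to turn the segmented titles into term vectors and computes the TF-IDF values from the documents in the collection itself.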

Step1: Import the required packages

import os
import sys
import time

import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

sys.path.append("../")
# Load a user dictionary (userdictML.txt) with extra terms for jieba
jieba.load_userdict('userdictML.txt')
# Words too generic to serve as keywords, plus file-extension residue
STOP_WORDS = set(("進展", "研究", "應用", "綜述", "方法", "方式", "問題", "分析", "基於", "論文", "面向", "txt", "."))

Step2: Get the list of file names

def getFileList(path):
    # Collect every entry in path, skipping hidden files (names starting with '.')
    filelist = [f for f in os.listdir(path) if not f.startswith('.')]
    return filelist, path
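
os.listdir returns entries in arbitrary order, and Step4 names each output file by the position of its title in this list, so a stable order makes the numbering reproducible. A minimal sketch, assuming a hypothetical helper list_visible_files and made-up file names, neither of which is part of the original script:

```python
import os
import tempfile


def list_visible_files(path):
    """Return the non-hidden entries of path in sorted order."""
    return sorted(f for f in os.listdir(path) if not f.startswith('.'))


# Build a throwaway directory to demonstrate the behavior (hypothetical file names)
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b_title.txt", "a_title.txt", ".DS_Store"):
        open(os.path.join(tmp, name), "w").close()
    visible = list_visible_files(tmp)

print(visible)  # ['a_title.txt', 'b_title.txt']
```

With a sorted list, index 0 always refers to the same title across runs.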

Step3: Segment each file name and save the result

def fenci(filename, segPath):
    # Create the folder that holds the segmentation results
    if not os.path.exists(segPath):
        os.mkdir(segPath)

    # Segment the title (the file name) with jieba
    seg_list = jieba.cut(filename)
    # Drop stop words and tokens shorter than two characters
    result = []
    for seg in seg_list:
        seg = ''.join(seg.split())
        if len(seg) >= 2 and seg.lower() not in STOP_WORDS:
            result.append(seg)

    # Join the tokens with spaces and save them locally
    with open(segPath + "/" + filename + "-seg.txt", "w") as f:
        f.write(' '.join(result))
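
The filtering rule in this step can be tried without jieba installed. The sketch below applies the same two conditions (at least two characters, not a stop word) to a hand-written token list; the helper name filter_tokens and the segmentation itself are assumptions for illustration, not real jieba output:

```python
STOP_WORDS = {"進展", "研究", "應用", "綜述", "方法", "方式", "問題",
              "分析", "基於", "論文", "面向", "txt", "."}


def filter_tokens(tokens):
    """Keep tokens of at least two characters that are not stop words."""
    kept = []
    for tok in tokens:
        tok = ''.join(tok.split())  # remove any whitespace inside the token
        if len(tok) >= 2 and tok.lower() not in STOP_WORDS:
            kept.append(tok)
    return kept


# Hand-written segmentation of one title, for illustration only
tokens = ["大", "資料", "下", "的", "機器學習", "演算法", "綜述", ".", "txt"]
print(' '.join(filter_tokens(tokens)))  # 資料 機器學習 演算法
```

Single-character tokens and the ".txt" residue of the file name are removed, which is why they never show up among the keywords.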

Step4: Compute the TF-IDF values with sklearn, keep the terms that reach the tfidfw threshold, sort them, and save the keywords and their TF-IDF values locally

# Read the segmented titles and compute TF-IDF with sklearn
def Tfidf(filelist, sFilePath, path, tfidfw):
    corpus = []
    for ff in filelist:
        # os.path.join works whether or not path ends with a separator
        fname = os.path.join(path, ff + "-seg.txt")
        with open(fname, 'r') as f:
            corpus.append(f.read())

    vectorizer = CountVectorizer()  # builds the term-count matrix: a[i][j] is the count of term j in document i
    transformer = TfidfTransformer()  # converts term counts into tf-idf weights
    # Inner fit_transform: text -> term-count matrix; outer fit_transform: counts -> tf-idf
    tfidf = transformer.fit_transform(vectorizer.fit_transform(corpus))
    # All terms in the bag-of-words vocabulary (get_feature_names() was removed in scikit-learn 1.2)
    word = vectorizer.get_feature_names_out()
    weight = tfidf.toarray()  # dense matrix: weight[i][j] is the tf-idf weight of term j in document i

    if not os.path.exists(sFilePath):
        os.mkdir(sFilePath)

    for i in range(len(weight)):
        print('----------writing all the tf-idf in the ', i, 'file into ', sFilePath + '/', i, ".txt----------")
        # Keep only the terms whose weight reaches the tfidfw threshold
        result = {}
        for j in range(len(word)):
            if weight[i][j] >= tfidfw:
                result[word[j]] = weight[i][j]
        # Sort by weight, highest first
        resultsort = sorted(result.items(), key=lambda item: item[1], reverse=True)
        with open(sFilePath + "/" + str(i) + ".txt", 'w') as f:
            for term, value in resultsort:
                line = term + " " + str(value)
                f.write(line + '\n')  # text mode already converts '\n' to the platform line ending
                print(line)
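
The effect of the tfidfw threshold is easiest to see on a single row. The following sketch isolates the filter-and-sort step; the helper name top_keywords and the threshold 0.55 are assumptions (the article does not state the value of tfidfw it used), and the weights are rounded values copied from the run output for the title 稀疏學習優化問題的求解綜述:

```python
def top_keywords(words, weights, threshold):
    """Return (word, weight) pairs with weight >= threshold, highest first."""
    kept = [(w, x) for w, x in zip(words, weights) if x >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Rounded weights for one title, copied from the run output
words = ["優化", "求解", "稀疏學習"]
weights = [0.5098, 0.6083, 0.6083]

print(top_keywords(words, weights, 0.0))   # all three terms, 優化 last
print(top_keywords(words, weights, 0.55))  # 優化 is filtered out
```

A higher threshold keeps only terms that are rare across the collection, while a threshold of 0 simply ranks every term of the title.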

The approach is tested on the eight titles that appear as file names in the output below.

Output:

D:\PyCharm\PycharmProjects\ML\Scripts\python.exe D:/PyCharm/PycharmProjects/filecutwords.py
Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\sxq\AppData\Local\Temp\jieba.cache
Loading model cost 0.535 seconds.
Prefix dict has been built succesfully.
Using jieba on 基於機器學習特性的資料中心能耗優化方法.txt
Using jieba on 基於深度學習網路的射線影象缺陷識別方法.txt
Using jieba on 大資料下的機器學習演算法綜述.txt
Using jieba on 李群機器學習十年研究進展.txt
Using jieba on 深度學習在手寫漢字識別中的應用綜述.txt
Using jieba on 稀疏學習優化問題的求解綜述.txt
Using jieba on 貝葉斯機器學習前沿進展綜述.txt
Using jieba on 面向自然語言處理的深度學習研究.txt
----------writing all the tf-idf in the  0 file into  ./tfidffile1540195000.2757373/ 0 .txt----------
資料中心 0.493598032622
特性 0.493598032622
能耗 0.493598032622
優化 0.413673674087
機器學習 0.312980890698
----------writing all the tf-idf in the  1 file into  ./tfidffile1540195000.2757373/ 1 .txt----------
影象 0.42551236292
射線 0.42551236292
缺陷 0.42551236292
網路 0.42551236292
識別方法 0.42551236292
深度學習 0.307727387491
----------writing all the tf-idf in the  2 file into  ./tfidffile1540195000.2757373/ 2 .txt----------
資料 0.645220633322
演算法 0.645220633322
機器學習 0.409121826197
----------writing all the tf-idf in the  3 file into  ./tfidffile1540195000.2757373/ 3 .txt----------
十年 0.645220633322
李群 0.645220633322
機器學習 0.409121826197
----------writing all the tf-idf in the  4 file into  ./tfidffile1540195000.2757373/ 4 .txt----------
手寫 0.532774235825
漢字 0.532774235825
識別 0.532774235825
深度學習 0.385298379083
----------writing all the tf-idf in the  5 file into  ./tfidffile1540195000.2757373/ 5 .txt----------
求解 0.608313154613
稀疏學習 0.608313154613
優化 0.509813899232
----------writing all the tf-idf in the  6 file into  ./tfidffile1540195000.2757373/ 6 .txt----------
前沿 0.645220633322
貝葉斯 0.645220633322
機器學習 0.409121826197
----------writing all the tf-idf in the  7 file into  ./tfidffile1540195000.2757373/ 7 .txt----------
處理 0.629565219678
自然語言 0.629565219678
深度學習 0.455296901311

Process finished with exit code 0
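
The weights above can be checked by hand. The sketch below recomputes them in pure Python, assuming scikit-learn's documented TfidfTransformer defaults (smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each row). The token lists are read off the run output, which assumes no token fell below the tfidfw threshold; the function name tfidf_rows is hypothetical:

```python
import math
from collections import Counter


def tfidf_rows(docs):
    """TF-IDF per document with smoothed idf and L2 normalization."""
    n = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        raw = {t: c * idf[t] for t, c in Counter(doc).items()}
        norm = math.sqrt(sum(v * v for v in raw.values()))
        rows.append({t: v / norm for t, v in raw.items()})
    return rows


# Token lists read off the run output (one list per title, in output order)
docs = [
    ["資料中心", "特性", "能耗", "優化", "機器學習"],
    ["影象", "射線", "缺陷", "網路", "識別方法", "深度學習"],
    ["資料", "演算法", "機器學習"],
    ["十年", "李群", "機器學習"],
    ["手寫", "漢字", "識別", "深度學習"],
    ["求解", "稀疏學習", "優化"],
    ["前沿", "貝葉斯", "機器學習"],
    ["處理", "自然語言", "深度學習"],
]

rows = tfidf_rows(docs)
print(round(rows[2]["資料"], 6), round(rows[2]["機器學習"], 6))  # 0.645221 0.409122
```

機器學習 occurs in four of the eight titles, so its idf is lower than that of a term such as 資料 that occurs in only one, which is why it ranks last within each title that contains it.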