
Chinese Text Clustering and Deduplication Based on doc2vec

  • Understand doc2vec
  • Data introduction
  • Train a model
  • Test the model
  • Cluster all the lyrics
  • Filter out the duplicates

1. Understand doc2vec [1]

doc2vec evolved from word2vec; in essence it learns a vector representation of an entire document. The model was proposed in 2014 by the Google scientists Quoc Le and Tomas Mikolov, and the paper was published at the International Conference on Machine Learning (ICML). word2vec training generally follows one of two strategies: (1) predict the target word from its surrounding context, which is the CBOW model used in this article; (2) predict the context from the current word, commonly known as skip-gram.

1.1 word2vec based on CBOW

[Figure: CBOW architecture, from reference [1]]
Given a text consisting of the words w_1, w_2, w_3, ..., w_T, the word2vec objective is to maximize the average log probability

$$\frac{1}{T} \sum_{t=k}^{T-k} \log p(w_t \mid w_{t-k}, \ldots, w_{t+k})$$
The objective is easy to understand: it is essentially an n-gram model with a given window size. What data is used to train the model? The input and output are simply pairs of [context, predicted word], and the model is trained on many such pairs. For example, given the sentence 秋天傍晚小街路面溼潤 ("an autumn evening, the small street's surface is damp"), suppose the segmentation result is [秋天, 傍晚, 小街, 路面, 溼潤]. If we want to predict 小街 with a window size of 2, the data used to train the model includes the pairs [秋天, 小街], [傍晚, 小街], [路面, 小街], and [溼潤, 小街].
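A minimal sketch (not from the original post) of generating such [context, predicted word] pairs from a segmented sentence, using a hypothetical helper context_pairs with a window size of 2:

# Generate [context, target] training pairs for CBOW-style training.
def context_pairs(words, window=2):
    pairs = []
    for t, target in enumerate(words):
        # every word within `window` positions of the target is a context word
        for c in range(max(0, t - window), min(len(words), t + window + 1)):
            if c != t:
                pairs.append((words[c], target))
    return pairs

print(context_pairs(["秋天", "傍晚", "小街", "路面", "溼潤"]))
# the pairs with target 小街 are (秋天, 小街), (傍晚, 小街), (路面, 小街), (溼潤, 小街)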
The predicted word can be any word in the vocabulary, so predicting a word can be viewed as a multi-class classification problem: if the vocabulary contains 1000 words, training word2vec is effectively a 1000-class classification problem. For this reason the word prediction step, i.e., the last layer of the neural network, uses a softmax.
A 1000-class classification problem solved directly with a linear output layer is clearly inefficient, and real vocabularies usually contain tens of thousands of words or more. The common remedy is to replace the output layer with a hierarchical softmax, which reduces the time complexity to O(log n); it is essentially a Huffman tree. For a detailed explanation, including negative sampling, see: https://blog.csdn.net/zhangxb35/article/details/74716245.
In fact, what word2vec is really after is not the output layer but the hidden layer. After training, the hidden-layer weight matrix is an N×M matrix, where N is the number of words and M is the embedding dimension; this matrix is what we actually want to learn. Training word2vec guarantees that words with similar meanings lie closer together in the vector space, which also avoids the bag-of-words model's failure to account for word order and semantics.
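As an illustration (a sketch on a toy corpus, not part of the original pipeline), the N×M matrix is directly accessible after training a gensim Word2Vec model (gensim >= 4.0 API):

from gensim.models import Word2Vec

# a toy corpus of pre-segmented sentences
sentences = [["秋天", "傍晚", "小街", "路面", "溼潤"],
             ["傍晚", "小街", "溼潤"]]
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1)

# the hidden-layer weight matrix: one M-dimensional row per vocabulary word
print(model.wv.vectors.shape)         # (N, M) = (vocabulary size, 100)
print(model.wv.most_similar("小街"))  # nearby words in the embedding space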

1.2 doc2vec based on CBOW

To obtain a vector representation of a document, the authors add one more input vector to the original model: the document id. The intuition is that predicting a word cannot be separated from the overall meaning of the document itself; whereas reading comprehension previously considered only the nearby context, we now also bring in the document's topic to understand a word. The model is shown in the figure below.
[Figure: the PV-DM model architecture, from reference [1]]
The algorithm consists of two steps:

  • On the existing documents, train the word vectors W, the softmax weights U, b, and the document vectors D.
  • When a new document arrives, inference keeps W, U, and b fixed and trains only D (as sketched below).
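In gensim these two steps map directly onto training and infer_vector. A minimal sketch, assuming the model file model_dm_lyrics.vec produced in Section 3 below (gensim >= 4.0 API):

from gensim.models.doc2vec import Doc2Vec

# step 1 happened at training time: W, U, b and D were all learned
model = Doc2Vec.load('model_dm_lyrics.vec')

# step 2: for an unseen document, W, U, b stay fixed and only its
# document vector is optimized by gradient descent
new_doc = ['秋天', '傍晚', '小街']    # a segmented new document
vec_d = model.infer_vector(new_doc)  # its 100-dimensional vector D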

2. Data introduction.

The dataset used in this article consists of about 140,000 Chinese song lyrics. The goal is to deduplicate identical lyrics shared across different artists (covers, adaptations, and so on). The approach is therefore: first train doc2vec, then cluster the texts, and finally compare and deduplicate lyrics within each cluster.
The data format is JSON; one record looks like this:

{
	"artist_id": 1,
	"author": "李宇春",
	"title": "Dance To The Music",
	"song_id": 119548568,
	"album_title": "1987我不知會遇見你",
	"album_id": 117854447,
	"language": "國語",
	"versions": NaN,
	"country": "內地",
	"area": 0.0,
	"file_duration": 215,
	"publishtime": "2014-7-30",
	"year": 2014.0,
	"month": 7.0,
	"date": 30.0,
	"publish_time": "2014/7/30",
	"hot": 2011,
	"listen_total": 378.0,
	"lyrics": {
		"status": "OK",
		"content": {
			"[00:02.00]": "Dance To The Music",
			"[00:04.00]": "作曲 張亞東",
			"[00:05.00]": "作詞:李宇春",
			"[00:06.00]": "演唱:李宇春",
			"[00:08.00]": "",
			"[00:16.57]": "Dance to the music",
			"[00:20.64]": "Dance to the music",
			"[00:24.64]": "Dance to the music",
			"[00:28.78]": "Dance to the music",
			"[00:33.11]": "音樂要開最大聲才夠酷",
			"[00:36.65]": "酷斃那些薰陶出的嚴肅",
			"[00:41.20]": "跳級跳槽不如先跳個舞",
			"[00:45.20]": "踩著節奏慢半拍也挺態度",
			"[00:49.17]": "Dance to the music (Come on! Have fun!)",
			"[00:53.11]": "Dance to the music (It's the right time!)",
			"[00:57.19]": "Dance to the music (Now is my)",
			"[01:01.21]": "Dance to the music (swing time!)",
			"[01:05.31]": "想去冒險就燒掉地圖",
			"[01:09.07]": "通往羅馬又不是隻有一條路",
			"[01:13.30]": "玩法自定分什麼勝負",
			"[01:17.31]": "開心唱歌才不一定非要照著譜",
			"[01:25.14]": "",
			"[01:38.16]": "偶爾弄丟了驚喜的生活",
			"[01:42.19]": "有沒有勁爆的八卦聽說",
			"[01:46.37]": "曲奇和巧克力打了個啵",
			"[01:50.38]": "別太當真我們只要趣多多",
			"[01:54.17]": "Dance to the music (Come on! Have fun!)",
			"[01:58.11]": "Dance to the music (It's the right time!)",
			"[02:02.26]": "Dance to the music (Now is my)",
			"[02:06.29]": "Dance to the music (swing time!)",
			"[02:10.40]": "想去冒險就燒掉地圖",
			"[02:14.22]": "通往羅馬又不是隻有一條路",
			"[02:18.46]": "玩法自定分什麼勝負",
			"[02:22.39]": "開心唱歌才不一定非要照著譜",
			"[02:25.96]": "Rap:",
			"[02:26.64]": "Talking all day is not cool",
			"[02:28.42]": "Noisy like a goose",
			"[02:30.51]": "Why don't you find something to do",
			"[02:32.51]": "Jump into the groove",
			"[02:34.65]": "This time I don't want to",
			"[02:36.66]": "want to be good",
			"[02:38.70]": "Please don't give me that look",
			"[02:40.72]": "I'm not in the mood",
			"[02:43.42]": "Dance to the music",
			"[02:47.05]": "Dance to the music",
			"[02:50.99]": "Dance to the music",
			"[02:54.84]": "Dance to the music",
			"[02:59.29]": "Dance to the music (Come on! Have fun!)",
			"[03:03.41]": "Dance to the music (It's the right time!)",
			"[03:07.34]": "Dance to the music (Now is my)",
			"[03:11.51]": "Dance to the music (swing time!)",
			"[03:22.88]": ""
		},
		"hash": 2625080655648660063
	}
}

3. Train a model.

Training doc2vec first requires word segmentation; this article uses the jieba segmenter.

def loadLyrics(self, filepath):
        """Read one JSON record per line; keep songs whose segmented lyrics are long enough."""
        dict_stat = {'ReadError': 0}
        lines = []
        with codecs.open(filepath, 'r', encoding='utf-8') as f:
            for line in f:
                try:
                    obj = json.loads(line)
                except ValueError:
                    dict_stat['ReadError'] += 1
                    continue
                # concatenate all timestamped lyric lines into one string
                lyrics_str = "".join(obj["lyrics"]["content"].values())
                # strip Latin letters, digits and common punctuation
                lyrics_str = re.sub("[a-zA-Z0-9!%\\[\\],。::\\-\"“”()(){}']", "", lyrics_str)
                lyrics_str = lyrics_str.replace(" ", "")
                word_list = list(jieba.cut(lyrics_str))
                # filter out songs with too few words in the lyrics
                if len(word_list) > 20:
                    obj["lyrics"]["seg"] = word_list
                    lines.append(obj)
        return lines
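A hypothetical usage sketch, assuming loadLyrics is a method of the DataManager class imported in the training script below:

dm = DataManager()
list_lyrics = dm.loadLyrics('lyrics1.json')
print(len(list_lyrics), "songs kept after filtering")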

Then train the model with gensim's doc2vec implementation, using a window size of 3 and a document-vector length of 100, and save the model after training.

# -*- coding: utf-8 -*-
import jieba
import numpy as np

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from datamanager import DataManager

def get_dataset():
    dm = DataManager()
    list_lyrics = dm.loadLyrics('lyrics1.json')
    x_train = []
    for i, lyric in enumerate(list_lyrics):
        # each document is tagged with its index into list_lyrics
        document = TaggedDocument(lyric["lyrics"]["seg"], tags=[i])
        x_train.append(document)
    return x_train

def getVecs(model, corpus, size):
    # collect the learned document vectors, one row per document
    vecs = [np.array(model.dv[z.tags[0]]).reshape(1, size) for z in corpus]
    return np.concatenate(vecs)

def train(x_train, size=100, epoch_num=70):
    # gensim >= 4.0 API: `vector_size` instead of the old `size` argument;
    # passing the corpus to the constructor builds the vocabulary and trains
    model_dm = Doc2Vec(x_train, min_count=1, window=3, vector_size=size,
                       sample=1e-3, negative=5, workers=4, epochs=epoch_num)
    model_dm.save('model_dm_lyrics.vec')
    return model_dm
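Putting it together (a usage sketch):

if __name__ == '__main__':
    x_train = get_dataset()
    model_dm = train(x_train)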

4. Test the model.

Given a segmentation result as input, we test the model's effectiveness by inspecting the most similar documents.

def test():
    model_dm = Doc2Vec.load("model_dm_lyrics.vec")
    x = get_dataset()  # tags returned by most_similar index into this list
    # the input must be a list of words: here, the segmented lyrics of 《曹操》
    test_text = "作詞:林秋離作曲:林俊杰不是英雄不讀三國若是英雄怎麼能不懂寂寞獨自走下長阪坡月光太溫柔曹操不囉唆一心要拿荊州用陰謀陽謀明說暗奪淡泊東漢末年分三國烽火連天不休兒女情長被亂世左右誰來煮酒爾虞我詐是三國說不清對與錯紛紛擾擾千百年以後一切又從頭不是英雄不讀三國若是英雄怎麼能不懂寂寞獨自走下長阪坡月光太溫柔曹操不囉唆一心要拿荊州用陰謀陽謀明說暗奪淡泊東漢末年分三國烽火連天不休兒女情長被亂世左右誰來煮酒爾虞我詐是三國說不清對與錯紛紛擾擾千百年以後一切又從頭"
    test_text = list(jieba.cut(test_text))
    inferred_vector_dm = model_dm.infer_vector(test_text)
    print(inferred_vector_dm)
    # the ten most similar training documents and their cosine similarities
    sims = model_dm.dv.most_similar([inferred_vector_dm], topn=10)
    return sims

The results are as follows:
[Figure: the ten most similar training documents and their similarity scores]

5. Cluster all the lyrics.

Using the trained doc2vec model, this article clusters the documents with k-means. One problem arises: how many clusters are appropriate? We run the clustering with different numbers of clusters, compute the SSE (sum of squared errors) for each, and plot the results, taking the elbow point of the curve as the appropriate number of clusters.

5.1 Choosing the number of clusters
from gensim.models.doc2vec import Doc2Vec
from sklearn.cluster import KMeans
from tqdm import tqdm
import pandas as pd
import numpy as np
import json
import codecs

# get_dataset is the function defined in the training script (Section 3)

def get_vectors():
    # infer a vector for every song and cache them on disk
    list_lyrics = get_dataset()
    infered_vectors_list = []
    print("load doc2vec model...")
    model_dm = Doc2Vec.load("model_dm_lyrics.vec")
    print("infer document vectors...")
    for lyric in list_lyrics:
        vector = model_dm.infer_vector(lyric["lyrics"]["seg"])
        infered_vectors_list.append(vector)
    vector_df = pd.DataFrame(np.array(infered_vectors_list))
    vector_df.to_csv("doc_vecs.csv", index=None, header=None)

def cluster(K):
    df = pd.read_csv("doc_vecs.csv", header=None)
    infered_vectors_list = df.values

    print("train kmeans model...")
    kmean_model = KMeans(n_clusters=K)
    labels = kmean_model.fit_predict(infered_vectors_list)
    sse = kmean_model.inertia_  # sum of squared distances to the closest centroid

    # attach the cluster id to every song and persist the result
    list_lyrics = get_dataset()
    lines = []
    for i, lyric in enumerate(list_lyrics):
        lyric["cluster_id"] = str(labels[i])
        lines.append(lyric)

    with codecs.open('lyrics_with_cluster.json', 'w', encoding='utf-8') as f:
        f.write('\n'.join([json.dumps(tmp, ensure_ascii=False) for tmp in lines]))

    return sse

if __name__ == '__main__':
    #get_vectors()
    sse_arr = []
    k_arr = []
    for k in tqdm(np.arange(10, 1000, 10)):
        sse_arr.append(cluster(int(k)))
        k_arr.append(int(k))
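A short plotting sketch (assuming the sse_arr and k_arr lists filled by the loop above) to draw the SSE curve and look for the elbow:

import matplotlib.pyplot as plt

plt.plot(k_arr, sse_arr, marker='o')
plt.xlabel('number of clusters K')
plt.ylabel('SSE (k-means inertia)')
plt.savefig('sse_vs_k.png')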
5.2 Experimental results

[Figure: SSE as a function of the number of clusters]
As the figure shows, the curve has an elbow between 200 and 300, so 300 was chosen as the final number of clusters.
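The final assignment can then be produced with a single call to the cluster function from 5.1:

cluster(300)  # writes lyrics_with_cluster.json with 300 clusters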

6. Filter out the duplicates.

For each song's lyrics within every cluster, compute its simhash value, then remove duplicate lyrics within the cluster by comparing fingerprints, as sketched below.
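The original post stops here; the following is a minimal sketch (not the author's code) of within-cluster simhash deduplication, assuming the lyrics_with_cluster.json file written in Section 5 and a hypothetical Hamming-distance threshold of 3:

import codecs
import hashlib
import json
from collections import defaultdict

def simhash(words, bits=64):
    """Compute a simhash fingerprint over a list of tokens."""
    v = [0] * bits
    for w in words:
        h = int(hashlib.md5(w.encode('utf-8')).hexdigest(), 16) & ((1 << bits) - 1)
        for i in range(bits):
            v[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count('1')

def dedup(filepath, threshold=3):
    # group songs by the cluster id assigned in Section 5
    clusters = defaultdict(list)
    with codecs.open(filepath, 'r', encoding='utf-8') as f:
        for line in f:
            obj = json.loads(line)
            clusters[obj["cluster_id"]].append(obj)
    kept = []
    for songs in clusters.values():
        seen = []  # fingerprints of songs already kept in this cluster
        for song in songs:
            fp = simhash(song["lyrics"]["seg"])
            # keep a song only if it is far from every kept fingerprint
            if all(hamming(fp, s) > threshold for s in seen):
                seen.append(fp)
                kept.append(song)
    return kept

kept = dedup('lyrics_with_cluster.json')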

[1] Quoc Le, Tomas Mikolov. Distributed Representations of Sentences and Documents. In Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.