Implementing the Paper Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks: Building the Dataset

阿新 • Published: 2018-11-21

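1. Dataset

This post uses the STS dataset. It covers all the data from 2012 to 2016 (the original post shows the directory listing in a figure here), and the all folder contains all the data from 2012 to 2015.

Each line of a data file is a triple, <similarity score, sentence 1, sentence 2>, with the fields separated by tabs.

In this implementation, all the data in the all folder serves as the training set and the 2016 files serve as the test set.

1.1 Reading the data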

The following code reads a single file:

import numpy as np

def load_one_sts(filename):
    """Read a single dataset file."""
    s0s = []
    s1s = []
    labels = []
    num_samples = 0
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            # rstrip() removes the given characters from the right end only
            # (whitespace by default); strip() removes them from both ends.
            data = line.rstrip()
            label, s0, s1 = data.split('\t')
            # Skip sentence pairs that have no similarity score.
            if label == '':
                continue
            else:
                # round() sends ties to the even neighbour: round(0.5) and
                # round(-0.5) both give 0, while round(1.5) gives 2.
                score = round(float(label))
                # The scores were checked to lie in 0-5, so the labels are
                # one-hot encoded with 6 classes.
                y = [0] * 6  # y is a plain list at this point
                y[score] = 1  # set the position that matches the score
                labels.append(np.array(y))  # each stored label is an array
                num_samples = len(labels)
                s0s.append(s0)
                s1s.append(s1)
    # The rest of this function is given further below.

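Note: the code above uses data = line.rstrip(). In the local files, some sentence pairs have no similarity score, so the score field is empty and the line starts with a tab. With data = line.strip(), that leading tab would be stripped as well, split('\t') would return only two fields, and the loop over the files would fail with ValueError: not enough values to unpack (expected 3, got 2).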
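The difference between the two calls can be sketched on a single line. This is a minimal illustration, not the post's code: parse_sts_line is a hypothetical helper and the two sample lines are made up.

```python
import numpy as np

NUM_CLASSES = 6  # scores 0..5, as stated in the post


def parse_sts_line(line):
    """Parse one 'score TAB sentence1 TAB sentence2' line (hypothetical helper).

    Returns (one_hot, s0, s1), or None when the score field is empty.
    """
    # rstrip() removes only trailing whitespace, so a leading tab
    # (empty score field) survives and split() still yields 3 fields.
    label, s0, s1 = line.rstrip().split('\t')
    if label == '':
        return None
    score = round(float(label))  # ties go to the even neighbour: round(2.5) == 2
    one_hot = np.zeros(NUM_CLASSES, dtype=int)
    one_hot[score] = 1
    return one_hot, s0, s1


unlabeled = "\tA man is walking.\tA man walks.\n"
labeled = "2.5\tA man is walking.\tA dog runs.\n"

print(parse_sts_line(unlabeled))       # None
print(parse_sts_line(labeled)[0])      # [0 0 1 0 0 0]

# strip() would also remove the leading tab, leaving only two fields.
print(len(unlabeled.strip().split('\t')))   # 2
```

With only two fields, the three-way unpacking is what raises the ValueError mentioned above.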

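At this point the similarity scores, the first sentences, and the second sentences are stored separately. The next step is to map every sentence to a sequence of id indices, which requires loading the GloVe model. The helper functions below load the GloVe model (they live in a separate file, embedding.py):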

import word2vec
import os
import shutil
from sys import platform
import numpy as np
import pandas as pd

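# Count the lines in the file, i.e. the number of words
def getFileLineNums(filename):
    count = 0
    # Use a context manager so the file is closed after counting.
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            count += 1
    return count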

# Open the word-vector file on Linux or Windows and add one line at the beginning
def prepend_line(infile, outfile, line):
    with open(infile, 'r', encoding='utf-8') as old:
        with open(outfile, 'w', encoding='utf-8') as new:
            new.write(str(line) + "\n")
            shutil.copyfileobj(old, new)

def prepend_slow(infile, outfile, line):
    with open(infile, 'r', encoding='utf-8') as fin:
        with open(outfile, 'w', encoding='utf-8') as fout:
            fout.write(line + "\n")
            # Copy line by line; a separate name avoids shadowing the header.
            for row in fin:
                fout.write(row)

def normalize_data(filename):
    """Generate a model file in the format that the word2vec tool can read."""
    num_lines = getFileLineNums(filename)
    model_file = 'glove_model_50d.txt'
    # Header line: "<number of words> <vector dimension>" (50 dimensions here).
    model_first_line = "{} {}".format(num_lines, 50)
    # Prepend the header line.
    if platform == "linux" or platform == "linux2":
        prepend_line(filename, model_file, model_first_line)
    else:
        prepend_slow(filename, model_file, model_first_line)
    print('The vector file has been normalized! From now on use the file', model_file)

def load_glove_model(glove_model_path):
    """Load the GloVe model and build the word ids and the embedding matrix."""
    normalize_data(glove_model_path)
    wv = word2vec.load('glove_model_50d.txt')
    print('GloVe model loaded!')
    vocab = wv.vocab
    word2id = pd.Series(range(1, len(vocab)+1), index=vocab)
    # Unknown words get id 0, which corresponds to row 0 of word_embedding
    word2id['<unk>'] = 0
    print('word2id built; unknown words use the <unk> token!')
    word_embedding = wv.vectors
    # Use the mean of all vectors as the embedding of unknown words
    word_mean = np.mean(word_embedding, axis=0)
    word_embedding = np.vstack([word_mean, word_embedding])
    print('Embedding matrix built!')

    return word2id, word_embedding
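To make the role of the extra first line concrete: the word2vec text format starts with a header of the form "<vocab_size> <dim>", which a raw GloVe text file lacks. The following is a minimal sketch under stated assumptions rather than the post's code: prepend_header is a hypothetical helper that infers the dimension from the first vector line instead of hard-coding 50, and the two-word vector file is made up.

```python
import os
import shutil
import tempfile


def prepend_header(glove_path, out_path):
    """Hypothetical helper: copy glove_path to out_path behind a header line."""
    vocab_size, dim = 0, None
    # First pass: count the lines and read the dimension off the first line.
    with open(glove_path, 'r', encoding='utf-8') as f:
        for line in f:
            if dim is None:
                # Each line is "word v1 v2 ... vd", separated by spaces.
                dim = len(line.rstrip().split(' ')) - 1
            vocab_size += 1
    # Second pass: write the header, then stream the original content.
    with open(glove_path, 'r', encoding='utf-8') as src, \
            open(out_path, 'w', encoding='utf-8') as dst:
        dst.write("{} {}\n".format(vocab_size, dim))
        shutil.copyfileobj(src, dst)
    return vocab_size, dim


# Tiny made-up vectors, only to show the resulting layout.
tmp_dir = tempfile.mkdtemp()
raw = os.path.join(tmp_dir, 'toy_glove.txt')
with open(raw, 'w', encoding='utf-8') as f:
    f.write("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

out = os.path.join(tmp_dir, 'toy_model.txt')
vocab_size, dim = prepend_header(raw, out)
with open(out, encoding='utf-8') as f:
    header = f.readline().strip()
print(header)   # 2 3
```

Inferring the dimension keeps the helper usable with vector files of other sizes, at the cost of one extra pass over the file.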

The official GloVe file format does not meet the requirements of the word2vec tool, so normalize_data() is used to normalize it, and unknown words are marked with '<unk>'. Calling load_glove_model() then returns word2id and word_embedding. With the embedding vectors available, the data read in the first step can be mapped to ids.

def get_id(word):
    """Look up the id of a word."""
    # word2id is the Series returned by load_glove_model(), defined at module level.
    if word in word2id:
        return word2id[word]
    else:
        return word2id['<unk>']

def seq2id(texts):
    """Clean a sentence and represent it as a sequence of ids."""
    texts = clean_text(texts)
    texts = texts.split(' ')
    texts_id = map(get_id, texts)
    return texts_id

def padding_sentence(s0, s1, padding_length):
    """Pad the sentences; padding_length is the length to pad to."""
    # Note: sentences longer than padding_length are not truncated here.
    sentence_num = len(s1)
    s0s = []
    s1s = []

    for s in s0:
        left = padding_length - len(s)
        pad = [0] * left
        s = list(s)
        s.extend(pad)
        s0s.append(np.array(s))
    for s in s1:
        left = padding_length - len(s)
        pad = [0] * left
        s = list(s)
        s.extend(pad)
        s1s.append(np.array(s))

    # print('%d sentences padded!' % sentence_num)
    return s0s, s1s

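The functions above are the helpers for mapping sentences, where seq2id() returns the id of every word in a sentence. In practice the sentences differ in length, which would make the tensors fed into the network differ in length too, so padding_sentence() is called to pad every sentence (a fixed length of 100 is used here; some implementations pad to the length of the longest sentence instead).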
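The mapping and padding steps can be sketched end to end on a toy vocabulary. This is an illustration under stated assumptions, not the post's code: the four-entry word2id dictionary and the helpers to_ids and pad_or_truncate are hypothetical. Unlike padding_sentence(), this sketch also cuts inputs that are longer than the target length, and it makes visible that the padding value 0 is the same id as '<unk>'.

```python
import numpy as np

# Toy vocabulary; in the post, word2id comes from load_glove_model().
word2id = {'<unk>': 0, 'a': 1, 'man': 2, 'walks': 3}


def to_ids(tokens):
    """Map tokens to ids, falling back to the <unk> id (0)."""
    return [word2id.get(tok, word2id['<unk>']) for tok in tokens]


def pad_or_truncate(ids, length, pad_id=0):
    """Return a fixed-length int array: cut long inputs, right-pad short ones."""
    ids = list(ids)[:length]
    return np.array(ids + [pad_id] * (length - len(ids)), dtype=int)


ids = to_ids('a man walks quickly'.split(' '))
print(ids)                         # [1, 2, 3, 0]
print(pad_or_truncate(ids, 6))     # [1 2 3 0 0 0]
print(pad_or_truncate(ids, 2))     # [1 2]
```

Because padded positions and unknown words share id 0, both would be looked up as row 0 of the embedding matrix, which in the post holds the mean vector.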

The data-cleaning function is shown below. It handles punctuation (such as !) and contractions (for example, You're becomes You 're):

import re

def clean_text(line):
    """Data-cleaning function."""
    # Replace meaningless single characters with a space
    line = re.sub(r'[^A-Za-z0-9(),!?.\'\`]', ' ', line)
    # Split word suffixes off with spaces
    line = re.sub(r'\'s', ' \'s ', line)
    line = re.sub(r'\'ve', ' \'ve ', line)
    line = re.sub(r'n\'t', ' n\'t ', line)
    line = re.sub(r'\'re', ' \'re ', line)
    line = re.sub(r'\'d', ' \'d ', line)
    line = re.sub(r'\'ll', ' \'ll ', line)
    # Split punctuation, parentheses and similar characters off with spaces
    line = re.sub(r',', ' , ', line)
    line = re.sub(r'!', ' ! ', line)
    # The replacement is a plain ' ? '; writing ' \? ' would keep the backslash
    line = re.sub(r'\?', ' ? ', line)
    line = re.sub(r'\(', ' ( ', line)
    line = re.sub(r'\)', ' ) ', line)
    # Collapse runs of whitespace into a single space
    line = re.sub(r'\s{2,}', ' ', line)
    return line.strip().lower()
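A detail that is easy to get wrong in rules like the ones above: in the replacement argument of re.sub, an escape of a non-letter character such as \? is left as-is, so the backslash ends up in the output. The short check below uses a made-up input string to show the difference.

```python
import re

# The backslash in the replacement survives, because \? is not a known
# escape in a replacement string and non-letter escapes are left alone.
with_backslash = re.sub(r'\?', r' \? ', 'why?')
# A plain replacement string gives the intended spacing.
plain = re.sub(r'\?', ' ? ', 'why?')

print(repr(with_backslash))   # 'why \\? '
print(repr(plain))            # 'why ? '
```

A token like \? would not match the plain ? entry of the vocabulary and would therefore be mapped to '<unk>'.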

With these helpers in place, continue editing the load_one_sts() function:

def load_one_sts(filename):
    """This code continues the first part above; together they form the complete function."""
    s0s_id = []
    for s0 in s0s:
        # seq2id() returns a map object, so materialize it as a list first
        s0_id = list(seq2id(s0))
        s0s_id.append(np.asarray(s0_id))

    s1s_id = []
    for s1 in s1s:
        s1_id = list(seq2id(s1))
        s1s_id.append(np.asarray(s1_id))

    # Pad the sentences to a length of 100
    s0_padding, s1_padding = padding_sentence(s0s_id, s1s_id, 100)
    return s0_padding, s1_padding, labels

That completes the reading of a single file. Next, all files under a given path are traversed and the resulting data is collected into s0, s1 and labels.

import os

def concat(data):
    """Concatenate the data from different files."""
    s0s = []
    s1s = []
    labels = []
    for s0, s1, label in data:
        s0s += s0
        s1s += s1
        labels += label
    s0s = np.asarray(s0s)
    s1s = np.asarray(s1s)
    labels = np.asarray(labels)
    return s0s, s1s, labels


def load_datasets(path):
    """Read the whole dataset."""
    files = []
    # List every file under path
    for dirpath, dirnames, filenames in os.walk(path):
        for filename in filenames:
            files.append(os.path.join(dirpath, filename))

    s0, s1, labels = concat([load_one_sts(file) for file in files])

    return ([s0, s1], labels)

This completes the reading of all data files under a given path.
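The two steps involved, walking a directory tree and merging per-file results, can be tried in isolation. The following is a minimal sketch, not the post's code: list_files and concat_lists are hypothetical helpers, and the temporary directory and toy tuples stand in for the STS folder and the output of load_one_sts().

```python
import os
import tempfile


def list_files(root):
    """Return every file path under root (recursively), in sorted order."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


def concat_lists(per_file):
    """Flatten [(s0s, s1s, labels), ...] into three long lists."""
    s0s, s1s, labels = [], [], []
    for s0, s1, label in per_file:
        s0s += s0
        s1s += s1
        labels += label
    return s0s, s1s, labels


# Build a small throwaway tree: root/a.txt and root/sub/b.txt.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'sub'))
for rel in ('a.txt', os.path.join('sub', 'b.txt')):
    with open(os.path.join(root, rel), 'w', encoding='utf-8') as f:
        f.write('placeholder\n')

files = list_files(root)
print(len(files))   # 2

merged = concat_lists([(['x'], ['y'], [1]), (['p', 'q'], ['r', 's'], [0, 1])])
print(merged)       # (['x', 'p', 'q'], ['y', 'r', 's'], [1, 0, 1])
```

Sorting the paths is a design choice of this sketch: os.walk does not guarantee an order, and a fixed order keeps the sample order reproducible between runs.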

To be on the safe side, test it by reading the training set and the test set:

print('Reading the training set ------->')
path = './sts/semeval-sts/all'
x_train, y_train = load_datasets(path)
print('Number of training samples:', len(y_train))

print('Reading the test set ------->')
path = './sts/semeval-sts/2016'
x_test, y_test = load_datasets(path)
print('Number of test samples:', len(y_test))

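The program prints the number of samples in the training set and in the test set (the original post shows the output as a screenshot here).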

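That completes reading and building the dataset. The next step is to build the MPCNN neural network model and apply its similarity formula to obtain similarity scores.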
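Before moving on to the model, a quick consistency check on the loaded arrays may be worthwhile. The sketch below is an illustration under stated assumptions, not part of the original code: check_dataset is a hypothetical helper, and the toy arrays stand in for what load_datasets() returns.

```python
import numpy as np


def check_dataset(x, y, sentence_length=100, num_classes=6):
    """Hypothetical helper: verify shapes and one-hot labels, return the sample count."""
    s0, s1 = x
    # Both sentence arrays must be (num_samples, sentence_length).
    assert s0.shape == s1.shape == (len(y), sentence_length)
    # Every label row must be a one-hot vector over the 6 score classes.
    assert y.shape == (len(y), num_classes)
    assert np.all(y.sum(axis=1) == 1)
    return len(y)


# Toy stand-ins for the arrays returned by load_datasets().
s0 = np.zeros((3, 100), dtype=int)
s1 = np.zeros((3, 100), dtype=int)
y = np.eye(6, dtype=int)[[0, 3, 5]]   # one-hot rows for scores 0, 3 and 5

print(check_dataset([s0, s1], y))   # 3
```

A check like this would flag the case where a sentence is longer than the padding length, since the sentence arrays could then no longer form a regular two-dimensional array.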
