A Collection of Concise NLP Code Snippets for Personal Use

Notes on concise code for common text-processing tasks.

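1. Quickly remove Chinese punctuation (read the file as UTF-8)

import re

def clean_str(string):
    # Replace every character outside the CJK range U+4E00-U+9FFF with a space;
    # this drops punctuation, and also Latin letters and digits.
    string = re.sub("[^\u4e00-\u9fff]", " ", string)
    string = re.sub(r"\s{2,}", " ", string)  # collapse runs of whitespace into one space
    return string.strip()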
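
As a quick check of what survives the cleaning, here is a small usage sketch. The sample string is made up for illustration and is not from the original notes.

```python
import re

def clean_str(string):
    # Keep only characters in U+4E00-U+9FFF; everything else becomes a space
    string = re.sub("[^\u4e00-\u9fff]", " ", string)
    # Collapse runs of whitespace into a single space
    string = re.sub(r"\s{2,}", " ", string)
    return string.strip()

sample = "你好,世界!Hello 123"  # illustrative input
cleaned = clean_str(sample)
print(cleaned)  # 你好 世界
```

The full-width comma and exclamation mark are replaced by spaces, and so is the ASCII text, which is worth keeping in mind for mixed Chinese and English corpora.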

2. Quick word segmentation, assuming one sample per line

import jieba

def seperate_line(string):
    # Join the jieba tokens with single spaces
    return ' '.join(jieba.cut(string))

with open("xxx", 'r', encoding="utf8") as f:
    lines = f.readlines()
lines = [clean_str(seperate_line(line)) for line in lines]

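3. Split lines so that each line holds one sentence

# str.replace returns a new string, so the result has to be kept
lines = [line.replace('\n','').replace(',','\n').replace('。','\n').replace('!','\n').replace('?','\n') for line in lines]
Then write the result back to the file.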
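
The same split can also be sketched with a single regular expression instead of chained `replace` calls. The function name `split_sentences` is hypothetical, and including the full-width delimiters is an assumption (they are common in Chinese text but are not in the snippet above). Writing the result back to a file is left out.

```python
import re

# Delimiters from the snippet above, plus full-width forms (an assumption)
SENTENCE_DELIMITERS = r"[,。!?,!?\n]"

def split_sentences(text):
    """Split text at sentence delimiters and drop empty pieces."""
    parts = re.split(SENTENCE_DELIMITERS, text)
    return [part.strip() for part in parts if part.strip()]

sentences = split_sentences("今天天氣好,我們出去玩。你去嗎?去!\n")
print("\n".join(sentences))  # one sentence per line
```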

4. Building the training set from the corpus

import numpy as np

def load_positive_negative_data_files(positive_data_file_path, negative_data_file_path):
    # read_and_clean_zh_file is not shown in these notes.
    # positive_example_lists: dimension 0 indexes the sentences in the sample set,
    # dimension 1 is each sentence as a string with words separated by spaces
    positive_example_lists = read_and_clean_zh_file(positive_data_file_path)
    # negative_example_lists: same form as above
    negative_example_lists = read_and_clean_zh_file(negative_data_file_path)
    # Combine data
    x_text = positive_example_lists + negative_example_lists
    # Generate labels
    positive_labels = [[1] for _ in positive_example_lists]
    negative_labels = [[0] for _ in negative_example_lists]
    y = np.concatenate([positive_labels, negative_labels], 0)
    return [x_text, y]
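
To make the shape of the labels concrete, here is a minimal sketch of the label-building step on its own. It takes already-loaded sentence lists instead of file paths; the function name `build_labels` and the sample sentences are hypothetical.

```python
import numpy as np

def build_labels(positive_examples, negative_examples):
    """Combine two lists of sentences and build a (n, 1) label array."""
    x_text = positive_examples + negative_examples
    positive_labels = [[1] for _ in positive_examples]
    negative_labels = [[0] for _ in negative_examples]
    # Stack the labels along axis 0: positives first, then negatives
    y = np.concatenate([positive_labels, negative_labels], 0)
    return x_text, y

x_text, y = build_labels(["很 好", "不錯"], ["太 差"])  # illustrative data
print(y.shape)  # (3, 1)
```

Note that `np.concatenate` needs both label lists to be non-empty here, since an empty list becomes a 1-D array that cannot be stacked with a 2-D one.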

5. Sentence padding

def padding_sentences(input_sentences, padding_token, padding_sentence_length=None):
    sentences = [sentence.split(' ') for sentence in input_sentences]
    if padding_sentence_length is not None:
        max_sentence_length = padding_sentence_length
    else:
        max_sentence_length = max(len(sentence) for sentence in sentences)
    for i, sentence in enumerate(sentences):
        if len(sentence) > max_sentence_length:
            sentences[i] = sentence[:max_sentence_length]  # truncate long sentences
        else:
            sentence.extend([padding_token] * (max_sentence_length - len(sentence)))  # pad short ones
    return (sentences, max_sentence_length)
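
A short usage sketch of the padding step, covering both the default length (the longest sentence) and an explicit length that forces truncation. The padding token `<PAD>` and the inputs are made up for illustration.

```python
def padding_sentences(input_sentences, padding_token, padding_sentence_length=None):
    sentences = [sentence.split(' ') for sentence in input_sentences]
    if padding_sentence_length is not None:
        max_sentence_length = padding_sentence_length
    else:
        max_sentence_length = max(len(sentence) for sentence in sentences)
    for i, sentence in enumerate(sentences):
        if len(sentence) > max_sentence_length:
            # Cut sentences that are too long
            sentences[i] = sentence[:max_sentence_length]
        else:
            # Fill short sentences up to the target length
            sentence.extend([padding_token] * (max_sentence_length - len(sentence)))
    return (sentences, max_sentence_length)

padded, length = padding_sentences(["a b c", "d"], "<PAD>")
print(padded, length)  # [['a', 'b', 'c'], ['d', '<PAD>', '<PAD>']] 3

truncated, fixed_length = padding_sentences(["a b c", "d"], "<PAD>", 2)
print(truncated, fixed_length)  # [['a', 'b'], ['d', '<PAD>']] 2
```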

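6. Getting word vectors from a trained gensim model

# Load the model first; w2vModel is the loaded gensim Word2Vec model.
# The original snippet ends in `return`, so it is wrapped in a function here;
# the function name is illustrative.
def sentences_to_vectors(sentences, w2vModel):
    all_vectors = []
    embeddingDim = w2vModel.vector_size
    embeddingUnknown = [0 for i in range(embeddingDim)]  # zero vector for unknown words
    for sentence in sentences:
        this_vector = []
        for word in sentence:
            if word in w2vModel.wv:  # `wv.vocab` was removed in gensim 4
                this_vector.append(w2vModel.wv[word])
            else:
                this_vector.append(embeddingUnknown)
        all_vectors.append(this_vector)
    return all_vectors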
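
The lookup-with-fallback logic can be tried without a trained model. In the sketch below a plain dict stands in for the gensim word vectors; this is an assumption made only so the example runs on its own, and the names `lookup_vectors` and `toy_vectors` are hypothetical.

```python
def lookup_vectors(sentences, vectors, embedding_dim):
    """Map each word to its vector, using a zero vector for unknown words."""
    unknown = [0.0] * embedding_dim
    all_vectors = []
    for sentence in sentences:
        # One vector per word, in order
        all_vectors.append([vectors[word] if word in vectors else unknown
                            for word in sentence])
    return all_vectors

# Toy 2-dimensional "embeddings" standing in for a trained model
toy_vectors = {"我": [0.1, 0.2], "喜歡": [0.3, 0.4]}
result = lookup_vectors([["我", "喜歡", "<PAD>"]], toy_vectors, 2)
print(result)  # [[[0.1, 0.2], [0.3, 0.4], [0.0, 0.0]]]
```

Padding tokens added in the previous step are not in the vocabulary, so they end up as zero vectors.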

7. Shuffling a NumPy array

import numpy as np

x=[0,1,2,3,4,5,6]
x=np.array(x)
np.random.seed(10)
shuffle_indices = np.random.permutation(np.arange(len(x)))
print(shuffle_indices)
x_shuffled = x[shuffle_indices]
print(x_shuffled)

Output
[2 6 0 3 4 5 1]
[2 6 0 3 4 5 1]
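
In practice the same permutation is usually applied to the samples and their labels so that the pairs stay aligned. A minimal sketch, with made-up arrays `x` and `y`:

```python
import numpy as np

x = np.array([[0, 0], [1, 1], [2, 2], [3, 3]])  # illustrative samples
y = np.array([0, 1, 2, 3])                      # illustrative labels

np.random.seed(10)
shuffle_indices = np.random.permutation(len(x))

# Index both arrays with the same permutation to keep rows and labels paired
x_shuffled = x[shuffle_indices]
y_shuffled = y[shuffle_indices]
print(x_shuffled[:, 0], y_shuffled)
```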

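8. Splitting the samples into a training set and a validation set

1. Shuffle the sample order (see the code above)
2. Cut the shuffled samples at the chosen ratio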
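
The two steps can be sketched as follows. The function name `train_validation_split`, the default ratio of 0.1, and the seed are assumptions for illustration; the notes do not give specific values.

```python
import numpy as np

def train_validation_split(x, y, validation_ratio=0.1, seed=10):
    """Shuffle x and y together, then hold out the last part for validation."""
    x = np.asarray(x)
    y = np.asarray(y)
    # Step 1: shuffle with one shared permutation
    rng = np.random.RandomState(seed)
    indices = rng.permutation(len(x))
    x_shuffled, y_shuffled = x[indices], y[indices]
    # Step 2: cut at the chosen ratio
    n_validation = int(len(x) * validation_ratio)
    split_at = len(x) - n_validation
    return (x_shuffled[:split_at], y_shuffled[:split_at],
            x_shuffled[split_at:], y_shuffled[split_at:])

x_all = np.arange(10)
y_all = x_all * 2
x_train, y_train, x_val, y_val = train_validation_split(x_all, y_all, validation_ratio=0.2)
print(len(x_train), len(x_val))  # 8 2
```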