python中使用Word2Vec多核技術進行新聞詞向量訓練
阿新 • • 發佈:2018-12-31
"""Train Word2Vec news word vectors using multiple CPU cores.

Pipeline: fetch the 20 Newsgroups corpus, split every post into cleaned,
tokenized sentences, then train a gensim Word2Vec model on the result and
query a few nearest-neighbor word lists.
"""
from sklearn.datasets import fetch_20newsgroups
from bs4 import BeautifulSoup
import nltk
import re
from gensim.models import word2vec

news = fetch_20newsgroups(subset='all')
X, y = news.data, news.target

# Load the punkt sentence tokenizer once at module level instead of once per
# post — the original reloaded it from disk on every call, which dominates
# preprocessing time over ~19k documents.
_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')


def news_to_sentences(news):
    """Split one news post into sentences and return them as token lists.

    Parameters
    ----------
    news : str
        Raw post text (may contain HTML markup).

    Returns
    -------
    list[list[str]]
        One inner list per sentence, containing lowercase alphabetic tokens.
    """
    # Passing the parser name explicitly ("html5lib") silences the bs4
    # "No parser was explicitly specified" UserWarning.
    news_text = BeautifulSoup(news, "html5lib").get_text()
    raw_sentences = _tokenizer.tokenize(news_text)
    sentences = []
    for sent in raw_sentences:
        # Replace every non-letter with a space, lowercase, then split on
        # whitespace so only alphabetic tokens remain.
        sentences.append(re.sub('[^a-zA-Z]', ' ', sent.lower().strip()).split())
    return sentences


sentences = []
for x in X:
    sentences += news_to_sentences(x)

# Word-vector dimensionality.
num_features = 300
# Ignore words whose total corpus frequency is below this threshold.
min_word_count = 20
# Number of worker threads — raise on machines with more CPU cores.
num_workers = 2
# Context window size for training.
context = 5
# Downsampling threshold for very frequent words.
downsampling = 1e-3

# NOTE(review): `size=` and `init_sims()` are the gensim 3.x API; gensim 4+
# renamed `size` to `vector_size`, moved `most_similar` to `model.wv`, and
# deprecated `init_sims` — confirm the installed gensim version.
model = word2vec.Word2Vec(sentences, workers=num_workers,
                          size=num_features, min_count=min_word_count,
                          window=context, sample=downsampling)
# L2-normalize the vectors in place to save memory (model becomes read-only).
model.init_sims(replace=True)

print(model.most_similar('hello'))
print(model.most_similar('email'))
print('end')
BeautifulSoup(news).get_text() 函式呼叫會出現警告資訊,Warning (from warnings module):
File "D:\Python35\lib\site-packages\bs4\__init__.py", line 181
markup_type=markup_type))
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html5lib"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 1 of the file <string>. To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP) to this: BeautifulSoup(YOUR_MARKUP, "html5lib")
加入html5lib引數,如下
BeautifulSoup(news,"html5lib").get_text() 輸出結果如下:
>>> print(model.most_similar('email'))
[('mail', 0.7399873733520508), ('contact', 0.6850252151489258), ('address', 0.6711879968643188), ('sas', 0.6611512303352356), ('replies', 0.6424497365951538), ('mailed', 0.6364169716835022), ('request', 0.6355448961257935), ('compuserve', 0.6323468685150146), ('send', 0.6153897047042847), ('internet', 0.59690260887146)]
>>> print(model.most_similar('hello'))
[('hi', 0.8492101430892944), ('netters', 0.6953952312469482), ('pl', 0.6211292147636414), ('dear', 0.5891242027282715), ('nh', 0.5402401685714722), ('scotia', 0.5400180220603943), ('tin', 0.5357101559638977), ('elm', 0.5321102142333984), ('greetings', 0.5246435403823853), ('hanover', 0.5063780546188354)]
>>>