Comparing text similarity with Python

阿新 • Published: 2019-02-14

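Correction: the book "Machine Learning System Design" (機器學習系統設計) computes the Euclidean norm with scipy's linalg.norm. In practice I had to use linalg.norm from numpy instead. Of course, it may simply be that the scipy package I downloaded differs from the one used in the book.

One way to measure text similarity is the Levenshtein distance, also called edit distance: the minimum number of edits (inserting, deleting, or replacing a character) needed to turn one word into another.

An alternative to edit distance is the bag-of-words approach, which is based on word-frequency counts.

------------------------------------------------------------
The rest of this post is practice code for the package used to count word frequencies.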
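Before the word-count code, here is a minimal sketch of the edit distance mentioned above, written in plain Python with the usual dynamic-programming table. The function name levenshtein and the sample words are my own choices for illustration and do not come from the book or from any library.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits turning a into b."""
    # previous[j] = distance between the prefix of a processed so far
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning i characters of a into an empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # replace ca with cb (or keep it)
            ))
        previous = current
    return previous[-1]


# hypothetical sample words, chosen only for illustration
print(levenshtein("machine", "mchiene"))
print(levenshtein("kitten", "sitting"))
```

Edit distance works character by character, so it says little about two long posts that use the same words in a different order. That is the case the word-count approach is meant to handle.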
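The printed form of a vectorizer, showing its default parameters:

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

Word frequencies can be counted with this class:

from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(min_df=1)

The min_df parameter sets the minimum document frequency: words that appear in fewer than min_df documents are dropped. With min_df=1, nothing is dropped.

------------------------------------------------------------
Experiment:

vectorizer = CountVectorizer(min_df=1)
# print(vectorizer)
contex = [r'how to format my hard disk', r'hard disk format problems']
X = vectorizer.fit_transform(contex)
# newer scikit-learn versions rename this method to get_feature_names_out()
print(vectorizer.get_feature_names())
# transposed so that each row is a word and each column is a document
print(X.toarray().transpose())

++++++++++++++++++++++++++++
Result:

[u'disk', u'format', u'hard', u'how', u'my', u'problems', u'to']
[[1 1]
 [1 1]
 [1 1]
 [1 0]
 [1 0]
 [0 1]
 [1 0]]

------------------------------------------------------------
Reading the files (see the file methods), using the os package:

import os

DIRs = r'D:\workspace\bulid_ML_system\src\first\toy'
# one string per file in the directory
post = [open(os.path.join(DIRs, f)).read() for f in os.listdir(DIRs)]

------------------------------------------------------------
Computing the Euclidean norm:

import numpy as np

def dist_raw(v1, v2):
    # v1 and v2 are sparse count vectors; their difference is made dense first
    delta = v1 - v2
    return np.linalg.norm(delta.toarray())

------------------------------------------------------------
Finding the most similar text by measuring distance:

# The setup is not shown here. This assumes contex is the list of texts,
# x_train = vectorizer.fit_transform(contex), num_sample = x_train.shape[0],
# new_contex is the query text and
# new_contex_vec = vectorizer.transform([new_contex]).
best_dist = float('inf')  # the original used sys.maxint, which exists only in Python 2
best_i = None
# print(range(0, num_sample))
for i in range(0, num_sample):
    contexs = contex[i]
    if contexs == new_contex:
        continue  # skip a text identical to the query
    cocnt_vec = x_train.getrow(i)
    d = dist_raw(cocnt_vec, new_contex_vec)
    print('=====%i=======%.2f====:%s' % (i, d, contexs))
    if d < best_dist:
        best_dist = d
        best_i = i
print('best=%i====is===%.2f' % (best_i, best_dist))

------------------------------------------------------------
Removing unimportant words: set the stop_words parameter, for example vectorize = CountVectorizer(min_df=1, stop_words=['interesting']). It accepts a list, or the string 'english'. With 'english' it filters out 318 common English words (generally words that occur very often but carry little real meaning).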
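To make the effect of a stop list concrete, here is a rough pure-Python sketch of how a vocabulary shrinks once some words are excluded. It is not scikit-learn's implementation: the function name vocabulary and the three-word stop list are assumptions for illustration only, and the regular expression simply copies the default token_pattern shown in the CountVectorizer output earlier.

```python
import re

# Same pattern as the default token_pattern printed earlier:
# tokens are runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def vocabulary(texts, stop_words=()):
    """Return the sorted vocabulary of texts, skipping anything in stop_words."""
    stop = set(stop_words)
    words = set()
    for text in texts:
        for token in TOKEN_PATTERN.findall(text.lower()):
            if token not in stop:
                words.add(token)
    return sorted(words)


posts = ["how to format my hard disk", "hard disk format problems"]
print(vocabulary(posts))
# hypothetical hand-picked stop list, far shorter than the built-in English one
print(vocabulary(posts, stop_words=["how", "to", "my"]))
```

The first call should list the same seven words as the experiment above. The second keeps only the words that are not in the stop list.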

The vocabulary before and after setting stop_words (judging by the words removed, with 'english'):

[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases',
 u'images', u'imaging', u'interesting', u'is', u'it', u'learning', u'machine',
 u'most', u'much', u'not', u'permanently', u'post', u'provide', u'safe',
 u'storage', u'store', u'stuff', u'this', u'toy']

[u'actually', u'capabilities', u'contains', u'data', u'databases', u'images',
 u'imaging', u'interesting', u'learning', u'machine', u'permanently', u'post',
 u'provide', u'safe', u'storage', u'store', u'stuff', u'toy']

The result of passing a list to exclude 'interesting':

[u'about', u'actually', u'capabilities', u'contains', u'data', u'databases',
 u'images', u'imaging', u'is', u'it', u'learning', u'machine', u'most',
 u'much', u'not', u'permanently', u'post', u'provide', u'safe', u'storage',
 u'store', u'stuff', u'this', u'toy']
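As a recap, the steps in this post (count the words, take the Euclidean norm of the difference, keep the closest text) can be sketched without scikit-learn. This is only an illustration under my own assumptions: the helper names count_vector, euclidean and closest, the sample texts and the query are all made up here. Unlike vectorizer.transform, this version also counts query words that never occur in the stored texts.

```python
import math
import re
from collections import Counter


def count_vector(text):
    """Bag of words: map each lowercase token (2+ word characters) to its count."""
    return Counter(re.findall(r"\b\w\w+\b", text.lower()))


def euclidean(c1, c2):
    """Euclidean norm of the difference between two count vectors."""
    words = set(c1) | set(c2)
    # a Counter returns 0 for words it has not seen
    return math.sqrt(sum((c1[w] - c2[w]) ** 2 for w in words))


def closest(query, texts):
    """Return (index, distance) of the text nearest to query, skipping exact copies."""
    query_vec = count_vector(query)
    best_i, best_dist = None, float("inf")
    for i, text in enumerate(texts):
        if text == query:
            continue
        d = euclidean(count_vector(text), query_vec)
        if d < best_dist:
            best_i, best_dist = i, d
    return best_i, best_dist


# hypothetical sample data
texts = ["how to format my hard disk", "hard disk format problems"]
print(closest("hard disk problems", texts))
```

Raw counts favour texts of similar length, so a long post that repeats the query words many times can end up farther away than a short unrelated one.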
