Converting between gensim and numpy arrays

阿新 • Published: 2018-03-20

Purpose

Convert the format that gensim outputs into numpy arrays, so that it can be used as input to scikit-learn and tensorflow.

Implementation

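A new stop-word list is built by merging the stop words from the nltk library with material collected online, and it is used to filter stop words out of the documents. Digits and special punctuation are also removed, and finally all letters are converted to lowercase.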
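
The cleaning step described above can be sketched roughly as follows. This is a minimal illustration rather than the author's actual code: `STOPWORDS` and `tokenize` are hypothetical names, and a tiny hand-written set stands in for the merged list (nltk stop words plus the collected material) so that the snippet runs without downloading any corpus.

```python
import re

# Hypothetical stand-in for the merged stop-word list. In practice this would be
# the union of the nltk English stop words and a custom list.
STOPWORDS = {"i", "am", "in", "the", "of", "a", "who", "so", "never",
             "let", "alone", "said", "quite", "currently"}

def tokenize(text, stopwords=STOPWORDS):
    """Lowercase the text, keep alphabetic runs only, and drop stop words."""
    # [a-z]+ discards digits and punctuation in one pass.
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in stopwords]

tokens = tokenize("I am currently in the throes of a hay fever attack.")
print(tokens)
```

With a different stop-word list the surviving tokens will differ, so this toy output is not expected to match the author's result exactly.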

Here is the original text:

Subject: Re: Candida(yeast) Bloom, Fact or Fiction
From: [email protected] (Pat Churchill)
Organization: Actrix Networks
Lines: 17

I am currently in the throes of a hay fever attack. SO who certainly
never reads Usenet, let alone Sci.med, said quite spontaneously "
There are a lot of mushrooms and toadstools out on the lawn at the
moment. Sure that's not your problem?"

Well, who knows? Or maybe it's the sourdough bread I bake?

After reading learned, semi-learned, possibly ignorant and downright
ludicrous stuff in this thread, I am about ready to believe anything :-)

If the hayfever gets any worse, maybe I will cook those toadstools...

--
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The floggings will continue until morale improves
[email protected] Pat Churchill, Wellington New Zealand
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

Below is the result after tokenization:
['subject', 'bloom', 'fiction', 'pat', 'organization', 'networks', 'lines', 'hay', 'fever', 'attack', 'reads', 'usenet', 'lot', 'lawn', 'moment', 'bread', 'reading', 'learned', 'ignorant', 'stuff', 'thread', 'ready', 'worse', 'cook', 'continue', 'pat', 'wellington', 'zealand']

Using the gensim toolkit, the tokens are converted into the bag-of-words model as follows:

[(17, 1.0), (23, 1.0), (32, 1.0), (381, 1.0), (536, 1.0), (768, 1.0), (776, 1.0), (877, 1.0), (950, 1.0), (1152, 1.0), (1195, 1.0), (1389, 1.0), (1548, 1.0), (1577, 1.0), (1682, 2.0), (1861, 1.0), (2041, 1.0), (3098, 1.0), (3551, 1.0), (3886, 1.0), (5041, 1.0), (5148, 1.0), (5149, 1.0), (8494, 1.0), (8534, 1.0), (9972, 1.0), (11608, 1.0)]

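The gensim format above cannot be used directly as input to scikit-learn or tensorflow. It has to be converted with a utility function from the gensim package, after which it looks like this: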
[ 0. 0. 0. ..., 0. 0. 0.]
```
import gensim
# corpus2dense is the key call; .T puts documents in rows and terms in columns
custom_train_matrix = gensim.matutils.corpus2dense(custom_train_corpus, num_terms=len_custom_dict).T
custom_test_matrix = gensim.matutils.corpus2dense(custom_test_corpus, num_terms=len_custom_dict).T
```

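The output above is already a numpy array and can be used as input to scikit-learn and tensorflow. We then apply tf-idf to raise the weight of important words and lower the weight of common ones. Converting the output above with the scikit-learn package gives: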
(0, 11608) 0.179650698812
(0, 9972) 0.235827148023
(0, 8534) 0.208306524508
...................
(0, 381) 0.119518580927
(0, 32) 0.034904390394
(0, 23) 0.0374215245774
(0, 17) 0.0349485731726
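
The article does not show the tf-idf call itself. One plausible way to get output of this shape is scikit-learn's `TfidfTransformer`, sketched below on a toy count matrix (hypothetical data). Whether the author used this class or another scikit-learn utility is an assumption.

```python
import numpy as np
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer

# Toy document-term count matrix: rows are documents, columns are terms (hypothetical data).
counts = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
])

# Fit the idf weights on the training matrix and re-weight it in one step.
tfidf = TfidfTransformer()
train_tfidf = tfidf.fit_transform(counts)

# The result is a SciPy sparse matrix; printing it lists (row, column) value entries.
print(sparse.issparse(train_tfidf))  # True
print(train_tfidf)
```

A test matrix would be re-weighted with `tfidf.transform(...)` so that it reuses the idf weights learned from the training data.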

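The output above (a sparse matrix) can already be used as input data for scikit-learn. To use it as input to tensorflow, it still needs to be converted to a numpy array:
```
custom_train_matrix = custom_train_matrix.toarray()  # convert the sparse matrix to a numpy array
custom_test_matrix = custom_test_matrix.toarray()  # convert the sparse matrix to a numpy array
```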
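
As a quick check of what `toarray()` does, the sketch below builds a small SciPy sparse matrix by hand (hypothetical values standing in for the tf-idf output) and converts it. Keep in mind that the dense result stores every zero explicitly, so for a large vocabulary it can take far more memory than the sparse form.

```python
import numpy as np
from scipy import sparse

# Toy sparse matrix standing in for the tf-idf output (hypothetical values).
sparse_matrix = sparse.csr_matrix(np.array([
    [0.0, 0.5, 0.0],
    [0.7, 0.0, 0.3],
]))

# toarray() materializes every entry, zeros included, as a regular numpy array.
dense_matrix = sparse_matrix.toarray()

print(type(dense_matrix).__name__)  # ndarray
print(dense_matrix.shape)           # (2, 3)
```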
