3h: Chinese preprocessing 4: sentence splitting; preprocessing 8: punctuation cleaning; preprocessing 12: stop-word removal

阿新 • Published: 2019-01-29

0. Reading and processing a file line by line

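# encoding: UTF-8
# Read the input one line at a time and append each processed line to the output file.
# The with statement closes both files even if an error occurs.
with open('E:\\dataMining\\data.txt', encoding='utf-8') as fileBefPro, \
        open('E:\\dataMining\\after.txt', 'a', encoding='utf-8') as fileAftPro:
    for line in fileBefPro:  # process each line as soon as it is read
        # do the per-line processing here
        fileAftPro.write(line)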

-------------------------------------------------------------------------------------------
1. Sentence splitting
This step uses split from the re module.
Using Chinese delimiters with split: https://segmentfault.com/q/1010000002461248

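# encoding: UTF-8
import re

# The sample title stays in Chinese because it is the data being split.
# It is named text rather than str so the built-in str is not shadowed.
text = u"【紅豆杉】紅豆杉作用與功效_紅豆杉抗癌藥品-健客網"

# Split on any of the delimiters 【 】 - _
for i in re.split(u'【|】|-|_', text):
    print(i)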
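
The same idea extends from title delimiters to actual sentence boundaries. The following is a minimal sketch, assuming that sentences end in 。！？ (or their ASCII forms); the helper name split_sentences and the terminator set are illustrative and do not come from the original post.

```python
import re

# Assumed terminator set: full-width and ASCII sentence-ending punctuation.
# A sentence is a run of non-terminators followed by any trailing terminators.
_SENTENCE = re.compile(u'[^。！？!?]+[。！？!?]*')

def split_sentences(text):
    # findall keeps each terminator attached to the sentence it closes.
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]

for sentence in split_sentences(u"今天天氣很好。你要出門嗎？我不想出門！"):
    print(sentence)
```

Using findall instead of split keeps the punctuation with its sentence, which is convenient if the punctuation is only stripped in a later step.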

-------------------------------------------------------------------------------------------
2. Removing Chinese punctuation: http://blog.csdn.net/mach_learn/article/details/41744487

# encoding: UTF-8
import re

# The sample text stays in Chinese because it is the data being cleaned.
temp = u"想要把一大段中文文字中所有的標點符號刪除掉，然後分詞製作語料庫使用，大神們有沒有辦法呢？或者哪位大神有中文語料庫給個連結好不好？我想做新聞的文字相似度分析，提取關鍵詞的時候需要語料庫。謝謝大神們~~~~~ "
# First character class: whitespace and ASCII punctuation.
# Second character class: full-width (Chinese) punctuation.
string = re.sub(r"[\s+\.\!\/_,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+", u"", temp)
print(string)

Alternatively, use one of the three methods described at: http://www.itstrike.cn/Question/5860b8a2-6c44-44f4-8726-c5a7603d44cc.html
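
A different way to avoid listing every punctuation mark by hand is to rely on Unicode character categories. This is only a sketch of that alternative, not necessarily one of the methods from the linked page; the helper name strip_punctuation is hypothetical.

```python
import unicodedata

def strip_punctuation(text):
    # Drop every character whose Unicode general category starts with "P"
    # (punctuation); this covers both ASCII and full-width marks.
    return u"".join(ch for ch in text
                    if not unicodedata.category(ch).startswith("P"))

print(strip_punctuation(u"你好，世界！Hello, world."))
```

Characters such as ~, $, ￥ and + belong to the symbol categories ("S"), not punctuation, so this sketch leaves them in place; the regular expression approach is still needed if those must go too.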

--------------------------------------------------------------------------------------------
3. Stop-word removal: http://blog.sina.com.cn/s/blog_bccfcaf90101ell5.html
http://blog.csdn.net/sanqima/article/details/50965439 Installing the Jieba Chinese word segmentation component in Python

# encoding: UTF-8
import jieba

# Load the stop words (one per line) into a set for fast membership tests.
with open('E:\\dataMining\\chinese_stopword.txt', encoding='utf-8') as f:
    stopwords = set(line.rstrip() for line in f)

# cut_all=False selects jieba's accurate mode.
segs = jieba.cut('聽說你超級喜歡萬眾掘金小遊戲啊啊啊,或者萬一你不喜歡我咧', cut_all=False)
final = ''
for seg in segs:
    if seg not in stopwords:  # keep only tokens that are not stop words
        final += seg
print(final)
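
The filtering step itself does not depend on jieba, so it can be isolated and checked on its own. A minimal sketch, assuming the tokens have already been produced by a segmenter such as jieba.cut; the helper name remove_stopwords and the tiny stop-word set are made up for illustration.

```python
def remove_stopwords(tokens, stopwords):
    # Keep tokens that are neither stop words nor pure whitespace.
    return [tok for tok in tokens if tok.strip() and tok not in stopwords]

# Hypothetical data: a three-word stop list and a hand-segmented sentence.
stopwords = {u"的", u"了", u"啊"}
tokens = [u"我", u"喜歡", u"的", u"遊戲", u"啊"]
print(u" ".join(remove_stopwords(tokens, stopwords)))
```

Holding the stop words in a set keeps each lookup constant-time, which matters once the list grows to the thousands of entries typical of published stop-word files.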

----------------------------------------------------------------------------------------------
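So far we have only processed a single file.
The question now: what if we need to process multiple files?

Putting everything together, the code is as follows: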

# encoding: UTF-8
import os
import re
import shutil
import jieba

# Read every file, then split, clean and segment it line by line
def read_file_cut():
    # stop words, one per line
    with open('E:\\dataMining\\chinese_stopword.txt', encoding='utf-8') as f:
        stopwords = set(line.strip() for line in f)
    # directory holding the files to process
    path = "E:\\dataMining\\data\\"
    # directory the processed files are written to
    respath = "E:\\dataMining\\result\\"
    if os.path.isdir(respath):  # if the result directory already exists
        # remove it recursively; os.removedirs(respath) cannot delete a non-empty directory
        shutil.rmtree(respath, True)
    os.makedirs(respath)  # recreate the result directory

    # number of source files; they are assumed to be named 1.txt, 2.txt, ..., total.txt
    total = len(os.listdir(path))
    print(total)

    for num in range(1, total + 1):
        fileName = path + str(num) + ".txt"
        resName = respath + str(num) + ".txt"
        # newline='' writes the '\r\n' below exactly as given
        with open(fileName, 'r', encoding='utf-8') as source, \
                open(resName, 'w', encoding='utf-8', newline='') as result:
            for line in source:  # iterating does not stop early at a blank line
                line = line.rstrip('\n')  # drop the trailing newline
                # sentence splitting; more delimiters can be added here
                strr = ''
                for i in re.split(u'【|】|-|_', line):
                    strr = strr + i + '\t'
                # punctuation cleaning; more characters can be added here
                string = re.sub(r"[\.\！\/_,$%^*(+\"\']+|[+——！，。？、~@#￥%……&*（）]+", "", strr)
                # stop-word removal: filter the whole token stream once
                segs = [seg for seg in jieba.cut(string, cut_all=False) if seg not in stopwords]
                output = ' '.join(segs)  # join the tokens with spaces
                print(output)
                result.write(output + '\r\n')
        print('End file: ' + str(num))
    print('End All')

# Run function
if __name__ == '__main__':
    read_file_cut()
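
The function above expects the input files to be named 1.txt, 2.txt and so on. If the file names are arbitrary, one option is to enumerate the directory instead of counting. The following is a sketch under that assumption; process_directory and its transform argument are hypothetical names, and transform stands for whatever per-line cleaning is wanted.

```python
import glob
import os

def process_directory(src_dir, dst_dir, transform):
    # Apply transform to every line of every .txt file in src_dir and write
    # the result to a file with the same name in dst_dir.
    os.makedirs(dst_dir, exist_ok=True)
    written = []
    for src_path in sorted(glob.glob(os.path.join(src_dir, "*.txt"))):
        dst_path = os.path.join(dst_dir, os.path.basename(src_path))
        with open(src_path, encoding="utf-8") as src, \
                open(dst_path, "w", encoding="utf-8") as dst:
            for line in src:
                dst.write(transform(line.rstrip("\n")) + "\n")
        written.append(dst_path)
    return written  # paths of the files that were produced
```

A call might look like process_directory("E:\\dataMining\\data", "E:\\dataMining\\result", clean_line), where clean_line is a function of your own that splits, strips punctuation and removes stop words for one line.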
