Course Project: Chinese Text Classification (Final Version)

阿新 • Published: 2018-12-22

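import os

import jieba

# Root directory of the corpus; each subdirectory name is used as the class label.
# Raw string so the backslashes are not treated as escape sequences.
path = r'H:\大三上大作業\python大作業\date'
with open(r'H:\大三上大作業\python大作業\stopsCN.txt', encoding='utf-8') as f:
    stopwords = set(f.read().split('\n'))  # a set keeps the membership test in processing() fast
# print(len(stopwords))  # check the number of stop words
# for w in stopwords:  # inspect the contents of the stop-word file
#     print(w)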

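# Text preprocessing
def processing(tokens):
    # Keep only alphabetic characters (letters and Chinese characters)
    tokens = "".join([char for char in tokens if char.isalpha()])
    # Segment with jieba in full mode and keep tokens of at least two characters
    tokens = [token for token in jieba.cut(tokens, cut_all=True) if len(token) >= 2]
    # Drop stop words and join the remaining tokens with spaces
    tokens = " ".join([token for token in tokens if token not in stopwords])
    return tokens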

tokenList = []
targetList = []
for root, dirs, files in os.walk(path):
    # print(root)   # current directory
    # print(dirs)   # subdirectories
    # print(files)  # file names
    for name in files:
        filePath = os.path.join(root, name)  # build the full file path
        with open(filePath, encoding='utf-8') as f:
            content = f.read()
        # The name of the parent directory is the class label
        target = os.path.basename(os.path.dirname(filePath))
        targetList.append(target)
        tokenList.append(processing(content))
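
As a small illustration of the first preprocessing step, the sketch below applies the same `str.isalpha()` filter to a made-up sentence. The helper name `keep_letters` and the sample string are assumptions for illustration only; `jieba` segmentation and stop-word removal are left out so that the snippet needs nothing beyond the standard library.

```python
def keep_letters(text):
    """Keep only characters for which str.isalpha() is true (letters and CJK ideographs)."""
    return "".join(ch for ch in text if ch.isalpha())


sample = "2018年12月22日，Python 3.7 發佈！"  # hypothetical input, not taken from the corpus
cleaned = keep_letters(sample)
print(cleaned)
```

Note that the filter also removes whitespace, digits, and punctuation, so any Latin words in the input end up directly adjacent to the surrounding Chinese characters before segmentation.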


# Modeling
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

x_train, x_test, y_train, y_test = train_test_split(tokenList, targetList, test_size=0.3, stratify=targetList)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(x_train)
X_test = vectorizer.transform(x_test)

mnb = MultinomialNB()
module = mnb.fit(X_train, y_train)
y_predict = module.predict(X_test)
# 5-fold cross-validation on the training split, so the test split stays held out
scores = cross_val_score(mnb, X_train, y_train, cv=5)
print("Cross-validation accuracy: %.3f" % scores.mean())
# classification_report expects (y_true, y_pred) in that order
print("Classification report:\n", classification_report(y_test, y_predict))
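
To make the feature step less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document. The function name `tfidf` and the toy corpus are assumptions for illustration; the sketch splits on whitespace only and does not reproduce the vectorizer's tokenization or lowercasing.

```python
import math
from collections import Counter


def tfidf(docs):
    """Map whitespace-separated token strings to a list of {term: weight} dicts."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term counts
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize so every non-empty document has unit length
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return rows


docs = ["足球 比賽 足球", "股票 市場", "足球 市場"]  # toy corpus, already segmented
rows = tfidf(docs)
for row in rows:
    print(row)
```

In the second toy document, `股票` occurs in only one document and therefore receives a larger weight than `市場`, which occurs in two.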

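import collections
# Number of articles per class in the test labels and in the predictions
testCount = collections.Counter(y_test)
predCount = collections.Counter(y_predict)
print('Actual:', testCount, '\n', 'Predicted:', predCount)
# Build the label list, then the actual and predicted counts in the same label order
nameList = list(testCount.keys())
testList = list(testCount.values())
predictList = [predCount[name] for name in nameList]  # look up by label; 0 if a class was never predicted
x = list(range(len(nameList)))
print("Classes:", nameList, '\n', "Actual:", testList, '\n', "Predicted:", predictList)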
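
Two `Counter` objects built from different lists do not necessarily share the same key order, or even the same keys, so per-class counts are safest to compare when each label is looked up explicitly. A minimal sketch with made-up labels (all names and values here are hypothetical):

```python
from collections import Counter

y_true = ["sports", "sports", "finance", "tech"]    # made-up true labels
y_pred = ["finance", "sports", "sports", "sports"]  # made-up predictions

true_count = Counter(y_true)
pred_count = Counter(y_pred)

# One shared, sorted label list drives both series
labels = sorted(set(true_count) | set(pred_count))
true_values = [true_count[name] for name in labels]
pred_values = [pred_count[name] for name in labels]  # a Counter returns 0 for a missing label

# Taking the values directly follows each Counter's own insertion order instead
naive_pred_values = list(pred_count.values())

print(labels, true_values, pred_values, naive_pred_values)
```

Here `naive_pred_values` has only two entries because `tech` was never predicted, so it could not be plotted against three class labels without misaligning the bars.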

# Plotting
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['FangSong']  # a font with Chinese glyphs, in case the class names are Chinese
plt.figure(figsize=(7, 5))
total_width, n = 0.6, 2
width = total_width / n
plt.bar(x, testList, width=width, label='Actual', fc='black')
# Shift the second series right by one bar width so the bars sit side by side
x_pred = [xi + width for xi in x]
plt.bar(x_pred, predictList, width=width, label='Predicted', tick_label=nameList, fc='r')
plt.grid()
plt.title('Actual vs. Predicted Counts', fontsize=17)
plt.xlabel('News category', fontsize=17)
plt.ylabel('Count', fontsize=17)
plt.legend(fontsize=17)
plt.tick_params(labelsize=15)
plt.show()
