ML.NET Tutorial: Sentiment Analysis (Binary Classification Problem)

阿新 • Published: 2018-12-09

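A machine learning workflow consists of the following steps:

Understand the problem
Prepare the data
- Load the data
- Extract features
Build and train
- Train the model
- Evaluate the model
Run
- Use the model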

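Understand the problem

The problem this tutorial addresses is how to take appropriate action based on the opinions expressed in comments on a website.

In the available training dataset, each website comment is labeled as either toxic (1) or not toxic (0). A classification task in machine learning is the best fit for this scenario.

Classification tasks assign data to a category, type, or class. Common examples include:

Identifying whether sentiment is positive or negative
Sorting email into spam and non-spam
Determining whether a patient's laboratory sample is cancerous
Categorizing customers by their preferences in order to respond to sales campaigns

Classification tasks can be binary or multiclass. The problem here is a binary classification problem.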

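Prepare the data

First, create a console application based on .NET Core. Once the project is set up, add the Microsoft.ML NuGet package. Then create a folder named Data in the project.

Next, download the wikipedia-detox-250-line-data.tsv and wikipedia-detox-250-line-test.tsv files and put them in the Data folder. Note that the Copy to Output Directory property of both files must be changed to Copy if newer.

Load the data

Add the following code to the Main method in Program.cs: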

MLContext mlContext = new MLContext(seed: 0);

_textLoader = mlContext.Data.TextReader(new TextLoader.Arguments()
{
    Separator = "tab",
    HasHeader = true,
    Column = new[]
                {
                    new TextLoader.Column("Label", DataKind.Bool, 0),
                    new TextLoader.Column("SentimentText", DataKind.Text, 1)
                }
});

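Its purpose is to prepare for loading the data by using the TextLoader class.

The Column property defines two objects, which correspond to the two columns in the dataset. Note that the first column must be named Label here rather than Sentiment, because the trainer and the evaluator used later look for the label in a column named Label.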
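
To make the column mapping more concrete, here is a minimal sketch in plain C# (no ML.NET) of how one tab-separated line could be split into a Boolean label and a text field. The TsvSketch class, the ParseLine helper and the sample rows are all hypothetical and made up for illustration; this is not how TextLoader is implemented, and the rows are not taken from the real data files.

```csharp
using System;

public static class TsvSketch
{
    // Hypothetical helper: splits one tab-separated line into a Boolean label and the comment text.
    public static (bool Label, string SentimentText) ParseLine(string line)
    {
        // Split on the first tab only; column 0 is the label, column 1 is the text.
        string[] parts = line.Split(new[] { '\t' }, 2);
        if (parts.Length != 2)
            throw new FormatException("Expected two tab-separated columns.");

        // 1 means toxic, 0 means not toxic.
        bool label = parts[0].Trim() == "1";
        return (label, parts[1]);
    }

    public static void Main()
    {
        // Made-up rows for illustration only; the first line plays the role of the header.
        string[] lines =
        {
            "Sentiment\tSentimentText",
            "1\tan invented toxic comment",
            "0\tan invented friendly comment"
        };

        // Start at index 1 to skip the header row, mirroring HasHeader = true.
        for (int i = 1; i < lines.Length; i++)
        {
            var row = ParseLine(lines[i]);
            Console.WriteLine($"{row.Label} | {row.SentimentText}");
        }
    }
}
```

The sketch reads column 0 as the label and column 1 as the text, matching the indices 0 and 1 passed to the two TextLoader.Column objects.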

Extract features

Create a new SentimentData.cs file and add the SentimentData and SentimentPrediction classes to it.

public class SentimentData
{
    [Column(ordinal: "0", name: "Label")]
    public float Sentiment;
    [Column(ordinal: "1")]
    public string SentimentText;
}

public class SentimentPrediction
{
    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }

    [ColumnName("Probability")]
    public float Probability { get; set; }

    [ColumnName("Score")]
    public float Score { get; set; }
}

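In the SentimentData class, SentimentText is the feature of the input dataset, and Sentiment is the dataset's label.

The SentimentPrediction class is used for predictions after the model has been trained.

Train the model

Add a Train method to the Program class. It first reads the training dataset, then converts the text in the SentimentText column into a floating-point array (the Features column) and sets up the decision-tree binary classifier used for training (FastTree, a gradient-boosted decision tree trainer). Finally, it trains the model.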

public static ITransformer Train(MLContext mlContext, string dataPath)
{
    IDataView dataView = _textLoader.Read(dataPath);
    var pipeline = mlContext.Transforms.Text.FeaturizeText("SentimentText", "Features")
        .Append(mlContext.BinaryClassification.Trainers.FastTree(numLeaves: 50, numTrees: 50, minDatapointsInLeaves: 20));

    Console.WriteLine("=============== Create and Train the Model ===============");
    var model = pipeline.Fit(dataView);
    Console.WriteLine("=============== End of training ===============");
    Console.WriteLine();

    return model;
}
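
To give a feel for what "converting text into a floating-point array" means, the following is a deliberately simplified sketch in plain C# that counts word occurrences against a small vocabulary (a bag-of-words vector). It is only a stand-in under that assumption: FeaturizeText does considerably more than this, and the BagOfWordsSketch class, its methods and the sample sentences are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class BagOfWordsSketch
{
    // Lower-cases the text and splits it on spaces.
    static IEnumerable<string> Tokenize(string text) =>
        text.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    // Assigns each distinct word an index in order of first appearance.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> texts)
    {
        var vocabulary = new Dictionary<string, int>();
        foreach (string text in texts)
            foreach (string word in Tokenize(text))
                if (!vocabulary.ContainsKey(word))
                    vocabulary.Add(word, vocabulary.Count);
        return vocabulary;
    }

    // Turns a text into a fixed-length float vector of word counts; unknown words are ignored.
    public static float[] Featurize(string text, Dictionary<string, int> vocabulary)
    {
        var features = new float[vocabulary.Count];
        foreach (string word in Tokenize(text))
            if (vocabulary.TryGetValue(word, out int index))
                features[index] += 1f;
        return features;
    }

    public static void Main()
    {
        // Made-up training sentences; vocabulary becomes you=0, are=1, rude=2, kind=3.
        var vocabulary = BuildVocabulary(new[] { "you are rude", "you are kind" });
        float[] vector = Featurize("rude rude you", vocabulary);
        Console.WriteLine(string.Join(", ", vector)); // prints "1, 0, 2, 0"
    }
}
```

Whatever the featurizer does internally, the trainer only ever sees a numeric vector of this kind, which is why the text column has to be transformed before FastTree is appended to the pipeline.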

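Evaluate the model

Add an Evaluate method. This step reads the test dataset, and the loaded data must again be passed through the model's transformations so that it is converted into the appropriate data types.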

public static void Evaluate(MLContext mlContext, ITransformer model)
{
    IDataView dataView = _textLoader.Read(_testDataPath);
    Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
    var predictions = model.Transform(dataView);

    var metrics = mlContext.BinaryClassification.Evaluate(predictions, "Label");
    Console.WriteLine();
    Console.WriteLine("Model quality metrics evaluation");
    Console.WriteLine("--------------------------------");
    Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
    Console.WriteLine($"Auc: {metrics.Auc:P2}");
    Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
    Console.WriteLine("=============== End of model evaluation ===============");
}
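
As a rough illustration of what two of the printed metrics measure, the sketch below computes accuracy and F1 from confusion-matrix counts in plain C#. The MetricsSketch class and the counts are hypothetical and chosen only to make the arithmetic easy to follow; AUC is left out because it cannot be derived from the four counts alone.

```csharp
using System;
using System.Globalization;

public static class MetricsSketch
{
    // Accuracy: share of all predictions that were correct.
    public static double Accuracy(int tp, int fp, int tn, int fn) =>
        (double)(tp + tn) / (tp + fp + tn + fn);

    // F1: harmonic mean of precision and recall.
    // Assumes at least one predicted positive and one actual positive.
    public static double F1(int tp, int fp, int fn)
    {
        double precision = (double)tp / (tp + fp); // correct among predicted positives
        double recall = (double)tp / (tp + fn);    // found among actual positives
        return 2 * precision * recall / (precision + recall);
    }

    public static void Main()
    {
        // Hypothetical counts: true positives, false positives, true negatives, false negatives.
        int tp = 9, fp = 3, tn = 6, fn = 0;

        // Invariant culture keeps the decimal separator independent of the machine's locale.
        Console.WriteLine(Accuracy(tp, fp, tn, fn).ToString("F4", CultureInfo.InvariantCulture)); // 0.8333
        Console.WriteLine(F1(tp, fp, fn).ToString("F4", CultureInfo.InvariantCulture));           // 0.8571
    }
}
```

With these counts, precision is 9/12 = 0.75 and recall is 9/9 = 1, so F1 is 1.5/1.75, or about 0.857, while accuracy is 15/18, or about 0.833.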

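Use the model

Once the model has been trained and evaluated, it is ready to use. This requires creating a prediction object (PredictionFunction) whose Predict method takes a SentimentData instance as input and returns a SentimentPrediction.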

private static void Predict(MLContext mlContext, ITransformer model)
{
    var predictionFunction = model.MakePredictionFunction<SentimentData, SentimentPrediction>(mlContext);
    SentimentData sampleStatement = new SentimentData
    {
        SentimentText = "This is a very rude movie"
    };

    var resultprediction = predictionFunction.Predict(sampleStatement);

    Console.WriteLine();
    Console.WriteLine("=============== Prediction Test of model with a single sample and test dataset ===============");

    Console.WriteLine();
    Console.WriteLine($"Sentiment: {sampleStatement.SentimentText} | Prediction: {(resultprediction.Prediction ? "Toxic" : "Not Toxic")} | Probability: {resultprediction.Probability} ");
    Console.WriteLine("=============== End of Predictions ===============");
    Console.WriteLine();
}

Complete sample code

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Core.Data;
using Microsoft.ML.Runtime.Data;
using Microsoft.ML.Transforms.Text;

namespace SentimentAnalysis
{
    class Program
    {
        static readonly string _trainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "wikipedia-detox-250-line-data.tsv");
        static readonly string _testDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "wikipedia-detox-250-line-test.tsv");
        static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "Model.zip");
        static TextLoader _textLoader;

        static void Main(string[] args)
        {
            MLContext mlContext = new MLContext(seed: 0);

            _textLoader = mlContext.Data.TextReader(new TextLoader.Arguments()
            {
                Separator = "tab",
                HasHeader = true,
                Column = new[]
                            {
                                new TextLoader.Column("Label", DataKind.Bool, 0),
                                new TextLoader.Column("SentimentText", DataKind.Text, 1)
                            }
            });

            var model = Train(mlContext, _trainDataPath);

            Evaluate(mlContext, model);

            Predict(mlContext, model);

            Console.Read();
        }

        public static ITransformer Train(MLContext mlContext, string dataPath)
        {
            IDataView dataView = _textLoader.Read(dataPath);
            var pipeline = mlContext.Transforms.Text.FeaturizeText("SentimentText", "Features")
                .Append(mlContext.BinaryClassification.Trainers.FastTree(numLeaves: 50, numTrees: 50, minDatapointsInLeaves: 20));

            Console.WriteLine("=============== Create and Train the Model ===============");
            var model = pipeline.Fit(dataView);
            Console.WriteLine("=============== End of training ===============");
            Console.WriteLine();

            return model;
        }

        public static void Evaluate(MLContext mlContext, ITransformer model)
        {
            IDataView dataView = _textLoader.Read(_testDataPath);
            Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
            var predictions = model.Transform(dataView);

            var metrics = mlContext.BinaryClassification.Evaluate(predictions, "Label");
            Console.WriteLine();
            Console.WriteLine("Model quality metrics evaluation");
            Console.WriteLine("--------------------------------");
            Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
            Console.WriteLine($"Auc: {metrics.Auc:P2}");
            Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
            Console.WriteLine("=============== End of model evaluation ===============");
        }

        private static void Predict(MLContext mlContext, ITransformer model)
        {
            var predictionFunction = model.MakePredictionFunction<SentimentData, SentimentPrediction>(mlContext);
            SentimentData sampleStatement = new SentimentData
            {
                SentimentText = "This is a very rude movie"
            };

            var resultprediction = predictionFunction.Predict(sampleStatement);

            Console.WriteLine();
            Console.WriteLine("=============== Prediction Test of model with a single sample and test dataset ===============");

            Console.WriteLine();
            Console.WriteLine($"Sentiment: {sampleStatement.SentimentText} | Prediction: {(resultprediction.Prediction ? "Toxic" : "Not Toxic")} | Probability: {resultprediction.Probability} ");
            Console.WriteLine("=============== End of Predictions ===============");
            Console.WriteLine();
        }
    }
}

Output displayed when the program runs:

=============== Create and Train the Model ===============
=============== End of training ===============

=============== Evaluating Model accuracy with Test data===============

Model quality metrics evaluation
--------------------------------
Accuracy: 83.33%
Auc: 98.77%
F1Score: 85.71%
=============== End of model evaluation ===============

=============== Prediction Test of model with a single sample and test dataset ===============

Sentiment: This is a very rude movie | Prediction: Toxic | Probability: 0.7387648
=============== End of Predictions ===============

As the output shows, when predicting the comment "This is a very rude movie", the model judges it to be toxic :-)
