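ML.NET Tutorial: Sentiment Analysis (a Binary Classification Problem)
A machine learning workflow consists of the following steps:
- Understand the problem
- Prepare the data
  - Load the data
  - Extract features
- Build and train
  - Train the model
  - Evaluate the model
- Run
  - Use the model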
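Understand the problem
The problem this tutorial addresses is how to take appropriate action based on the sentiment of comments posted on a website.
In the available training dataset, each website comment is labeled as either toxic (1) or not toxic (0). A classification task is the best fit for this scenario.
Classification tasks sort data into categories, types, or classes. Common examples include:
- Identifying sentiment as positive or negative
- Classifying email as spam or not spam
- Determining whether a patient's lab sample is cancerous
- Categorizing customers by how likely they are to respond to a sales campaign
Classification can be binary or multiclass. The problem here is a binary classification problem.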
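Prepare the data
First, create a console application that targets .NET Core. Once the project is set up, add the Microsoft.ML NuGet package. Then create a folder named Data in the project.
Next, download the wikipedia-detox-250-line-data.tsv and wikipedia-detox-250-line-test.tsv files and place them in the Data folder. Note that the Copy to Output Directory property of both files must be set to Copy if newer.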
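Load the data
Add the following code to the Main method in Program.cs:
MLContext mlContext = new MLContext(seed: 0);
// Describe the layout of the tab-separated input files.
_textLoader = mlContext.Data.TextReader(new TextLoader.Arguments()
{
    Separator = "tab",
    HasHeader = true,
    Column = new[]
    {
        new TextLoader.Column("Label", DataKind.Bool, 0),
        new TextLoader.Column("SentimentText", DataKind.Text, 1)
    }
});
This code prepares for loading the data by configuring a TextLoader.
The Column property holds two objects, one for each of the two columns in the dataset. Note that the first column must be named Label here rather than Sentiment, because the trainer looks for a column named Label by default.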
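To make the column mapping concrete, here is a minimal sketch in plain C# of what the loader configuration above describes: skip the header row, split each line on tabs, read column 0 as a boolean label and column 1 as the comment text. It does not use ML.NET. The TsvSketch class, the Parse method, the header text, and the sample rows are all hypothetical and are not taken from the real dataset.

```csharp
using System;
using System.Collections.Generic;

static class TsvSketch
{
    // Parses "label<TAB>text" lines and skips the first (header) row,
    // mirroring Separator = "tab" and HasHeader = true.
    public static List<(bool Label, string SentimentText)> Parse(IEnumerable<string> lines)
    {
        var rows = new List<(bool Label, string SentimentText)>();
        bool isHeader = true;
        foreach (string line in lines)
        {
            if (isHeader) { isHeader = false; continue; }
            string[] cells = line.Split('\t');
            if (cells.Length < 2) continue; // ignore malformed lines
            // Column 0: "1" means toxic, anything else is treated as not toxic.
            rows.Add((cells[0].Trim() == "1", cells[1]));
        }
        return rows;
    }

    static void Main()
    {
        // Hypothetical sample rows for illustration only.
        string[] sample =
        {
            "Sentiment\tSentimentText",
            "1\tYou are a terrible person",
            "0\tThanks for fixing the typo"
        };
        foreach (var row in Parse(sample))
            Console.WriteLine($"{row.Label} | {row.SentimentText}");
    }
}
```

The real TextLoader does more than this (type conversion rules, quoting, lazy reading), so treat the sketch only as a picture of the file layout.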
Extract features
Create a new file named SentimentData.cs and add the SentimentData and SentimentPrediction classes to it.
public class SentimentData
{
    [Column(ordinal: "0", name: "Label")]
    public float Sentiment;
    [Column(ordinal: "1")]
    public string SentimentText;
}

public class SentimentPrediction
{
    [ColumnName("PredictedLabel")]
    public bool Prediction { get; set; }
    [ColumnName("Probability")]
    public float Probability { get; set; }
    [ColumnName("Score")]
    public float Score { get; set; }
}
In the SentimentData class, SentimentText is the feature of the input dataset, and Sentiment is the dataset's label.
The SentimentPrediction class holds the predictions made by the trained model.
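Train the model
Add a Train method to the Program class. It first reads the training dataset, then converts the text in the feature column into a vector of floats and configures the trainer, FastTree, a binary classifier based on gradient-boosted decision trees. Finally, it trains the model.
public static ITransformer Train(MLContext mlContext, string dataPath)
{
    // Read the training data.
    IDataView dataView = _textLoader.Read(dataPath);
    // Featurize the SentimentText column into Features, then append the FastTree trainer.
    var pipeline = mlContext.Transforms.Text.FeaturizeText("SentimentText", "Features")
        .Append(mlContext.BinaryClassification.Trainers.FastTree(numLeaves: 50, numTrees: 50, minDatapointsInLeaves: 20));
    Console.WriteLine("=============== Create and Train the Model ===============");
    // Fit runs the whole pipeline and returns the trained model.
    var model = pipeline.Fit(dataView);
    Console.WriteLine("=============== End of training ===============");
    Console.WriteLine();
    return model;
}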
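The idea of turning text into a vector of floats can be illustrated with a simple bag-of-words count. The sketch below is plain C# and is only an illustration of the concept. It is not the algorithm FeaturizeText uses, which is more elaborate. The BagOfWordsSketch class and its methods are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

static class BagOfWordsSketch
{
    // Assigns each distinct lowercase word in the corpus its own index.
    public static Dictionary<string, int> BuildVocabulary(IEnumerable<string> corpus)
    {
        var vocabulary = new Dictionary<string, int>();
        foreach (string text in corpus)
            foreach (string word in Tokenize(text))
                if (!vocabulary.ContainsKey(word))
                    vocabulary.Add(word, vocabulary.Count);
        return vocabulary;
    }

    // Counts how often each vocabulary word occurs in the text.
    // Words outside the vocabulary are ignored.
    public static float[] Featurize(string text, Dictionary<string, int> vocabulary)
    {
        var features = new float[vocabulary.Count];
        foreach (string word in Tokenize(text))
            if (vocabulary.TryGetValue(word, out int index))
                features[index] += 1f;
        return features;
    }

    static string[] Tokenize(string text) =>
        text.ToLowerInvariant().Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

    static void Main()
    {
        var corpus = new[] { "very rude comment", "very nice comment" };
        var vocabulary = BuildVocabulary(corpus); // very=0, rude=1, comment=2, nice=3
        float[] vector = Featurize("rude rude comment", vocabulary);
        Console.WriteLine(string.Join(", ", vector)); // prints "0, 2, 1, 0"
    }
}
```

Once every comment is a fixed-length numeric vector like this, a tree-based trainer can split on individual vector positions, which is why the featurization step comes before the trainer in the pipeline.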
Evaluate the model
Add an Evaluate method. This step reads the test dataset, and the loaded data still has to go through the same transformations as the training data, which model.Transform applies before producing predictions.
public static void Evaluate(MLContext mlContext, ITransformer model)
{
    // Read the test data.
    IDataView dataView = _textLoader.Read(_testDataPath);
    Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
    // Apply the trained model to the test data.
    var predictions = model.Transform(dataView);
    // Compare the predictions against the Label column.
    var metrics = mlContext.BinaryClassification.Evaluate(predictions, "Label");
    Console.WriteLine();
    Console.WriteLine("Model quality metrics evaluation");
    Console.WriteLine("--------------------------------");
    Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
    Console.WriteLine($"Auc: {metrics.Auc:P2}");
    Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
    Console.WriteLine("=============== End of model evaluation ===============");
}
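For readers unfamiliar with the metrics printed above, the following plain C# sketch shows how accuracy and F1 score follow from the four counts of a binary confusion matrix. It uses the standard textbook definitions and does not call ML.NET. The MetricsSketch class and the counts in Main are hypothetical, since the source does not show the actual confusion matrix. AUC is left out because it needs the full ranking of scores, not just four counts.

```csharp
using System;

static class MetricsSketch
{
    // Accuracy: fraction of all predictions that are correct.
    public static double Accuracy(int tp, int tn, int fp, int fn) =>
        (double)(tp + tn) / (tp + tn + fp + fn);

    // F1 score: harmonic mean of precision and recall.
    // Assumes tp > 0 so that no denominator is zero.
    public static double F1Score(int tp, int fp, int fn)
    {
        double precision = (double)tp / (tp + fp);
        double recall = (double)tp / (tp + fn);
        return 2 * precision * recall / (precision + recall);
    }

    static void Main()
    {
        // Hypothetical counts: true/false positives and negatives.
        int tp = 9, tn = 6, fp = 3, fn = 0;
        Console.WriteLine($"Accuracy: {Accuracy(tp, tn, fp, fn):P2}");
        Console.WriteLine($"F1Score: {F1Score(tp, fp, fn):P2}");
    }
}
```

With these counts the accuracy is 15/18 and the F1 score is 6/7. A model can show a noticeably higher F1 score than accuracy when, as here, it finds every positive example but raises some false alarms.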
Use the model
Once the model has been trained and evaluated, it is ready to use. This requires creating a prediction object (PredictionFunction) whose Predict method takes a SentimentData instance as input and returns a SentimentPrediction.
private static void Predict(MLContext mlContext, ITransformer model)
{
    // Wrap the model in a function that predicts one sample at a time.
    var predictionFunction = model.MakePredictionFunction<SentimentData, SentimentPrediction>(mlContext);
    SentimentData sampleStatement = new SentimentData
    {
        SentimentText = "This is a very rude movie"
    };
    var resultprediction = predictionFunction.Predict(sampleStatement);
    Console.WriteLine();
    Console.WriteLine("=============== Prediction Test of model with a single sample and test dataset ===============");
    Console.WriteLine();
    Console.WriteLine($"Sentiment: {sampleStatement.SentimentText} | Prediction: {(Convert.ToBoolean(resultprediction.Prediction) ? "Toxic" : "Not Toxic")} | Probability: {resultprediction.Probability} ");
    Console.WriteLine("=============== End of Predictions ===============");
    Console.WriteLine();
}
Complete example code
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Core.Data;
using Microsoft.ML.Runtime.Data;
using Microsoft.ML.Transforms.Text;

namespace SentimentAnalysis
{
    class Program
    {
        static readonly string _trainDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "wikipedia-detox-250-line-data.tsv");
        static readonly string _testDataPath = Path.Combine(Environment.CurrentDirectory, "Data", "wikipedia-detox-250-line-test.tsv");
        static readonly string _modelPath = Path.Combine(Environment.CurrentDirectory, "Data", "Model.zip");
        static TextLoader _textLoader;

        static void Main(string[] args)
        {
            MLContext mlContext = new MLContext(seed: 0);
            _textLoader = mlContext.Data.TextReader(new TextLoader.Arguments()
            {
                Separator = "tab",
                HasHeader = true,
                Column = new[]
                {
                    new TextLoader.Column("Label", DataKind.Bool, 0),
                    new TextLoader.Column("SentimentText", DataKind.Text, 1)
                }
            });
            var model = Train(mlContext, _trainDataPath);
            Evaluate(mlContext, model);
            Predict(mlContext, model);
            Console.Read();
        }

        public static ITransformer Train(MLContext mlContext, string dataPath)
        {
            IDataView dataView = _textLoader.Read(dataPath);
            var pipeline = mlContext.Transforms.Text.FeaturizeText("SentimentText", "Features")
                .Append(mlContext.BinaryClassification.Trainers.FastTree(numLeaves: 50, numTrees: 50, minDatapointsInLeaves: 20));
            Console.WriteLine("=============== Create and Train the Model ===============");
            var model = pipeline.Fit(dataView);
            Console.WriteLine("=============== End of training ===============");
            Console.WriteLine();
            return model;
        }

        public static void Evaluate(MLContext mlContext, ITransformer model)
        {
            IDataView dataView = _textLoader.Read(_testDataPath);
            Console.WriteLine("=============== Evaluating Model accuracy with Test data===============");
            var predictions = model.Transform(dataView);
            var metrics = mlContext.BinaryClassification.Evaluate(predictions, "Label");
            Console.WriteLine();
            Console.WriteLine("Model quality metrics evaluation");
            Console.WriteLine("--------------------------------");
            Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
            Console.WriteLine($"Auc: {metrics.Auc:P2}");
            Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
            Console.WriteLine("=============== End of model evaluation ===============");
        }

        private static void Predict(MLContext mlContext, ITransformer model)
        {
            var predictionFunction = model.MakePredictionFunction<SentimentData, SentimentPrediction>(mlContext);
            SentimentData sampleStatement = new SentimentData
            {
                SentimentText = "This is a very rude movie"
            };
            var resultprediction = predictionFunction.Predict(sampleStatement);
            Console.WriteLine();
            Console.WriteLine("=============== Prediction Test of model with a single sample and test dataset ===============");
            Console.WriteLine();
            Console.WriteLine($"Sentiment: {sampleStatement.SentimentText} | Prediction: {(Convert.ToBoolean(resultprediction.Prediction) ? "Toxic" : "Not Toxic")} | Probability: {resultprediction.Probability} ");
            Console.WriteLine("=============== End of Predictions ===============");
            Console.WriteLine();
        }
    }
}
Running the program produces the following output:
=============== Create and Train the Model ===============
=============== End of training ===============
=============== Evaluating Model accuracy with Test data===============
Model quality metrics evaluation
--------------------------------
Accuracy: 83.33%
Auc: 98.77%
F1Score: 85.71%
=============== End of model evaluation ===============
=============== Prediction Test of model with a single sample and test dataset ===============
Sentiment: This is a very rude movie | Prediction: Toxic | Probability: 0.7387648
=============== End of Predictions ===============
As you can see, when asked to predict the comment "This is a very rude movie", the model judges it to be toxic :-)