1. Automatic text classification assigns large volumes of unstructured text (documents, web pages, etc.) to predefined categories in a given taxonomy according to their content; it is a supervised learning process.
Using statistics-based methods together with a vector space model, common text and web-page content can be classified with an accuracy above 85%, at a speed of roughly 50 documents per second.
2. Before you can classify, you must segment the text into words. For word segmentation, see the article 常見的四種文字自動分詞詳解及IK Analyze的程式碼實現 (a walkthrough of four common automatic word-segmentation approaches, with an IK Analyzer implementation).
3. Without further ado, straight to the code. Theory background: https://www.cnblogs.com/pinard/p/6069267.html
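As a quick, language-agnostic sketch of what the theory link covers: multinomial Naive Bayes scores each class by a log prior plus per-word log likelihoods with Laplace smoothing. The toy corpus, labels, and word counts below are made up purely for illustration (Python rather than the Scala used later), with `lam = 1.0` matching the `lambda = 1.0` passed to `NaiveBayes.train` in the code:

```python
import math

# Toy corpus: word-count dicts per document plus a class label
# (word counts are the "multinomial" in multinomial Naive Bayes).
train = [
    ({"spark": 2, "rdd": 1}, "tech"),
    ({"spark": 1, "scala": 1}, "tech"),
    ({"goal": 2, "match": 1}, "sport"),
]
vocab = sorted({w for doc, _ in train for w in doc})
lam = 1.0  # Laplace smoothing, same role as lambda = 1.0 in NaiveBayes.train

def fit(train):
    # Accumulate per-class document counts and per-class word counts.
    classes = {}
    for doc, label in train:
        cls = classes.setdefault(label, {"docs": 0, "counts": {}})
        cls["docs"] += 1
        for w, c in doc.items():
            cls["counts"][w] = cls["counts"].get(w, 0) + c
    return classes

def predict(classes, doc):
    total_docs = sum(c["docs"] for c in classes.values())
    best, best_lp = None, float("-inf")
    for label, cls in classes.items():
        lp = math.log(cls["docs"] / total_docs)          # log prior
        denom = sum(cls["counts"].values()) + lam * len(vocab)
        for w, c in doc.items():
            num = cls["counts"].get(w, 0) + lam          # smoothed count
            lp += c * math.log(num / denom)              # log likelihood
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = fit(train)
print(predict(model, {"spark": 1, "rdd": 2}))  # → tech
print(predict(model, {"goal": 1}))             # → sport
```

Smoothing matters here: without it, any word unseen in a class would send that class's log probability to minus infinity.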
4. Code
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row

object TestNaiveBayes {

  case class RawDataRecord(category: String, text: String)

  def main(args: Array[String]) {
    /*val conf = new SparkConf().setMaster("yarn-client")
    val sc = new SparkContext(conf)*/
    val conf = new SparkConf().setMaster("local").setAppName("reduce")
    val sc = new SparkContext(conf)
    val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    import sqlContext.implicits._

    // Each input line has the form "label,word word word ..."
    var srcRDD = sc.textFile("C:/Users/dell/Desktop/大資料/分類細胞詞庫").map {
      x =>
        var data = x.split(",")
        RawDataRecord(data(0), data(1))
    }
    var trainingDF = srcRDD.toDF()

    // Split the text field into an array of words
    var tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    var wordsData = tokenizer.transform(trainingDF)
    println("output1:")
    wordsData.select($"category", $"text", $"words").take(2).foreach(println)

    // Term frequencies per document, via the hashing trick
    var hashingTF = new HashingTF().setNumFeatures(500000).setInputCol("words").setOutputCol("rawFeatures")
    var featurizedData = hashingTF.transform(wordsData)
    println("output2:")
    featurizedData.select($"category", $"words", $"rawFeatures").take(2).foreach(println)

    // TF-IDF weight for each term
    var idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    var idfModel = idf.fit(featurizedData)
    var rescaledData = idfModel.transform(featurizedData)
    println("output3:")
    rescaledData.select($"category", $"features").take(2).foreach(println)

    // Convert to the LabeledPoint input format NaiveBayes expects
    var trainDataRdd = rescaledData.select($"category", $"features").map {
      case Row(label: String, features: Vector) =>
        LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
    }
    println("output4:")
    trainDataRdd.take(2)

    // Train the classifier; the "training model" is just the pre-labelled category lexicon
    val model = NaiveBayes.train(trainDataRdd, lambda = 1.0, modelType = "multinomial")

    // Load the hot-word test data
    var srcRDD1 = sc.textFile("C:/Users/dell/Desktop/大資料/熱詞細胞詞庫/熱詞資料1.txt").map {
      x =>
        var data = x.split(",")
        RawDataRecord(data(0), data(1))
    }
    var testDF = srcRDD1.toDF()

    // Apply the same feature extraction and format conversion to the hot-word data
    var testwordsData = tokenizer.transform(testDF)
    var testfeaturizedData = hashingTF.transform(testwordsData)
    var testrescaledData = idfModel.transform(testfeaturizedData)
    var testDataRdd = testrescaledData.select($"category", $"features").map {
      case Row(label: String, features: Vector) =>
        LabeledPoint(label.toDouble, Vectors.dense(features.toArray))
    }

    // Predict a class for each hot-word record with the trained model
    val testpredictionAndLabel = testDataRdd.map(p => (model.predict(p.features), p.label))
    println("output5:")
    testpredictionAndLabel.foreach(println)
  }
}
```
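What `HashingTF` and `IDF` compute above can be sketched outside Spark. This is a toy Python illustration of the hashing trick and smoothed IDF weighting, not Spark's implementation: the 20-bucket feature space and the use of Python's built-in `hash` are made-up simplifications (Spark uses a much larger space, 500,000 buckets in the code above, and its own hash function):

```python
import math

num_features = 20  # tiny on purpose; real pipelines use a much larger space

def hashing_tf(words, n=num_features):
    # Hashing trick: bucket index = hash(word) mod n, value = term count.
    # No vocabulary is stored; collisions are accepted as a trade-off.
    vec = [0.0] * n
    for w in words:
        vec[hash(w) % n] += 1.0
    return vec

docs = [
    "spark rdd spark".split(),
    "scala spark".split(),
    "goal match goal".split(),
]
tfs = [hashing_tf(d) for d in docs]

# Smoothed IDF per bucket: log((N + 1) / (df + 1)), where df is the number
# of documents whose vector is non-zero in that bucket.
n_docs = len(tfs)
df = [sum(1 for v in tfs if v[j] > 0) for j in range(num_features)]
idf = [math.log((n_docs + 1) / (df[j] + 1)) for j in range(num_features)]

# TF-IDF: element-wise product of term frequency and IDF weight.
tfidf = [[v[j] * idf[j] for j in range(num_features)] for v in tfs]
```

The key property the pipeline relies on is that the test data must go through the *same* `tokenizer`, `hashingTF`, and fitted `idfModel` as the training data, so that bucket indices and IDF weights mean the same thing in both feature spaces.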
I found this code online several days ago and had lost the source; here it is: https://blog.csdn.net/yumingzhu1/article/details/85064047 (will take this down on request).
5. Jar dependencies
You may not need all of them; check for yourself.
If anything is missing or unclear, leave a comment. It's late, so that's all for now.