Spark spam email classification (Scala + Java)

阿新 • Published: 2017-12-22


Java program

import java.util.Arrays;

import org.apache.spark.SparkConf;

import org.apache.spark.api.java.JavaRDD;

import org.apache.spark.api.java.JavaSparkContext;

import org.apache.spark.api.java.function.Function;

import org.apache.spark.mllib.classification.LogisticRegressionModel;

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD;

import org.apache.spark.mllib.feature.HashingTF;

import org.apache.spark.mllib.linalg.Vector;

import org.apache.spark.mllib.regression.LabeledPoint;

/**

* Created by hui on 2017/11/29.

*/

public class MLlib {

public static void main(String[] args) {

SparkConf sparkConf = new SparkConf().setAppName("JavaBookExample").setMaster("local");

JavaSparkContext sc = new JavaSparkContext(sparkConf);

// Load 2 types of emails from text files: spam and ham (non-spam).

// Each line has text from one email.

JavaRDD<String> spam = sc.textFile("files/spam.txt");

JavaRDD<String> ham = sc.textFile("files/ham.txt");

// Create a HashingTF instance to map email text to vectors of 100 features.

final HashingTF tf = new HashingTF(100);

// Each email is split into words, and each word is mapped to one feature.

// Create LabeledPoint datasets for positive (spam) and negative (ham) examples.

JavaRDD<LabeledPoint> positiveExamples = spam.map(new Function<String, LabeledPoint>() {

@Override public LabeledPoint call(String email) {

return new LabeledPoint(1, tf.transform(Arrays.asList(email.split(" "))));

}

});

JavaRDD<LabeledPoint> negativeExamples = ham.map(new Function<String, LabeledPoint>() {

@Override public LabeledPoint call(String email) {

return new LabeledPoint(0, tf.transform(Arrays.asList(email.split(" "))));

}

});

JavaRDD<LabeledPoint> trainingData = positiveExamples.union(negativeExamples);

trainingData.cache(); // Cache data since Logistic Regression is an iterative algorithm.

// Create a Logistic Regression learner which uses the SGD optimizer.

LogisticRegressionWithSGD lrLearner = new LogisticRegressionWithSGD();

// Run the actual learning algorithm on the training data.

LogisticRegressionModel model = lrLearner.run(trainingData.rdd());

// Test on a positive example (spam) and a negative one (ham).

// First apply the same HashingTF feature transformation used on the training data.

Vector posTestExample =

tf.transform(Arrays.asList("O M G GET cheap stuff by sending money to ...".split(" ")));

Vector negTestExample =

tf.transform(Arrays.asList("Hi Dad, I started studying Spark the other ...".split(" ")));

// Now use the learned model to predict spam/ham for new emails.

System.out.println("Prediction for positive test example: " + model.predict(posTestExample));

System.out.println("Prediction for negative test example: " + model.predict(negTestExample));

sc.stop();

}

}
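The comments in the program above say that each email is split into words and each word is mapped to one of 100 features. The following is a minimal sketch of that hashing idea in plain Java, without Spark. The class name HashingTfSketch and its methods are hypothetical, and the use of String.hashCode() is an assumption made for illustration; the hash function inside Spark's HashingTF may differ, so bucket indices will not necessarily match.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only (hypothetical class); not Spark's HashingTF implementation.
final class HashingTfSketch {

    // Map a term to a bucket in [0, numFeatures) using String.hashCode().
    // Math.floorMod keeps the index non-negative even for negative hash codes.
    static int featureIndex(String term, int numFeatures) {
        return Math.floorMod(term.hashCode(), numFeatures);
    }

    // Count how many terms fall into each bucket (a dense term-frequency vector).
    static double[] transform(List<String> terms, int numFeatures) {
        double[] vector = new double[numFeatures];
        for (String term : terms) {
            vector[featureIndex(term, numFeatures)] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("cheap stuff cheap money".split(" "));
        double[] features = transform(words, 100);
        int bucket = featureIndex("cheap", 100);
        System.out.println("cheap -> bucket " + bucket + ", count " + features[bucket]);
    }
}
```

With only 100 buckets, different words can collide in the same bucket. That trade-off is what keeps the vector size fixed regardless of vocabulary size.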

Scala program

import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

import org.apache.spark.mllib.feature.HashingTF

import org.apache.spark.mllib.regression.LabeledPoint

import org.apache.spark.{SparkConf, SparkContext}

/**

* Created by hui on 2017/11/23.

*/

object email {

def main(args:Array[String]): Unit = {

val conf = new SparkConf().setAppName(s"Book example: Scala").setMaster("local")

val sc = new SparkContext(conf)

// Load 2 types of emails from text files: spam and ham (non-spam).

// Each line has text from one email.

val spam = sc.textFile("files/spam.txt")

val ham = sc.textFile("files/ham.txt")

// Create a HashingTF instance to map email text to vectors of 100 features.

val tf = new HashingTF(numFeatures = 100)

// Each email is split into words, and each word is mapped to one feature.

val spamFeatures = spam.map(email => tf.transform(email.split(" ")))

val hamFeatures = ham.map(email => tf.transform(email.split(" ")))

// Create LabeledPoint datasets for positive (spam) and negative (ham) examples.

val positiveExamples = spamFeatures.map(features => LabeledPoint(1, features))

val negativeExamples = hamFeatures.map(features => LabeledPoint(0, features))

val trainingData = positiveExamples ++ negativeExamples

trainingData.cache() // Cache data since Logistic Regression is an iterative algorithm.

// Create a Logistic Regression learner which uses the SGD optimizer.

val lrLearner = new LogisticRegressionWithSGD()

// Run the actual learning algorithm on the training data.

val model = lrLearner.run(trainingData)

// Test on a positive example (spam) and a negative one (ham).

// First apply the same HashingTF feature transformation used on the training data.

val posTestExample = tf.transform("O M G GET cheap stuff by sending money to ...".split(" "))

val negTestExample = tf.transform("Hi Dad, I started studying Spark the other ...".split(" "))

// Now use the learned model to predict spam/ham for new emails.

println(s"Prediction for positive test example: ${model.predict(posTestExample)}")

println(s"Prediction for negative test example: ${model.predict(negTestExample)}")

sc.stop()

}

}
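Both programs train a logistic regression model and then call predict on a feature vector. As a rough illustration of what happens behind those calls, here is a plain-Java sketch that fits weights with full-batch gradient descent and classifies with a 0.5 probability threshold. This is a simplification, not Spark's LogisticRegressionWithSGD: the class name LogisticSketch, the step size, the iteration count, and the toy data are all assumptions, and the sketch has no intercept and no regularization.

```java
// Illustrative sketch only (hypothetical class); not Spark's implementation.
final class LogisticSketch {

    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    static double dot(double[] w, double[] x) {
        double s = 0.0;
        for (int i = 0; i < w.length; i++) {
            s += w[i] * x[i];
        }
        return s;
    }

    // Full-batch gradient descent on the logistic loss.
    static double[] train(double[][] xs, double[] ys, int iterations, double stepSize) {
        double[] w = new double[xs[0].length];
        for (int it = 0; it < iterations; it++) {
            double[] grad = new double[w.length];
            for (int n = 0; n < xs.length; n++) {
                // Difference between predicted probability and the true label.
                double error = sigmoid(dot(w, xs[n])) - ys[n];
                for (int i = 0; i < w.length; i++) {
                    grad[i] += error * xs[n][i];
                }
            }
            for (int i = 0; i < w.length; i++) {
                w[i] -= stepSize * grad[i] / xs.length;
            }
        }
        return w;
    }

    // Returns 1.0 (spam) when the estimated probability exceeds 0.5, else 0.0 (ham).
    static double predict(double[] w, double[] x) {
        return sigmoid(dot(w, x)) > 0.5 ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        // Toy data: feature 0 appears only in spam, feature 1 only in ham.
        double[][] xs = {{1.0, 0.0}, {0.0, 1.0}};
        double[] ys = {1.0, 0.0};
        double[] w = train(xs, ys, 100, 1.0);
        System.out.println("spam-like example: " + predict(w, new double[]{1.0, 0.0}));
        System.out.println("ham-like example: " + predict(w, new double[]{0.0, 1.0}));
    }
}
```

Like model.predict in the programs above, the sketch returns a class label (1.0 or 0.0) rather than a probability.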

Run results
