Hadoop學習記錄(三、MapReduce)
阿新 • • 發佈:2018-12-06
1.將一個日誌檔案上傳到hdfs上
2. 編寫mapReduce程式碼
2.1新建一個maven專案,新增依賴
<dependencies> <dependency> <groupId>org.apache.hadoop</groupId> <artifactId>hadoop-core</artifactId> <version>1.2.1</version> </dependency> </dependencies>
2.2編寫HotSearch類
import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.IntWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import java.io.IOException; /** * mapReduce功能演示:《我是歌手》熱搜榜 * * @author lrn * @createTime : 2018/11/30 19:03 */ public class HotSearch { public static class HotSearchMap extends Mapper<Object, Text, Text, IntWritable> { @Override protected void map(Object key, Text value, Context context) throws IOException, InterruptedException { // 文中的一行資料 String currentLine = value.toString(); // 如果當前行中出現歌手的名字,則對應歌手的統計數量+1 if (currentLine.contains("黃致列")) { context.write(new Text("黃致列"), new IntWritable(1)); } else if (currentLine.contains("李玟") || currentLine.contains("COCO")) { context.write(new Text("李玟"), new IntWritable(1)); } else if (currentLine.contains("張信哲")) { context.write(new Text("張信哲"), new IntWritable(1)); } else if (currentLine.contains("趙傳")) { context.write(new Text("趙傳"), new IntWritable(1)); } else if (currentLine.contains("老狼")) { context.write(new Text("老狼"), new IntWritable(1)); } } } public static class HotSearchReduce extends Reducer<Text, IntWritable, Text, IntWritable> { @Override protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException { int count = 0; // 對map方法中輸出的統計資料進行彙總 for (IntWritable intWritable : values) { count += intWritable.get(); } // 輸出該reduce的彙總資料 context.write(key, new IntWritable(count)); } } public static void main(String[] args) throws Exception { // 取得一個任務物件 Job job = Job.getInstance(); job.setJarByClass(HotSearch.class); job.setMapperClass(HotSearchMap.class); job.setReducerClass(HotSearchReduce.class); job.setOutputKeyClass(Text.class); 
job.setOutputValueClass(IntWritable.class); // 設定任務的輸入檔案或路徑 FileInputFormat.addInputPath(job, new Path(args[0])); // 設定任務的輸出路徑 FileOutputFormat.setOutputPath(job, new Path(args[1])); // 啟動任務 job.waitForCompletion(true); } }
2.3打包
依次執行 mvn clean、mvn install、mvn package,打成 jar 包
3.hdfs執行
3.1將jar包傳到Linux上
3.2啟動hdfs
在sbin目錄下執行
./start-dfs.sh
3.3啟動yarn
./start-yarn.sh
3.4執行mapReduce
./hadoop jar /tmp/mapReduce-1.0-SNAPSHOT.jar HotSearch /input/IAMSinger.txt /output2
命令解讀:./hadoop jar + jar 包在 Linux 上的路徑 + jar 包 main 方法所在類(全限定類名) + hdfs 上的待分析檔案路徑 + hdfs 分析結果輸出路徑
File Output Format對應的Bytes若為0,則表示無輸出內容
3.5檢視分析結果
./hdfs dfs -cat /output2/part-r-00000