
Hadoop's MapReduce Process: WordCount

Word count is one of the simplest programs and one of the clearest illustrations of the MapReduce idea; it can fairly be called the "Hello World" of MapReduce. Its complete source code ships with Hadoop under the src/example directory. The program does one thing: count how many times each word occurs across a set of text files.

The WordCount process
1) The input is divided into splits. Because the test files are small, each file forms a single split, and each split is broken into lines to form <key, value> pairs: the key is the byte offset at which the line starts within the file, and the value is the line's contents. This step is performed automatically by the MapReduce framework. Note that the offsets account for the bytes taken by line terminators, which differ between Windows (\r\n) and Linux (\n).
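The offset bookkeeping in step 1 can be sketched in plain Python (a simplified stand-in for the framework's line reader, not Hadoop code; the sample input below uses the two test files created later):

```python
def lines_with_offsets(data: bytes):
    """Pair each line with the byte offset at which it starts.

    Line-terminator bytes ("\\n" on Linux, "\\r\\n" on Windows) count
    toward the offset of the *next* line, which is why the key values
    differ between the two platforms.
    """
    pairs = []
    offset = 0
    for line in data.splitlines(keepends=True):
        pairs.append((offset, line.rstrip(b"\r\n").decode()))
        offset += len(line)
    return pairs

print(lines_with_offsets(b"hello world\nhello hadoop\n"))
# [(0, 'hello world'), (12, 'hello hadoop')]
```

With Windows \r\n terminators the second line's key would be 13 rather than 12.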


2) The <key, value> pairs from step 1 are handed to the user-defined map method, which emits new <key, value> pairs: one <word, 1> pair for each word in the line.
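In Python terms, the map step amounts to the following (a sketch of the logic, not the example's Java TokenizerMapper itself):

```python
def wordcount_map(offset, line):
    # The key (the byte offset) is ignored; emit (word, 1) for every
    # whitespace-separated token in the line.
    return [(word, 1) for word in line.split()]

print(wordcount_map(0, "hello world"))   # [('hello', 1), ('world', 1)]
print(wordcount_map(0, "hello hadoop"))  # [('hello', 1), ('hadoop', 1)]
```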
3) After collecting the map method's output, the Mapper sorts the <key, value> pairs by key and runs the Combine step, summing the values that share a key. This yields the Mapper's final local output.
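Sorting plus combining can be modelled as below. (This is only the arithmetic; the real combiner runs per spill over serialized intermediate records.)

```python
from collections import defaultdict

def sort_and_combine(mapped):
    # Sum the values of pairs sharing a key, then emit the totals
    # sorted by key -- the Mapper-side Combine step.
    totals = defaultdict(int)
    for word, count in mapped:
        totals[word] += count
    return sorted(totals.items())

print(sort_and_combine([("hello", 1), ("world", 1), ("hello", 1)]))
# [('hello', 2), ('world', 1)]
```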
4) The Reducer first sorts the data it receives from the Mappers, then hands each key and its list of values to the user-defined reduce method, which produces the final <key, value> pairs: WordCount's output.
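Putting the four steps together, a minimal end-to-end simulation (plain Python, treating each input string as one map task's input) reproduces the result that the cluster run below writes to part-r-00000:

```python
from collections import defaultdict

def wordcount(files):
    # Map: emit (word, 1) per token, one "map task" per file.
    mapped = [(word, 1)
              for content in files
              for line in content.splitlines()
              for word in line.split()]
    # Shuffle: group the values by key.
    groups = defaultdict(list)
    for word, one in mapped:
        groups[word].append(one)
    # Reduce: sum each group, output sorted by key.
    return [(word, sum(vals)) for word, vals in sorted(groups.items())]

print(wordcount(["hello world", "hello hadoop"]))
# [('hadoop', 1), ('hello', 2), ('world', 1)]
```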

Example:
1. Create an input directory on HDFS

[root@master] ~$ hadoop fs -mkdir /input

2. Create the local files file1.txt and file2.txt

[root@master] ~$ echo "hello world" > file1.txt
[root@master] ~$ echo "hello hadoop" > file2.txt

3. Upload the input files to HDFS

[root@master] ~$ hadoop fs -put file*.txt /input

4. Run the example jar (do not create the /output directory in advance; the job creates it and fails if it already exists)

[root@master] /usr/local/hadoop/sbin$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /input /output
18/11/11 21:48:52 INFO client.RMProxy: Connecting to ResourceManager at master.hanli.com/192.168.255.130:8032
18/11/11 21:48:53 INFO input.FileInputFormat: Total input paths to process : 2
18/11/11 21:48:53 INFO mapreduce.JobSubmitter: number of splits:2
18/11/11 21:48:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541944054972_0001
18/11/11 21:48:54 INFO impl.YarnClientImpl: Submitted application application_1541944054972_0001
18/11/11 21:48:55 INFO mapreduce.Job: The url to track the job: http://master.hanli.com:8088/proxy/application_1541944054972_0001/
18/11/11 21:48:55 INFO mapreduce.Job: Running job: job_1541944054972_0001
18/11/11 21:49:07 INFO mapreduce.Job: Job job_1541944054972_0001 running in uber mode : false
18/11/11 21:49:07 INFO mapreduce.Job:  map 0% reduce 0%
18/11/11 21:49:20 INFO mapreduce.Job:  map 100% reduce 0%
18/11/11 21:49:27 INFO mapreduce.Job:  map 100% reduce 100%
18/11/11 21:49:28 INFO mapreduce.Job: Job job_1541944054972_0001 completed successfully
18/11/11 21:49:28 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=55
		FILE: Number of bytes written=368179
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=243
		HDFS: Number of bytes written=25
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2
		Launched reduce tasks=1
		Data-local map tasks=2
		Total time spent by all maps in occupied slots (ms)=21571
		Total time spent by all reduces in occupied slots (ms)=4843
		Total time spent by all map tasks (ms)=21571
		Total time spent by all reduce tasks (ms)=4843
		Total vcore-milliseconds taken by all map tasks=21571
		Total vcore-milliseconds taken by all reduce tasks=4843
		Total megabyte-milliseconds taken by all map tasks=22088704
		Total megabyte-milliseconds taken by all reduce tasks=4959232
	Map-Reduce Framework
		Map input records=2
		Map output records=4
		Map output bytes=41
		Map output materialized bytes=61
		Input split bytes=218
		Combine input records=4
		Combine output records=4
		Reduce input groups=3
		Reduce shuffle bytes=61
		Reduce input records=4
		Reduce output records=3
		Spilled Records=8
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=370
		CPU time spent (ms)=1780
		Physical memory (bytes) snapshot=492568576
		Virtual memory (bytes) snapshot=6234935296
		Total committed heap usage (bytes)=264638464
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=25
	File Output Format Counters 
		Bytes Written=25

5. View the result


[root@master] /usr/local/hadoop/sbin$ hadoop fs -ls /output
Found 2 items
-rw-r--r--   3 root supergroup          0 2018-11-11 21:49 /output/_SUCCESS
-rw-r--r--   3 root supergroup         25 2018-11-11 21:49 /output/part-r-00000

[root@master] /usr/local/hadoop/sbin$ hadoop fs -cat /output/part-r-00000
hadoop	1
hello	2
world	1

Troubleshooting

Symptom: the job submission hangs with no progress after "Running job".
Check the logs:
NodeManager log on a worker node: /usr/local/hadoop/logs/yarn-root-nodemanager-slave1.hanli.com.log
ResourceManager log on master: /usr/local/hadoop/logs/yarn-root-resourcemanager-master.hanli.com.log

Reference articles: https://www.codetd.com/article/132972 and https://www.cnblogs.com/xiangyangzhu/p/5711549.html

[root@master] /usr/local/hadoop/sbin$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /input /output
18/11/11 21:26:50 INFO client.RMProxy: Connecting to ResourceManager at /192.168.255.130:8032
18/11/11 21:26:52 INFO input.FileInputFormat: Total input paths to process : 2
18/11/11 21:26:53 INFO mapreduce.JobSubmitter: number of splits:2
18/11/11 21:26:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541942707552_0001
18/11/11 21:26:53 INFO impl.YarnClientImpl: Submitted application application_1541942707552_0001
18/11/11 21:26:53 INFO mapreduce.Job: The url to track the job: http://master.hanli.com:8088/proxy/application_1541942707552_0001/
18/11/11 21:26:53 INFO mapreduce.Job: Running job: job_1541942707552_0001

The fix: adding the following properties to yarn-site.xml resolved the problem. (Note that in the hung run above the client connects to the ResourceManager at the bare address /192.168.255.130:8032, whereas the successful run shows the resolved hostname master.hanli.com/192.168.255.130:8032.)

<property>
  <name>yarn.resourcemanager.address</name>
  <value>master:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>master:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:8031</value>
</property>

(Note: I changed yarn-site.xml on master, slave1, and slave2. Whether changing only master would suffice is unclear, but to be safe I modified all of them.)
Some readers point out that these port numbers match the defaults; the difference is likely that setting the addresses explicitly makes clients and NodeManagers connect to the master host rather than the fallback 0.0.0.0. In any case, the problem was solved.