The Hadoop MapReduce process: counting words with WordCount
Word counting is one of the simplest programs, and one that best illustrates the MapReduce idea; it can be called the "Hello World" of MapReduce. The complete source code for the program can be found in the src/example directory of the Hadoop distribution. WordCount's job is to count the number of occurrences of each word across a set of text files, as shown in the figure below.
The WordCount process
1) The input files are divided into splits. Since the test files are small, each file forms a single split, and each split is divided by line into <key, value> pairs, as shown in the figure. This step is performed automatically by the MapReduce framework; note that the offset (i.e. the key) includes the bytes occupied by the line terminator (which differs between Windows and Linux).
2) The <key, value> pairs are handed to the user-defined map method, which produces new <key, value> pairs, as shown in the figure.
3) After obtaining the map method's output <key, value> pairs, the Mapper sorts them by key and runs the Combine step, summing the values of identical keys to produce the Mapper's final output, as shown in the figure.
4) The Reducer first sorts the data received from the Mappers, then hands it to the user-defined reduce method, which produces new <key, value> pairs that become WordCount's output, as shown in the figure.
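The four steps above can be sketched as a minimal local simulation in plain Python (no Hadoop involved; the sample data and the per-mapper Combine behavior follow the walkthrough above):

```python
from itertools import groupby
from operator import itemgetter

# Step 1: split each file by line into <offset, line> pairs. The byte offset
# (the key) includes the line terminator, so newline_len=1 models Unix "\n"
# and newline_len=2 models Windows "\r\n" -- which is why the keys differ
# between environments.
def make_records(lines, newline_len=1):
    records, offset = [], 0
    for line in lines:
        records.append((offset, line))
        offset += len(line) + newline_len
    return records

# Step 2: the map method emits a <word, 1> pair for every word in a line.
def map_fn(offset, line):
    return [(word, 1) for word in line.split()]

# Step 3: sort one mapper's output by key and run the Combine step, summing
# the values of identical keys within that single mapper.
def combine(pairs):
    pairs.sort(key=itemgetter(0))
    return [(key, sum(v for _, v in grp))
            for key, grp in groupby(pairs, key=itemgetter(0))]

# Step 4: the reducer merges the sorted outputs of all mappers and sums again.
def reduce_all(mapper_outputs):
    merged = sorted(pair for out in mapper_outputs for pair in out)
    return {key: sum(v for _, v in grp)
            for key, grp in groupby(merged, key=itemgetter(0))}

# Two files -> two splits, one (simulated) mapper each, as in the job below.
splits = [make_records(["hello world"]), make_records(["hello hadoop"])]
mapper_outputs = [combine([p for off, line in split for p in map_fn(off, line)])
                  for split in splits]
result = reduce_all(mapper_outputs)
print(result)  # {'hadoop': 1, 'hello': 2, 'world': 1}
```

In the real framework the shuffle delivers each key's values to the reducer already grouped; the dictionary comprehension in `reduce_all` plays that role here.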
Example:
1. Create the HDFS input directory
[[email protected]] ~$ hadoop fs -mkdir /input
2. Create the local files file1.txt and file2.txt
[[email protected]] ~$ echo "hello world" > file1.txt
[[email protected]] ~$ echo "hello hadoop" > file2.txt
3. Upload the input files to HDFS
[[email protected]] ~$ hadoop fs -put file*.txt /input
4. Run the example jar (the output directory does not need to be created in advance)
[[email protected]] /usr/local/hadoop/sbin$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /input /output
18/11/11 21:48:52 INFO client.RMProxy: Connecting to ResourceManager at master.hanli.com/192.168.255.130:8032
18/11/11 21:48:53 INFO input.FileInputFormat: Total input paths to process : 2
18/11/11 21:48:53 INFO mapreduce.JobSubmitter: number of splits:2
18/11/11 21:48:54 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541944054972_0001
18/11/11 21:48:54 INFO impl.YarnClientImpl: Submitted application application_1541944054972_0001
18/11/11 21:48:55 INFO mapreduce.Job: The url to track the job: http://master.hanli.com:8088/proxy/application_1541944054972_0001/
18/11/11 21:48:55 INFO mapreduce.Job: Running job: job_1541944054972_0001
18/11/11 21:49:07 INFO mapreduce.Job: Job job_1541944054972_0001 running in uber mode : false
18/11/11 21:49:07 INFO mapreduce.Job: map 0% reduce 0%
18/11/11 21:49:20 INFO mapreduce.Job: map 100% reduce 0%
18/11/11 21:49:27 INFO mapreduce.Job: map 100% reduce 100%
18/11/11 21:49:28 INFO mapreduce.Job: Job job_1541944054972_0001 completed successfully
18/11/11 21:49:28 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=55
FILE: Number of bytes written=368179
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=243
HDFS: Number of bytes written=25
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=21571
Total time spent by all reduces in occupied slots (ms)=4843
Total time spent by all map tasks (ms)=21571
Total time spent by all reduce tasks (ms)=4843
Total vcore-milliseconds taken by all map tasks=21571
Total vcore-milliseconds taken by all reduce tasks=4843
Total megabyte-milliseconds taken by all map tasks=22088704
Total megabyte-milliseconds taken by all reduce tasks=4959232
Map-Reduce Framework
Map input records=2
Map output records=4
Map output bytes=41
Map output materialized bytes=61
Input split bytes=218
Combine input records=4
Combine output records=4
Reduce input groups=3
Reduce shuffle bytes=61
Reduce input records=4
Reduce output records=3
Spilled Records=8
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=370
CPU time spent (ms)=1780
Physical memory (bytes) snapshot=492568576
Virtual memory (bytes) snapshot=6234935296
Total committed heap usage (bytes)=264638464
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=25
File Output Format Counters
Bytes Written=25
5. View the results
[[email protected]] /usr/local/hadoop/sbin$ hadoop fs -ls /output
Found 2 items
-rw-r--r-- 3 root supergroup 0 2018-11-11 21:49 /output/_SUCCESS
-rw-r--r-- 3 root supergroup 25 2018-11-11 21:49 /output/part-r-00000
[[email protected]] /usr/local/hadoop/sbin$ hadoop fs -cat /output/part-r-00000
hadoop 1
hello 2
world 1
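As a sanity check, the same counts can be reproduced locally with a few lines of Python (the input strings are inlined here to keep the snippet self-contained; they match the files created in step 2):

```python
from collections import Counter

# The contents of file1.txt and file2.txt from step 2.
lines = ["hello world", "hello hadoop"]

counts = Counter()
for line in lines:
    counts.update(line.split())

# Print in key order, matching part-r-00000.
for word in sorted(counts):
    print(word, counts[word])
```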
Troubleshooting an error
The job hung and made no progress after submission.
Check the logs:
NodeManager log on the worker node: /usr/local/hadoop/logs/yarn-root-nodemanager-slave1.hanli.com.log
ResourceManager (scheduler) log on the master: /usr/local/hadoop/logs/yarn-root-resourcemanager-master.hanli.com.log
Reference articles: https://www.codetd.com/article/132972, https://www.cnblogs.com/xiangyangzhu/p/5711549.html
[[email protected]] /usr/local/hadoop/sbin$ hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.6.jar wordcount /input /output
18/11/11 21:26:50 INFO client.RMProxy: Connecting to ResourceManager at /192.168.255.130:8032
18/11/11 21:26:52 INFO input.FileInputFormat: Total input paths to process : 2
18/11/11 21:26:53 INFO mapreduce.JobSubmitter: number of splits:2
18/11/11 21:26:53 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1541942707552_0001
18/11/11 21:26:53 INFO impl.YarnClientImpl: Submitted application application_1541942707552_0001
18/11/11 21:26:53 INFO mapreduce.Job: The url to track the job: http://master.hanli.com:8088/proxy/application_1541942707552_0001/
18/11/11 21:26:53 INFO mapreduce.Job: Running job: job_1541942707552_0001
Final fix: the problem was resolved after adding the following to yarn-site.xml:
<property>
  <name>yarn.resourcemanager.address</name>
  <value>master:8032</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>master:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>master:8031</value>
</property>
(Note: I made this change on master, slave1, and slave2. Whether changing master alone would be enough, I am not sure, but my preliminary judgment is that it should be changed on all of them.)
Some people have also mentioned that the configuration above is the default, which I cannot confirm; in any case, the problem was solved.
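To confirm the fix is actually present on a node, a quick sketch like the following parses a yarn-site.xml-style file and lists the ResourceManager addresses it defines. The inline sample stands in for the real file, which would normally live at a path like /usr/local/hadoop/etc/hadoop/yarn-site.xml:

```python
import xml.etree.ElementTree as ET

# Inline stand-in for the cluster's real yarn-site.xml.
sample = """<configuration>
  <property><name>yarn.resourcemanager.address</name><value>master:8032</value></property>
  <property><name>yarn.resourcemanager.scheduler.address</name><value>master:8030</value></property>
  <property><name>yarn.resourcemanager.resource-tracker.address</name><value>master:8031</value></property>
</configuration>"""

root = ET.fromstring(sample)
# Collect every yarn.resourcemanager.* property into a name -> value dict.
rm_props = {p.findtext("name"): p.findtext("value")
            for p in root.iter("property")
            if p.findtext("name", "").startswith("yarn.resourcemanager")}
print(rm_props)
```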