Hadoop MapReduce Development in Practice: Distributing HDFS Files with Streaming
阿新 • Published: 2018-01-27
1. Distributing an HDFS file (-cacheFile)
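Requirement: a wordcount that counts only a specified set of words. The file listing those words is very large, so it can be uploaded to HDFS first and then distributed with -cacheFile.
-cacheFile hdfs://host:port/path/to/file#linkname # This option caches the file on the compute nodes; the streaming program accesses it as ./linkname.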
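The `#linkname` suffix is an ordinary URI fragment. As a small illustration (not part of the original job), the sketch below splits the URI used later in section 1.3 into the HDFS path and the link name that the task sees in its working directory. The variable names are chosen for this example only.

```python
from urllib.parse import urlsplit

# URI taken from the run script in section 1.3
uri = "hdfs://localhost:9000/input/cachefile/wordwhite#WHF"

parts = urlsplit(uri)
namenode = parts.netloc      # host:port of the NameNode
hdfs_path = parts.path       # where the file lives on HDFS
link_name = parts.fragment   # name of the symlink in the task's working directory

print("NameNode :", namenode)
print("HDFS path:", hdfs_path)
print("Task sees: ./" + link_name)
```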
Approach: the mapper and reducer programs need no changes. Only the streaming invocation changes: it uses -cacheFile to point at the file on HDFS.
1.1 Streaming command format
$HADOOP_HOME/bin/hadoop jar hadoop-streaming.jar \
    -jobconf mapred.job.name="streaming_wordcount" \
    -jobconf mapred.job.priority=3 \
    -input /input/ \
    -output /output/ \
    -mapper "python mapper.py whc" \
    -reducer "python reducer.py" \
    -cacheFile "hdfs://master:9000/cache_file/wordwhite#whc" \
    -file ./mapper.py \
    -file ./reducer.py
Note: in -cacheFile "hdfs://master:9000/cache_file/wordwhite#whc", whc is the alias (link name) for the file on HDFS. In -mapper "python mapper.py whc" it is then used exactly like a local file.
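The original mapper.py lives in the local-distribution post and is not reproduced here. As a minimal sketch of what such a mapper could look like, assuming it takes the link name as its first argument, splits lines on whitespace, and emits `word<TAB>1` for each whitelisted word (the helper names `load_whitelist` and `map_lines` are hypothetical):

```python
#!/usr/bin/env python
# Hypothetical sketch of a whitelist mapper; the author's real mapper.py may differ.
import sys


def load_whitelist(path):
    """Read one word per line from the cached file (e.g. ./whc or ./WHF)."""
    with open(path) as handle:
        return set(line.strip() for line in handle if line.strip())


def map_lines(lines, whitelist):
    """Yield 'word<TAB>1' for every whitelisted word found in the input lines."""
    for line in lines:
        for word in line.strip().split():
            if word in whitelist:
                yield "%s\t1" % word


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.stderr.write("usage: mapper.py <whitelist_file>\n")
    else:
        # argv[1] is the link name given after '#' in -cacheFile
        allowed = load_whitelist(sys.argv[1])
        for record in map_lines(sys.stdin, allowed):
            sys.stdout.write(record + "\n")
```

Because the cached file appears as a symlink in the task's working directory, the mapper opens it with a plain relative path and needs no HDFS client code.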
1.2 Upload wordwhite
$ hadoop fs -mkdir /input/cachefile
$ hadoop fs -put wordwhite /input/cachefile
$ hadoop fs -ls /input/cachefile
Found 1 items
-rw-r--r-- 1 hadoop supergroup 12 2018-01-26 15:02 /input/cachefile/wordwhite
$ hadoop fs -text hdfs://localhost:9000/input/cachefile/wordwhite
the
and
had
1.3 The run_streaming script
For the mapper and reducer programs, see the local file distribution example.
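For readers without that earlier post at hand, a reducer for this job could be sketched as follows. This is an assumption about its shape, not the author's code: it expects `word<TAB>count` lines sorted by word, as the shuffle phase delivers them, and the function name `reduce_pairs` is hypothetical.

```python
#!/usr/bin/env python
# Hypothetical sketch of a summing reducer; the author's real reducer.py may differ.
import sys
from itertools import groupby


def reduce_pairs(lines):
    """Sum the counts for each word; input must already be sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines if line.strip())
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


if __name__ == "__main__":
    for word, total in reduce_pairs(sys.stdin):
        sys.stdout.write("%s\t%d\n" % (word, total))
```

Grouping consecutive keys keeps memory use constant per word, which relies on the framework's sort guarantee rather than holding a dictionary of every word.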
$ vim runstreaming_cachefile.sh
#!/bin/bash

HADOOP_CMD="/home/hadoop/app/hadoop/hadoop-2.6.0-cdh5.13.0/bin/hadoop"
STREAM_JAR_PATH="/home/hadoop/app/hadoop/hadoop-2.6.0-cdh5.13.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.13.0.jar"

INPUT_FILE_PATH="/input/The_Man_of_Property"
OUTPUT_FILE_PATH="/output/wordcount/wordwhitecachefiletest"

$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_FILE_PATH \
    -output $OUTPUT_FILE_PATH \
    -jobconf "mapred.job.name=wordcount_wordwhite_cachefile_demo" \
    -mapper "python mapper.py WHF" \
    -reducer "python reducer.py" \
    -cacheFile "hdfs://localhost:9000/input/cachefile/wordwhite#WHF" \
    -file ./mapper.py \
    -file ./reducer.py
1.4 Run the job
$ ./runstreaming_cachefile.sh
18/01/26 15:38:27 WARN streaming.StreamJob: -file option is deprecated, please use generic option -files instead.
18/01/26 15:38:28 WARN streaming.StreamJob: -cacheFile option is deprecated, please use -files instead.
18/01/26 15:38:28 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
18/01/26 15:38:28 INFO Configuration.deprecation: mapred.job.name is deprecated. Instead, use mapreduce.job.name
packageJobJar: [./mapper.py, ./reducer.py, /tmp/hadoop-unjar1709565523181962236/] [] /tmp/streamjob6164905989972408041.jar tmpDir=null
18/01/26 15:38:29 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/01/26 15:38:29 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
18/01/26 15:38:31 INFO mapred.FileInputFormat: Total input paths to process : 1
18/01/26 15:38:31 INFO mapreduce.JobSubmitter: number of splits:2
18/01/26 15:38:32 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1516345010544_0012
18/01/26 15:38:32 INFO impl.YarnClientImpl: Submitted application application_1516345010544_0012
18/01/26 15:38:32 INFO mapreduce.Job: The url to track the job: http://localhost:8088/proxy/application_1516345010544_0012/
18/01/26 15:38:32 INFO mapreduce.Job: Running job: job_1516345010544_0012
18/01/26 15:38:40 INFO mapreduce.Job: Job job_1516345010544_0012 running in uber mode : false
18/01/26 15:38:40 INFO mapreduce.Job: map 0% reduce 0%
18/01/26 15:38:49 INFO mapreduce.Job: map 50% reduce 0%
18/01/26 15:38:50 INFO mapreduce.Job: map 100% reduce 0%
18/01/26 15:38:57 INFO mapreduce.Job: map 100% reduce 100%
18/01/26 15:38:57 INFO mapreduce.Job: Job job_1516345010544_0012 completed successfully
18/01/26 15:38:57 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=73950
FILE: Number of bytes written=582590
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=636501
HDFS: Number of bytes written=27
HDFS: Number of read operations=9
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=2
Launched reduce tasks=1
Data-local map tasks=2
Total time spent by all maps in occupied slots (ms)=12921
Total time spent by all reduces in occupied slots (ms)=5641
Total time spent by all map tasks (ms)=12921
Total time spent by all reduce tasks (ms)=5641
Total vcore-milliseconds taken by all map tasks=12921
Total vcore-milliseconds taken by all reduce tasks=5641
Total megabyte-milliseconds taken by all map tasks=13231104
Total megabyte-milliseconds taken by all reduce tasks=5776384
Map-Reduce Framework
Map input records=2866
Map output records=9243
Map output bytes=55458
Map output materialized bytes=73956
Input split bytes=198
Combine input records=0
Combine output records=0
Reduce input groups=3
Reduce shuffle bytes=73956
Reduce input records=9243
Reduce output records=3
Spilled Records=18486
Shuffled Maps =2
Failed Shuffles=0
Merged Map outputs=2
GC time elapsed (ms)=360
CPU time spent (ms)=3910
Physical memory (bytes) snapshot=719896576
Virtual memory (bytes) snapshot=8331550720
Total committed heap usage (bytes)=602931200
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=636303
File Output Format Counters
Bytes Written=27
18/01/26 15:38:57 INFO streaming.StreamJob: Output directory: /output/wordcount/wordwhitecachefiletest
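The first four lines of the log warn that -file, -cacheFile, -jobconf and mapred.job.name are deprecated in favour of -files, -D and mapreduce.job.name. The sketch below assembles the same job with those replacements. It is an untested assumption based only on the warnings: the helper `build_streaming_command` is hypothetical, and the generic options (-D, -files) are placed before the streaming options because Hadoop expects them first.

```python
import shlex


def build_streaming_command(hadoop_cmd, streaming_jar, input_path, output_path,
                            cache_uri, link_name, job_name):
    """Return the argv list for the streaming job using non-deprecated options."""
    return [
        hadoop_cmd, "jar", streaming_jar,
        # Generic options go first.
        "-D", "mapreduce.job.name=%s" % job_name,
        # One comma-separated list covers the HDFS file (with its #link) and local scripts.
        "-files", "%s#%s,mapper.py,reducer.py" % (cache_uri, link_name),
        # Streaming options follow.
        "-input", input_path,
        "-output", output_path,
        "-mapper", "python mapper.py %s" % link_name,
        "-reducer", "python reducer.py",
    ]


command = build_streaming_command(
    hadoop_cmd="/home/hadoop/app/hadoop/hadoop-2.6.0-cdh5.13.0/bin/hadoop",
    streaming_jar="/home/hadoop/app/hadoop/hadoop-2.6.0-cdh5.13.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0-cdh5.13.0.jar",
    input_path="/input/The_Man_of_Property",
    output_path="/output/wordcount/wordwhitecachefiletest",
    cache_uri="hdfs://localhost:9000/input/cachefile/wordwhite",
    link_name="WHF",
    job_name="wordcount_wordwhite_cachefile_demo",
)

# Print a shell-quoted version that could be pasted into a script.
print(" ".join(shlex.quote(arg) for arg in command))
```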
1.5 Check the results
$ hadoop fs -ls /output/wordcount/wordwhitecachefiletest
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2018-01-26 15:38 /output/wordcount/wordwhitecachefiletest/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 27 2018-01-26 15:38 /output/wordcount/wordwhitecachefiletest/part-00000
$ hadoop fs -text /output/wordcount/wordwhitecachefiletest/part-00000
and 2573
had 1526
the 5144
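As a quick sanity check, the three counts can be compared with the counters printed in the job log. The numbers below are copied from the log and the result above; the reading of the byte counter as a fixed size per record is an interpretation, not something the log states.

```python
# Values copied from the result file and the job counters
word_counts = {"and": 2573, "had": 1526, "the": 5144}
map_output_records = 9243
reduce_input_records = 9243
map_output_bytes = 55458
bytes_written = 27

total = sum(word_counts.values())

# Every mapper record is one whitelisted word occurrence, so the totals should agree.
print("total occurrences:", total)
print("matches map output records:", total == map_output_records)
print("matches reduce input records:", total == reduce_input_records)

# Average size of one map output record (a three-letter key plus the value "1").
print("bytes per map record:", map_output_bytes / total)

# Size of the result file if each line is 'word<TAB>count<NEWLINE>'.
result_size = sum(len("%s\t%d\n" % item) for item in word_counts.items())
print("matches bytes written:", result_size == bytes_written)
```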
That completes distributing a file stored on HDFS and running a wordcount restricted to the specified words.
2. Hadoop streaming syntax reference
- http://blog.51cto.com/balich/2065419