Write a test case that checks whether a TXT file and a gzip file can each be processed in parallel.
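1) Prepare a TXT file and a gzip file and put them on HDFS; each must be larger than one block.
2) Write a Hive SQL query that runs some computation over the tables backed by the two file formats, and compare the number of map tasks.
3) Draw a conclusion.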
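TXT file test:
To keep the original TXT file when compressing it to gzip, redirect the output: gzip -c input.txt > input.txt.gz creates the gzip file and leaves the TXT file in place.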
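A minimal sketch of compressing a file while keeping the original. The file name input.txt, its one sample row, and the temporary directory are placeholders for illustration. Because gzip -c writes the compressed stream to standard output, it has to be redirected into a file.

```shell
#!/bin/sh
# Sketch: compress a file to .gz while keeping the original.
# input.txt and the temporary directory are placeholders, not from the notes.
set -eu

workdir=$(mktemp -d)
printf '1001\tC01\t90\t2018-11-28\n' > "$workdir/input.txt"

# -c sends the compressed data to stdout and leaves the source file untouched,
# so the output must be redirected to the .gz file.
gzip -c "$workdir/input.txt" > "$workdir/input.txt.gz"

# Both the original and the compressed copy should now be listed.
ls "$workdir"
```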
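cp test.txt test2.txt    # make a copy of the file first
cat test2.txt >> test.txt    # append the copy; repeat until the file is larger than 128 MB
du -sh *    # check the file sizes
hdfs dfs -copyFromLocal test.txt /tmp/niuniu/test.txt    # upload to HDFS
hdfs dfs -du -h /tmp/niuniu/test.txt    # check the file size on HDFS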
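The repeated copy-and-append step can be scripted. The sketch below is one way to do it; grow_file is a hypothetical helper name, and the demonstration uses a tiny 1024-byte target so it runs quickly. For the real test the target would be 134217728 bytes (128 MB), assuming that is the cluster's block size.

```shell
#!/bin/sh
# Sketch: keep doubling a file until it is larger than a target size.
# grow_file is a hypothetical helper, not part of the original notes.
set -eu

grow_file() {
    file=$1
    target_bytes=$2
    tmp="$file.tmp"
    # An empty file would never grow, so refuse it.
    [ -s "$file" ] || return 1
    # wc -c reports the size in bytes; each pass appends a full copy.
    while [ "$(( $(wc -c < "$file") ))" -le "$target_bytes" ]; do
        cp "$file" "$tmp"
        cat "$tmp" >> "$file"
    done
    rm -f "$tmp"
}

# Demonstration with a small target and a made-up sample row.
workdir=$(mktemp -d)
printf '1001\tC01\t90\t2018-11-28\n' > "$workdir/test.txt"
grow_file "$workdir/test.txt" 1024
wc -c < "$workdir/test.txt"
```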
create table practice_score (
stdno string,
courseNo string,
score int,
opDate string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';    -- create the table
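load data inpath '/tmp/niuniu/test.txt' into table practice_score;    -- load the TXT file from HDFS into the table
(This actually moves the file into the table's location, which show create table practice_score displays. The first batch here is loaded with load data, so the table already holds data; after that, more data can be added by putting files straight into the location with hdfs, even if the file format differs.)
hdfs dfs -put test.txt <table location>    # the location path shown by show create table (the part from apps onward)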
select stdno,count(1)
from practice_score
group by stdno;    -- a select statement that triggers a MapReduce job
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 4
2018-11-28 00:07:15,242 Stage-1 map = 0%, reduce = 0%
2018-11-28 00:07:26,875 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.03 sec
2018-11-28 00:07:34,245 Stage-1 map = 100%, reduce = 75%, Cumulative CPU 13.62 sec
2018-11-28 00:07:36,347 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 15.1 sec
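Since the test compares mapper counts across runs, it can help to pull that number out of the job log automatically. A minimal sketch, where mapper_count is a hypothetical helper and the sample line is copied from the log above:

```shell
#!/bin/sh
# Sketch: extract the mapper count from a Hive job-information line.
# mapper_count is a hypothetical helper name.
set -eu

mapper_count() {
    # Reads log text on stdin and prints the number after "number of mappers:".
    sed -n 's/.*number of mappers: \([0-9][0-9]*\).*/\1/p'
}

line='Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 4'
printf '%s\n' "$line" | mapper_count
```

Assuming the Hive CLI prints the job information to the console, the same filter could be applied to the output of a query run with hive -e, with standard error redirected into the pipe.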
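CSV file test:
cp test.csv test2.csv    # make a copy of the file first
cat test2.csv >> test.csv    # append the copy; repeat until the file is larger than 128 MB
du -sh *    # check the file sizes
hdfs dfs -copyFromLocal test.csv /tmp/niuniu/test.csv    # upload to HDFS
hdfs dfs -du -h /tmp/niuniu/test.csv    # check the file size on HDFS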
create table practice_score2 (
stdno string,
courseNo string,
score int,
opDate string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';    -- create the table
load data inpath '/tmp/niuniu/test.csv' into table practice_score2;    -- load the CSV file from HDFS into the table
select stdno,count(1)
from practice_score2
group by stdno;    -- a select statement that triggers a MapReduce job
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 3
2018-11-28 10:33:33,800 Stage-1 map = 0%, reduce = 0%
2018-11-28 10:33:45,500 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 3.59 sec
2018-11-28 10:33:46,544 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 8.62 sec
2018-11-28 10:33:52,837 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 10.12 sec
2018-11-28 10:33:53,877 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 11.65 sec
2018-11-28 10:33:57,021 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 13.29 sec
gzip file test:
gzip test.csv
gzip test2.csv
cat test2.csv.gz>>test.csv.gz
du -h test.csv.gz
hdfs dfs -copyFromLocal test.csv.gz /tmp/niuniu/test.csv.gz
hdfs dfs -ls /tmp/niuniu/
du -h test.csv.gz
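Appending one .gz file to another, as done above, still yields a valid gzip file: gzip treats the result as a sequence of members and decompresses them one after the other. A small sketch that checks this locally, using a made-up one-row CSV in a temporary directory:

```shell
#!/bin/sh
# Sketch: confirm that concatenated .gz files still decompress cleanly.
# The sample row and the temporary directory are placeholders.
set -eu

workdir=$(mktemp -d)
printf '1001,C01,90,2018-11-28\n' > "$workdir/test.csv"
cp "$workdir/test.csv" "$workdir/test2.csv"

# gzip without -c replaces each file with its .gz version.
gzip "$workdir/test.csv"
gzip "$workdir/test2.csv"

# Append the second compressed file to the first.
cat "$workdir/test2.csv.gz" >> "$workdir/test.csv.gz"

# -t tests integrity and exits non-zero on a corrupt stream.
gzip -t "$workdir/test.csv.gz"

# Count the rows recovered from both members.
gzip -dc "$workdir/test.csv.gz" | wc -l
```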
create table practice_score3 (
stdno string,
courseNo string,
opDate string
)
ROW FORMAT DELIMITED
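FIELDS TERMINATED BY '\t';    -- note: the file is comma-separated and this table omits the score column, so the columns are not parsed as in practice_score2; the mapper count is unaffected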
load data inpath '/tmp/niuniu/test.csv.gz' into table practice_score3;
select stdno,count(1)
from practice_score3
group by stdno;
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2018-12-16 19:34:08,478 Stage-1 map = 0%, reduce = 0%
2018-12-16 19:34:28,306 Stage-1 map = 1%, reduce = 0%, Cumulative CPU 20.36 sec
2018-12-16 19:34:46,987 Stage-1 map = 2%, reduce = 0%, Cumulative CPU 39.52 sec
2018-12-16 19:35:07,703 Stage-1 map = 3%, reduce = 0%, Cumulative CPU 61.67 sec
2018-12-16 19:35:29,461 Stage-1 map = 4%, reduce = 0%, Cumulative CPU 83.87 sec
2018-12-16 19:35:50,155 Stage-1 map = 5%, reduce = 0%, Cumulative CPU 106.1 sec
2018-12-16 19:36:11,877 Stage-1 map = 6%, reduce = 0%, Cumulative CPU 128.24 sec
2018-12-16 19:36:29,492 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 147.23 sec
2018-12-16 19:36:50,198 Stage-1 map = 8%, reduce = 0%, Cumulative CPU 169.25 sec
2018-12-16 19:37:11,931 Stage-1 map = 9%, reduce = 0%, Cumulative CPU 191.29 sec
2018-12-16 19:37:32,627 Stage-1 map = 10%, reduce = 0%, Cumulative CPU 213.4 sec
2018-12-16 19:37:54,335 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 235.52 sec
2018-12-16 19:38:11,911 Stage-1 map = 12%, reduce = 0%, Cumulative CPU 254.43 sec
2018-12-16 19:38:33,651 Stage-1 map = 13%, reduce = 0%, Cumulative CPU 276.54 sec
2018-12-16 19:38:54,336 Stage-1 map = 14%, reduce = 0%, Cumulative CPU 298.66 sec
2018-12-16 19:39:15,004 Stage-1 map = 15%, reduce = 0%, Cumulative CPU 320.77 sec
2018-12-16 19:39:36,658 Stage-1 map = 16%, reduce = 0%, Cumulative CPU 342.83 sec
2018-12-16 19:39:54,180 Stage-1 map = 17%, reduce = 0%, Cumulative CPU 361.83 sec
2018-12-16 19:40:15,819 Stage-1 map = 18%, reduce = 0%, Cumulative CPU 383.99 sec
2018-12-16 19:40:36,431 Stage-1 map = 19%, reduce = 0%, Cumulative CPU 406.01 sec
2018-12-16 19:40:58,079 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 428.04 sec
2018-12-16 19:41:18,681 Stage-1 map = 21%, reduce = 0%, Cumulative CPU 450.2 sec
2018-12-16 19:41:40,328 Stage-1 map = 22%, reduce = 0%, Cumulative CPU 472.23 sec
2018-12-16 19:42:00,919 Stage-1 map = 23%, reduce = 0%, Cumulative CPU 494.25 sec
2018-12-16 19:42:22,539 Stage-1 map = 24%, reduce = 0%, Cumulative CPU 516.36 sec
2018-12-16 19:42:41,084 Stage-1 map = 25%, reduce = 0%, Cumulative CPU 535.33 sec
2018-12-16 19:43:01,691 Stage-1 map = 26%, reduce = 0%, Cumulative CPU 557.43 sec
2018-12-16 19:43:23,309 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 579.46 sec
2018-12-16 19:43:43,909 Stage-1 map = 28%, reduce = 0%, Cumulative CPU 601.55 sec
2018-12-16 19:44:05,527 Stage-1 map = 29%, reduce = 0%, Cumulative CPU 623.59 sec
2018-12-16 19:44:23,039 Stage-1 map = 30%, reduce = 0%, Cumulative CPU 642.58 sec
2018-12-16 19:44:44,697 Stage-1 map = 31%, reduce = 0%, Cumulative CPU 664.76 sec
2018-12-16 19:45:05,293 Stage-1 map = 32%, reduce = 0%, Cumulative CPU 686.87 sec
2018-12-16 19:45:26,910 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 708.99 sec
2018-12-16 19:45:47,530 Stage-1 map = 34%, reduce = 0%, Cumulative CPU 731.03 sec
2018-12-16 19:46:09,143 Stage-1 map = 35%, reduce = 0%, Cumulative CPU 753.15 sec
2018-12-16 19:46:29,726 Stage-1 map = 36%, reduce = 0%, Cumulative CPU 775.2 sec
2018-12-16 19:46:51,346 Stage-1 map = 37%, reduce = 0%, Cumulative CPU 797.28 sec
2018-12-16 19:47:08,848 Stage-1 map = 38%, reduce = 0%, Cumulative CPU 816.21 sec
2018-12-16 19:47:30,475 Stage-1 map = 39%, reduce = 0%, Cumulative CPU 838.32 sec
2018-12-16 19:47:51,088 Stage-1 map = 40%, reduce = 0%, Cumulative CPU 860.41 sec
2018-12-16 19:48:12,697 Stage-1 map = 41%, reduce = 0%, Cumulative CPU 882.53 sec
2018-12-16 19:48:33,283 Stage-1 map = 42%, reduce = 0%, Cumulative CPU 904.66 sec
2018-12-16 19:48:51,830 Stage-1 map = 43%, reduce = 0%, Cumulative CPU 923.53 sec
2018-12-16 19:49:12,424 Stage-1 map = 44%, reduce = 0%, Cumulative CPU 945.61 sec
2018-12-16 19:49:34,051 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 967.7 sec
2018-12-16 19:49:54,643 Stage-1 map = 46%, reduce = 0%, Cumulative CPU 989.84 sec
2018-12-16 19:50:16,280 Stage-1 map = 47%, reduce = 0%, Cumulative CPU 1011.93 sec
2018-12-16 19:50:34,821 Stage-1 map = 48%, reduce = 0%, Cumulative CPU 1030.79 sec
2018-12-16 19:50:55,415 Stage-1 map = 49%, reduce = 0%, Cumulative CPU 1052.83 sec
2018-12-16 19:51:17,048 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 1074.88 sec
2018-12-16 19:51:37,663 Stage-1 map = 51%, reduce = 0%, Cumulative CPU 1096.9 sec
2018-12-16 19:51:59,293 Stage-1 map = 52%, reduce = 0%, Cumulative CPU 1119.07 sec
2018-12-16 19:52:19,875 Stage-1 map = 53%, reduce = 0%, Cumulative CPU 1141.19 sec
2018-12-16 19:52:41,503 Stage-1 map = 54%, reduce = 0%, Cumulative CPU 1163.33 sec
2018-12-16 19:52:59,005 Stage-1 map = 55%, reduce = 0%, Cumulative CPU 1182.25 sec
2018-12-16 19:53:20,624 Stage-1 map = 56%, reduce = 0%, Cumulative CPU 1204.3 sec
2018-12-16 19:53:41,241 Stage-1 map = 57%, reduce = 0%, Cumulative CPU 1226.43 sec
2018-12-16 19:54:02,875 Stage-1 map = 58%, reduce = 0%, Cumulative CPU 1248.59 sec
2018-12-16 19:54:23,457 Stage-1 map = 59%, reduce = 0%, Cumulative CPU 1270.61 sec
2018-12-16 19:54:44,437 Stage-1 map = 60%, reduce = 0%, Cumulative CPU 1292.64 sec
2018-12-16 19:55:22,313 Stage-1 map = 61%, reduce = 0%, Cumulative CPU 1330.56 sec
2018-12-16 19:55:45,583 Stage-1 map = 62%, reduce = 0%, Cumulative CPU 1355.82 sec
2018-12-16 19:56:04,191 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 1374.78 sec
2018-12-16 19:56:24,908 Stage-1 map = 64%, reduce = 0%, Cumulative CPU 1396.93 sec
2018-12-16 19:56:45,732 Stage-1 map = 65%, reduce = 0%, Cumulative CPU 1418.99 sec
2018-12-16 19:57:07,375 Stage-1 map = 66%, reduce = 0%, Cumulative CPU 1441.0 sec
2018-12-16 19:57:28,046 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 1463.07 sec
2018-12-16 19:57:31,150 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1466.62 sec
2018-12-16 19:57:38,401 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 1468.17 sec
2018-12-16 19:57:40,490 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 1469.75 sec
2018-12-16 19:57:47,709 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1471.64 sec
MapReduce Total cumulative CPU time: 24 minutes 31 seconds 640 msec
Ended Job = job_1543589162810_0714
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 3 Cumulative CPU: 1471.72 sec HDFS Read: 180142367 HDFS Write: 436 SUCCESS
Total MapReduce CPU Time Spent: 24 minutes 31 seconds 720 msec
OK
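Comparison: the text and csv files are splittable, so each of those queries ran with 2 mappers working on the map phase in parallel and finished quickly (about 15 and 13 seconds of cumulative CPU time). gzip is not splittable, so that query ran with a single mapper for the whole map phase and was much slower (about 24 minutes 31 seconds of cumulative CPU time).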
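The mapper counts can be roughly predicted from the file size. The sketch below is only an estimate: estimate_mappers is a hypothetical helper, a 128 MB split size is assumed, and the HDFS Read figure from the gzip job (180142367 bytes) stands in for the file size. The real split calculation also depends on the input format and split-size settings in use.

```shell
#!/bin/sh
# Sketch: rough mapper estimate for one input file.
# estimate_mappers is a hypothetical helper; the 128 MB split size is an assumption.
set -eu

# Usage: estimate_mappers SIZE_BYTES SPLIT_BYTES SPLITTABLE(yes|no)
estimate_mappers() {
    size=$1
    split=$2
    splittable=$3
    if [ "$splittable" = "no" ] || [ "$size" -le "$split" ]; then
        # A non-splittable file is read start to finish by one mapper.
        echo 1
    else
        # Ceiling of size / split, in integer arithmetic.
        echo $(( (size + split - 1) / split ))
    fi
}

block=$((128 * 1024 * 1024))
estimate_mappers 180142367 "$block" yes   # same size, if it were splittable
estimate_mappers 180142367 "$block" no    # gzip: not splittable
```

Under these assumptions a splittable file of that size would get two mappers, while the gzip file gets one even though it is larger than a single block, which matches the single mapper seen in the log.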