Writing a test case to verify whether TXT files and gzip files can be processed in parallel

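1) Prepare a TXT file and a gzip file and put them on HDFS. Each file must be larger than one HDFS block.

2) Write Hive SQL that runs some computation over the tables backed by the two kinds of data file, and compare the number of map tasks.

3) Draw a conclusion.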

TXT file test:

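To compress a TXT file to gzip while keeping the original TXT file: gzip -c input.txt > input.txt.gz writes the gzip file and leaves input.txt in place. (gzip -c on its own only writes the compressed data to standard output, so the redirect is required.)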
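Because gzip -c sends the compressed stream to standard output, it is the redirect that creates the .gz file. A minimal sketch of that behaviour in a scratch directory; the directory and the sample row are made up for illustration and are not from the original test data:

```shell
# Work in a throwaway directory so no real file is touched (illustrative only).
workdir=$(mktemp -d)
printf '1001\tmath\t90\t2018-11-28\n' > "$workdir/input.txt"

# -c writes the compressed bytes to stdout; the redirect creates the .gz file
# and input.txt is left in place.
gzip -c "$workdir/input.txt" > "$workdir/input.txt.gz"

# Both input.txt and input.txt.gz should now be listed.
ls "$workdir"
```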

cp test.txt test2.txt  # make a copy of the file first

cat test2.txt >> test.txt  # append the copy to grow the file past 128 MB (repeat as many times as needed)

du -sh *  # check the file sizes
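The copy-and-append steps above can be wrapped in a loop so the file does not have to be grown by hand. This is only a sketch: grow_file is a hypothetical helper name, and it doubles the file on every pass rather than appending a fixed copy.

```shell
# grow_file FILE TARGET_BYTES: double FILE until it holds at least TARGET_BYTES.
# Hypothetical helper, not part of the original post.
grow_file() {
  file=$1
  target=$2
  size=$(( $(wc -c < "$file") ))
  if [ "$size" -eq 0 ]; then
    echo "grow_file: $file is empty, nothing to append" >&2
    return 1
  fi
  while [ "$size" -lt "$target" ]; do
    cp "$file" "$file.copy"        # same idea as: cp test.txt test2.txt
    cat "$file.copy" >> "$file"    # same idea as: cat test2.txt >> test.txt
    size=$(( $(wc -c < "$file") ))
  done
  rm -f "$file.copy"
}

# Small demonstration on a scratch file with a made-up row.
demo=$(mktemp)
printf '1001\tmath\t90\t2018-11-28\n' > "$demo"
grow_file "$demo" 1000
wc -c < "$demo"
```

For the real test the call would look something like grow_file test.txt $((128 * 1024 * 1024)), assuming a 128 MB block size.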

hdfs dfs -copyFromLocal test.txt /tmp/niuniu/test.txt  # upload to HDFS

hdfs dfs -du -h /tmp/niuniu/test.txt  # check the file size on HDFS

create table practice_score (
stdno string,
courseNo string,
score int,
opDate string
)  
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '\t';  -- create the table

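load data inpath '/tmp/niuniu/test.txt' into table practice_score;  -- load the TXT file on HDFS into the table

(load data inpath actually moves the file into the table's location directory; run show create table practice_score to see that location. Here the first batch of data is loaded with load so that the table already contains data. After that, more data can be added by uploading files straight into the location with hdfs, even if the file format is different.)

                hdfs dfs -put test.txt <table location path>   # the part of the location starting at apps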
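One way to get the path for the put command is to pull it out of the show create table output. The sketch below assumes that output prints a LOCATION line followed by the quoted URI on the next line; table_path is a hypothetical helper, and the sample text (host name, port, warehouse path) is invented for the demonstration.

```shell
# table_path: read `show create table` output on stdin and print the
# path portion of the LOCATION URI. Hypothetical helper.
table_path() {
  awk '/^LOCATION/ { getline; print; exit }' \
    | tr -d "' \t" \
    | sed -E 's#^[a-zA-Z0-9]+://[^/]*##'
}

# Invented sample output; a real cluster will show its own host and path.
sample=$(mktemp)
cat > "$sample" <<'EOF'
CREATE TABLE `practice_score`(
  `stdno` string)
LOCATION
  'hdfs://namenode.example:8020/apps/hive/warehouse/practice_score'
EOF

location=$(table_path < "$sample")
echo "$location"
```

Against a live cluster, and assuming the hive CLI is on the PATH, the same function could be fed with hive -e 'show create table practice_score' | table_path, and the result passed to hdfs dfs -put.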

select stdno,count(1)
from practice_score
group by stdno;  -- a select statement that triggers a MapReduce job

Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 4

2018-11-28 00:07:15,242 Stage-1 map = 0%,  reduce = 0%
2018-11-28 00:07:26,875 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 9.03 sec
2018-11-28 00:07:34,245 Stage-1 map = 100%,  reduce = 75%, Cumulative CPU 13.62 sec
2018-11-28 00:07:36,347 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 15.1 sec
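To turn this check into a repeatable test case, the mapper count can be read from the "Hadoop job information" line instead of by eye. A minimal sketch; mapper_count is a hypothetical helper, and the sample line is the one from the log above:

```shell
# mapper_count: read Hive console output on stdin and print the number of
# mappers reported for the first stage. Hypothetical helper.
mapper_count() {
  sed -n 's/.*number of mappers: \([0-9][0-9]*\);.*/\1/p' | head -n 1
}

count=$(printf '%s\n' \
  'Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 4' \
  | mapper_count)
echo "mappers: $count"
```

With a real query the input would come from the hive CLI, for example hive -e "select stdno,count(1) from practice_score group by stdno" 2>&1 | mapper_count. The 2>&1 is there on the assumption that the job progress is written to standard error.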

CSV file test:

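cp test.csv test2.csv  # make a copy of the file first

cat test2.csv >> test.csv  # append the copy to grow the file past 128 MB (repeat as many times as needed)

du -sh *  # check the file sizes

hdfs dfs -copyFromLocal test.csv /tmp/niuniu/test.csv  # upload to HDFS

hdfs dfs -du -h /tmp/niuniu/test.csv  # check the file size on HDFS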

create table practice_score2 (
stdno string,
courseNo string,
score int,
opDate string
)  
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',';  -- create the table

load data inpath '/tmp/niuniu/test.csv' into table practice_score2;  -- load the CSV file on HDFS into the table

select stdno,count(1)
from practice_score2
group by stdno;  -- a select statement that triggers a MapReduce job

Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 3
2018-11-28 10:33:33,800 Stage-1 map = 0%,  reduce = 0%
2018-11-28 10:33:45,500 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 3.59 sec
2018-11-28 10:33:46,544 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 8.62 sec
2018-11-28 10:33:52,837 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 10.12 sec
2018-11-28 10:33:53,877 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 11.65 sec
2018-11-28 10:33:57,021 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 13.29 sec

gzip file test:

gzip test.csv

gzip test2.csv

cat test2.csv.gz>>test.csv.gz

du -h test.csv.gz

hdfs dfs -copyFromLocal test.csv.gz /tmp/niuniu/test.csv.gz

hdfs dfs -ls /tmp/niuniu/

du -h test.csv.gz
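The cat test2.csv.gz>>test.csv.gz step works because a gzip file may consist of several compressed members back to back, and gzip decompresses them as one stream. A small sketch with made-up rows in a scratch directory:

```shell
# Scratch directory and sample rows are illustrative only.
gzdir=$(mktemp -d)
printf '1001,math,90,2018-11-28\n' | gzip -c > "$gzdir/test.csv.gz"
printf '1002,math,85,2018-11-28\n' | gzip -c > "$gzdir/test2.csv.gz"

# Append one archive to the other, as in the step above.
cat "$gzdir/test2.csv.gz" >> "$gzdir/test.csv.gz"

# gzip -t checks integrity; gzip -dc prints the rows of both members in order.
gzip -t "$gzdir/test.csv.gz" && echo "archive is valid"
gzip -dc "$gzdir/test.csv.gz"
```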

create table practice_score3 (
stdno string,
courseNo string,
score int,
opDate string
)  
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',';

load data inpath '/tmp/niuniu/test.csv.gz' into table practice_score3;

select stdno,count(1)
from practice_score3
group by stdno;

Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 3
2018-12-16 19:34:08,478 Stage-1 map = 0%,  reduce = 0%
2018-12-16 19:34:28,306 Stage-1 map = 1%,  reduce = 0%, Cumulative CPU 20.36 sec
2018-12-16 19:34:46,987 Stage-1 map = 2%,  reduce = 0%, Cumulative CPU 39.52 sec
2018-12-16 19:35:07,703 Stage-1 map = 3%,  reduce = 0%, Cumulative CPU 61.67 sec
2018-12-16 19:35:29,461 Stage-1 map = 4%,  reduce = 0%, Cumulative CPU 83.87 sec
2018-12-16 19:35:50,155 Stage-1 map = 5%,  reduce = 0%, Cumulative CPU 106.1 sec
2018-12-16 19:36:11,877 Stage-1 map = 6%,  reduce = 0%, Cumulative CPU 128.24 sec
2018-12-16 19:36:29,492 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 147.23 sec
2018-12-16 19:36:50,198 Stage-1 map = 8%,  reduce = 0%, Cumulative CPU 169.25 sec
2018-12-16 19:37:11,931 Stage-1 map = 9%,  reduce = 0%, Cumulative CPU 191.29 sec
2018-12-16 19:37:32,627 Stage-1 map = 10%,  reduce = 0%, Cumulative CPU 213.4 sec
2018-12-16 19:37:54,335 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 235.52 sec
2018-12-16 19:38:11,911 Stage-1 map = 12%,  reduce = 0%, Cumulative CPU 254.43 sec
2018-12-16 19:38:33,651 Stage-1 map = 13%,  reduce = 0%, Cumulative CPU 276.54 sec
2018-12-16 19:38:54,336 Stage-1 map = 14%,  reduce = 0%, Cumulative CPU 298.66 sec
2018-12-16 19:39:15,004 Stage-1 map = 15%,  reduce = 0%, Cumulative CPU 320.77 sec
2018-12-16 19:39:36,658 Stage-1 map = 16%,  reduce = 0%, Cumulative CPU 342.83 sec
2018-12-16 19:39:54,180 Stage-1 map = 17%,  reduce = 0%, Cumulative CPU 361.83 sec
2018-12-16 19:40:15,819 Stage-1 map = 18%,  reduce = 0%, Cumulative CPU 383.99 sec
2018-12-16 19:40:36,431 Stage-1 map = 19%,  reduce = 0%, Cumulative CPU 406.01 sec
2018-12-16 19:40:58,079 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 428.04 sec
2018-12-16 19:41:18,681 Stage-1 map = 21%,  reduce = 0%, Cumulative CPU 450.2 sec
2018-12-16 19:41:40,328 Stage-1 map = 22%,  reduce = 0%, Cumulative CPU 472.23 sec
2018-12-16 19:42:00,919 Stage-1 map = 23%,  reduce = 0%, Cumulative CPU 494.25 sec
2018-12-16 19:42:22,539 Stage-1 map = 24%,  reduce = 0%, Cumulative CPU 516.36 sec
2018-12-16 19:42:41,084 Stage-1 map = 25%,  reduce = 0%, Cumulative CPU 535.33 sec
2018-12-16 19:43:01,691 Stage-1 map = 26%,  reduce = 0%, Cumulative CPU 557.43 sec
2018-12-16 19:43:23,309 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 579.46 sec
2018-12-16 19:43:43,909 Stage-1 map = 28%,  reduce = 0%, Cumulative CPU 601.55 sec
2018-12-16 19:44:05,527 Stage-1 map = 29%,  reduce = 0%, Cumulative CPU 623.59 sec
2018-12-16 19:44:23,039 Stage-1 map = 30%,  reduce = 0%, Cumulative CPU 642.58 sec
2018-12-16 19:44:44,697 Stage-1 map = 31%,  reduce = 0%, Cumulative CPU 664.76 sec
2018-12-16 19:45:05,293 Stage-1 map = 32%,  reduce = 0%, Cumulative CPU 686.87 sec
2018-12-16 19:45:26,910 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 708.99 sec
2018-12-16 19:45:47,530 Stage-1 map = 34%,  reduce = 0%, Cumulative CPU 731.03 sec
2018-12-16 19:46:09,143 Stage-1 map = 35%,  reduce = 0%, Cumulative CPU 753.15 sec
2018-12-16 19:46:29,726 Stage-1 map = 36%,  reduce = 0%, Cumulative CPU 775.2 sec
2018-12-16 19:46:51,346 Stage-1 map = 37%,  reduce = 0%, Cumulative CPU 797.28 sec
2018-12-16 19:47:08,848 Stage-1 map = 38%,  reduce = 0%, Cumulative CPU 816.21 sec
2018-12-16 19:47:30,475 Stage-1 map = 39%,  reduce = 0%, Cumulative CPU 838.32 sec
2018-12-16 19:47:51,088 Stage-1 map = 40%,  reduce = 0%, Cumulative CPU 860.41 sec
2018-12-16 19:48:12,697 Stage-1 map = 41%,  reduce = 0%, Cumulative CPU 882.53 sec
2018-12-16 19:48:33,283 Stage-1 map = 42%,  reduce = 0%, Cumulative CPU 904.66 sec
2018-12-16 19:48:51,830 Stage-1 map = 43%,  reduce = 0%, Cumulative CPU 923.53 sec
2018-12-16 19:49:12,424 Stage-1 map = 44%,  reduce = 0%, Cumulative CPU 945.61 sec
2018-12-16 19:49:34,051 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 967.7 sec
2018-12-16 19:49:54,643 Stage-1 map = 46%,  reduce = 0%, Cumulative CPU 989.84 sec
2018-12-16 19:50:16,280 Stage-1 map = 47%,  reduce = 0%, Cumulative CPU 1011.93 sec
2018-12-16 19:50:34,821 Stage-1 map = 48%,  reduce = 0%, Cumulative CPU 1030.79 sec
2018-12-16 19:50:55,415 Stage-1 map = 49%,  reduce = 0%, Cumulative CPU 1052.83 sec
2018-12-16 19:51:17,048 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 1074.88 sec
2018-12-16 19:51:37,663 Stage-1 map = 51%,  reduce = 0%, Cumulative CPU 1096.9 sec
2018-12-16 19:51:59,293 Stage-1 map = 52%,  reduce = 0%, Cumulative CPU 1119.07 sec
2018-12-16 19:52:19,875 Stage-1 map = 53%,  reduce = 0%, Cumulative CPU 1141.19 sec
2018-12-16 19:52:41,503 Stage-1 map = 54%,  reduce = 0%, Cumulative CPU 1163.33 sec
2018-12-16 19:52:59,005 Stage-1 map = 55%,  reduce = 0%, Cumulative CPU 1182.25 sec
2018-12-16 19:53:20,624 Stage-1 map = 56%,  reduce = 0%, Cumulative CPU 1204.3 sec
2018-12-16 19:53:41,241 Stage-1 map = 57%,  reduce = 0%, Cumulative CPU 1226.43 sec
2018-12-16 19:54:02,875 Stage-1 map = 58%,  reduce = 0%, Cumulative CPU 1248.59 sec
2018-12-16 19:54:23,457 Stage-1 map = 59%,  reduce = 0%, Cumulative CPU 1270.61 sec
2018-12-16 19:54:44,437 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 1292.64 sec
2018-12-16 19:55:22,313 Stage-1 map = 61%,  reduce = 0%, Cumulative CPU 1330.56 sec
2018-12-16 19:55:45,583 Stage-1 map = 62%,  reduce = 0%, Cumulative CPU 1355.82 sec
2018-12-16 19:56:04,191 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 1374.78 sec
2018-12-16 19:56:24,908 Stage-1 map = 64%,  reduce = 0%, Cumulative CPU 1396.93 sec
2018-12-16 19:56:45,732 Stage-1 map = 65%,  reduce = 0%, Cumulative CPU 1418.99 sec
2018-12-16 19:57:07,375 Stage-1 map = 66%,  reduce = 0%, Cumulative CPU 1441.0 sec
2018-12-16 19:57:28,046 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 1463.07 sec
2018-12-16 19:57:31,150 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1466.62 sec
2018-12-16 19:57:38,401 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 1468.17 sec
2018-12-16 19:57:40,490 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 1469.75 sec
2018-12-16 19:57:47,709 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 1471.64 sec
MapReduce Total cumulative CPU time: 24 minutes 31 seconds 640 msec
Ended Job = job_1543589162810_0714
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 3   Cumulative CPU: 1471.72 sec   HDFS Read: 180142367 HDFS Write: 436 SUCCESS
Total MapReduce CPU Time Spent: 24 minutes 31 seconds 720 msec
OK

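Comparison: the text and CSV files are splittable, so each job ran with multiple mappers (2 in both tests) and the map phase finished within seconds. A gzip file is not splittable, so a single mapper had to read the whole file in the map phase (HDFS Read: 180142367 bytes, about 172 MB), and the job was much slower, at roughly 24.5 minutes of cumulative CPU time.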
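The mapper counts can be approximated with a back-of-the-envelope estimate: a splittable file gets about one mapper per split, while a non-splittable file always gets one. The sketch below is a rough model only. estimated_mappers is a hypothetical helper, the 128 MB split size is an assumption taken from the file sizes used in this test, and the real input format applies further rules (such as combining or merging small splits) that are ignored here.

```shell
# estimated_mappers SIZE_BYTES SPLIT_BYTES SPLITTABLE(yes|no)
# Rough estimate only; hypothetical helper.
estimated_mappers() {
  size=$1
  split=$2
  splittable=$3
  if [ "$splittable" != "yes" ] || [ "$size" -le "$split" ]; then
    echo 1
  else
    echo $(( (size + split - 1) / split ))   # ceiling of size / split
  fi
}

block=$((128 * 1024 * 1024))   # assumed split size

# 180142367 is the HDFS Read figure from the gzip job above.
estimated_mappers 180142367 "$block" yes   # a plain text file of that size
estimated_mappers 180142367 "$block" no    # the gzip file
```

Under these assumptions the first call prints 2 and the second prints 1, which matches the mapper counts reported in the logs.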