
Indexing LZO-compressed files in Hive to enable parallel processing

1. Make sure the index has been built

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/lib/hadoop-lzo-0.4.10.jar  com.hadoop.compression.lzo.DistributedLzoIndexer /user/hive/warehouse/flog
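Once the indexer finishes, each .lzo file in the directory should have a matching .lzo.index file beside it. A quick check (same warehouse path as in the command above; the file names shown are only hypothetical examples):

hadoop fs -ls /user/hive/warehouse/flog
# expected: every data file paired with an index file, e.g.
#   /user/hive/warehouse/flog/flog_20120304.lzo
#   /user/hive/warehouse/flog/flog_20120304.lzo.index

For a small directory, the single-machine indexer com.hadoop.compression.lzo.LzoIndexer in the same jar can be used instead of DistributedLzoIndexer; it writes the same .index files without launching a MapReduce job.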


2. To create a new external table in Hive, the statement is:

CREATE EXTERNAL TABLE foo (
         columnA string,
         columnB string )
    PARTITIONED BY (date string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t"
    STORED AS INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
          OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"
    LOCATION '/path/to/hive/tables/foo';
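Because the table is partitioned by date, each batch of LZO data still has to be registered as a partition before it becomes queryable. A minimal sketch, using a hypothetical date value and the LOCATION convention from the statement above:

ALTER TABLE foo ADD PARTITION (date = '2012-03-04')
    LOCATION '/path/to/hive/tables/foo/date=2012-03-04';

-- with the index files in place, a query like this should launch several
-- map tasks per large .lzo file instead of a single map per file
SELECT columnA, count(*) FROM foo WHERE date = '2012-03-04' GROUP BY columnA;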

3. For a table that already exists, the statement to change its format is:

ALTER TABLE foo
    SET FILEFORMAT
        INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
        OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";

4. After the ALTER TABLE, data that was already loaded into the table needs to be re-loaded and re-indexed; otherwise it still cannot be split. See the sketch below.
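A minimal sketch of that re-load, assuming a raw log file and the hypothetical paths below. First compress the file with lzop and stage it on HDFS:

lzop -9 /data/logs/access_20120304.log            # writes /data/logs/access_20120304.log.lzo
hadoop fs -put /data/logs/access_20120304.log.lzo /tmp/staging/

Then load it into a partition from Hive:

LOAD DATA INPATH '/tmp/staging/access_20120304.log.lzo'
    INTO TABLE foo PARTITION (date = '2012-03-04');

Finally, rebuild the index over the partition directory:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/lib/hadoop-lzo-0.4.10.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer /path/to/hive/tables/foo/date=2012-03-04

The indexing has to come after the LOAD, because LOAD DATA moves the .lzo file and the .index file must sit next to it in its final location.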

5. To run a MapReduce job with Hadoop Streaming, the command is:

hadoop jar /opt/mapr/hadoop/hadoop-0.20.2/contrib/streaming/hadoop-0.20.2-dev-streaming.jar \
    -file /home/pyshell/map.py -file /home/pyshell/red.py \
    -mapper /home/pyshell/map.py -reducer /home/pyshell/red.py \
    -input /aojianlog/20120304/gold/gold_38_3.csv.lzo \
    -output /aojianresult/gold38 \
    -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat \
    -jobconf mapred.output.compress=true \
    -jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec

Note: without the -inputformat com.hadoop.mapred.DeprecatedLzoTextInputFormat option, the map phase will not split the input either.

If only -jobconf mapred.output.compress=true is set, without -jobconf mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec, the reduce output files come out in .lzo_deflate format rather than .lzo.
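One follow-up worth noting: even with the LzopCodec option, the reduce output under /aojianresult/gold38 (part-*.lzo files) is not yet indexed, so a downstream job reading it is back to one unsplittable map per file. Re-running the indexer from step 1 over the output directory (same hadoop-lzo jar path assumed) restores splittability:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/lib/hadoop-lzo-0.4.10.jar \
    com.hadoop.compression.lzo.DistributedLzoIndexer /aojianresult/gold38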