
Hadoop: using map to merge small files into a SequenceFile


The previous example wrote the SequenceFile directly with SequenceFile's createWriter; this example uses MapReduce instead.
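
For contrast, the direct-write approach from the previous example looks roughly like the minimal sketch below. This is not the original post's code; the output path and the per-file append are illustrative placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DirectSequenceFileWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative output path; replace with a real HDFS path.
        Path out = new Path("/output/merged.seq");
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(out),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class))) {
            // For each small file: append its name as the key and its bytes as the value, e.g.
            // writer.append(new Text(fileName), new BytesWritable(fileBytes));
        }
    }
}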

1. Reading each small file in as a whole requires a custom InputFormat, and the custom InputFormat in turn requires a custom RecordReader that defines how records are read. To read a file whole, the RecordReader reads all of its bytes in a single call.

1.1 Extend the generic RecordReader class and override its methods.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;

public class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {

    private FileSplit fileSplit;
    private Configuration conf;
    private BytesWritable value = new BytesWritable();
    private boolean processed = false;

    /**
     * Called once at initialization.
     *
     * @param split   the split that defines the range of records to read
     * @param context the information about the task
     */
    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();
    }

    /**
     * Read the next key, value pair.
     *
     * @return true if a key/value pair was read
     */
    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!processed) {
            byte[] contents = new byte[(int) fileSplit.getLength()];
            Path file = fileSplit.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = null;
            try {
                in = fs.open(file);
                IOUtils.readFully(in, contents, 0, contents.length); // read the entire file in one call
                value.set(contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            processed = true;
            return true;
        }
        return false;
    }

    /**
     * Get the current key.
     *
     * @return the current key or null if there is no current key
     */
    @Override
    public NullWritable getCurrentKey() throws IOException, InterruptedException {
        return NullWritable.get();
    }

    /**
     * Get the current value.
     *
     * @return the object that was read
     */
    @Override
    public BytesWritable getCurrentValue() throws IOException, InterruptedException {
        return value;
    }

    /**
     * The current progress of the record reader through its data.
     *
     * @return a number between 0.0 and 1.0 that is the fraction of the data read
     */
    @Override
    public float getProgress() throws IOException, InterruptedException {
        return processed ? 1.0f : 0.0f;
    }

    /**
     * Close the record reader.
     */
    @Override
    public void close() throws IOException {
    }
}

1.2 Extend the generic FileInputFormat class to define the custom file input format.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

import java.io.IOException;


public class WholeFileInputFormat extends FileInputFormat<NullWritable,BytesWritable> {
    /**
     * Is the given filename splittable? Usually, true, but if the file is
     * stream compressed, it will not be.
     * <p>
     * The default implementation in <code>FileInputFormat</code> always returns
     * true. Implementations that may deal with non-splittable files <i>must</i>
     * override this method.
     * <p>
     * <code>FileInputFormat</code> implementations can override this and return
     * <code>false</code> to ensure that individual input files are never split-up
     * so that mappers process entire files.
     *
     * @param context  the job context
     * @param filename the file name to check
     * @return is this file splitable?
     */
    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false; // do not split the file, so it can be read in as a whole
    }

    /**
     * Create a record reader for a given split. The framework will call
     * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
     * the split is used.
     *
     * @param split   the split to be read
     * @param context the information about the task
     * @return a new record reader
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(InputSplit split, TaskAttemptContext context) throws IOException, InterruptedException {
        WholeFileRecordReader recordReader = new WholeFileRecordReader();
        recordReader.initialize(split,context);
        return recordReader;
    }
}

2. The Mapper. No custom Reducer is written; this example only merges files.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

import java.io.IOException;


public class SequenceFileMapper extends Mapper<NullWritable,BytesWritable,Text,BytesWritable> {
    private Text filenameKey;
    /**
     * Called once at the beginning of the task.
     *
     * @param context
     */
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        InputSplit split = context.getInputSplit();
        Path path = ((FileSplit)split).getPath();
        filenameKey = new Text(path.toString());
    }

    /**
     * Called once for each key/value pair in the input split. Most applications
     * should override this, but the default is the identity function.
     *
     * @param key
     * @param value
     * @param context
     */
    @Override
    protected void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException {
        context.write(filenameKey,value);
    }
}

3. Run the job. The Tool helper interface is used here, but it is optional; you could also configure and submit the Job directly.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class SmallFilesToSequenceFileConverter extends Configured implements Tool {

    /**
     * Execute the command with the given arguments.
     *
     * @param args command specific arguments.
     * @return exit code.
     * @throws Exception
     */
    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        if(conf==null){
            return -1;
        }

        Path outPath = new Path(args[1]);
        FileSystem fileSystem = outPath.getFileSystem(conf);
        // delete the output path if it already exists
        if(fileSystem.exists(outPath))
        {
            fileSystem.delete(outPath,true);
        }

        Job job = Job.getInstance(conf,"SmallFilesToSequenceFile");
        job.setJarByClass(SmallFilesToSequenceFileConverter.class);

        job.setMapperClass(SequenceFileMapper.class);

        job.setInputFormatClass(WholeFileInputFormat.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(BytesWritable.class);



        FileInputFormat.addInputPath(job,new Path(args[0]));
        FileOutputFormat.setOutputPath(job,new Path(args[1]));


        return job.waitForCompletion(true) ? 0:1;
    }

    public static void main(String[] args) throws Exception {
        long startTime = System.currentTimeMillis();

        int exitCode = ToolRunner.run(new SmallFilesToSequenceFileConverter(), args);

        long endTime = System.currentTimeMillis();
        long timeSpan = endTime - startTime;
        System.out.println("Elapsed time: " + timeSpan + " ms.");

        // exit only after the elapsed time has been printed
        System.exit(exitCode);
    }
}

4. Upload to the cluster and run. When packaging the jar, put the META-INF directory at the same level as the src directory so that the main-class entry point can be found.

# manually set the number of reducers to 2; the job produces two part files
[hadoop@bigdata-senior01 ~]$ hadoop jar SmallFilesToSequenceFileConverter.jar -D mapreduce.job.reduces=2 /demo /output3


...
[hadoop@bigdata-senior01 ~]$ hadoop fs -ls /output3
Found 3 items
-rw-r--r-- 1 hadoop supergroup 0 2019-02-18 16:17 /output3/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 60072 2019-02-18 16:17 /output3/part-r-00000
-rw-r--r-- 1 hadoop supergroup 28520 2019-02-18 16:17 /output3/part-r-00001
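
To check the result, the SequenceFile can be read back and its keys (the original file paths) listed. Below is a minimal sketch using SequenceFile.Reader; the part file path is taken from the listing above and is otherwise an assumption about your output location.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFilePeek {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // One of the part files produced by the job above (assumed path).
        Path path = new Path("/output3/part-r-00000");
        try (SequenceFile.Reader reader = new SequenceFile.Reader(conf,
                SequenceFile.Reader.file(path))) {
            Text key = new Text();                     // original file path
            BytesWritable value = new BytesWritable(); // original file contents
            while (reader.next(key, value)) {
                System.out.println(key + "\t" + value.getLength() + " bytes");
            }
        }
    }
}

Alternatively, hadoop fs -text on a part file prints the key/value pairs directly.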
