Custom File Reading in Hadoop

阿新 • Published: 2019-01-02

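Today I came across some data online that is well suited to analysis with MapReduce. Each record has the following format (two records are shown):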

[**] [1:538:15] NETBIOS SMB IPC$ unicode share access [**]
[Classification: Generic Protocol Command Decode] [Priority: 3]
09/04-17:53:56.363811 168.150.177.165:1051 -> 168.150.177.166:139
TCP TTL:128 TOS:0x0 ID:4000 IpLen:20 DgmLen:138 DF
***AP*** Seq: 0x2E589B8 Ack: 0x642D47F9 Win: 0x4241 TcpLen: 20

[**] [1:1917:6] SCAN UPnP service discover attempt [**]
[Classification: Detection of a Network Scan] [Priority: 3]
09/04-17:53:56.385573 168.150.177.164:1032 -> 239.255.255.250:1900
UDP TTL:1 TOS:0x0 ID:80 IpLen:20 DgmLen:161
Len: 133

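As you can see, records like these cannot be processed with the default TextInputFormat, which treats every line as a separate record.

So we need to write our own reader class. As the format above shows, records are separated by a blank line, and the consecutive non-blank lines between two blank lines together form one record. Without further ado, here is the code: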

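import java.io.IOException;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.util.LineReader;

public class MyRecordReader extends RecordReader<IntWritable, Text> {
    private static final Log LOG = LogFactory.getLog(MyRecordReader.class);

    private int pos;                 // number of the next line to read
    private boolean more;            // false once the end of the file is reached
    private LineReader in;
    private int maxLineLength;
    private IntWritable key = null;
    private Text line = null;        // buffer for the line just read
    private Text value = new Text(); // the current multi-line record

    @Override
    public void initialize(InputSplit genericSplit, TaskAttemptContext context) throws IOException {
        pos = 1;
        more = true;
        FileSplit split = (FileSplit) genericSplit;
        Configuration job = context.getConfiguration();
        this.maxLineLength = job.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
        final Path file = split.getPath();
        // Open the file. The split's start offset and length are ignored,
        // so the whole file is read from the beginning.
        FileSystem fs = file.getFileSystem(job);
        FSDataInputStream fileIn = fs.open(file);
        in = new LineReader(fileIn, job);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        if (key == null) {
            key = new IntWritable();
        }
        key.set(pos); // line number at which this record starts
        if (line == null) {
            line = new Text();
        }
        StringBuilder record = new StringBuilder();
        while (true) {
            int newSize = in.readLine(line, maxLineLength, maxLineLength);
            if (newSize == 0) {
                // End of file: emit the pending record, if there is one.
                if (record.length() == 0) {
                    key = null;
                    more = false;
                    return false;
                }
                break;
            }
            pos++;
            if (line.getLength() == 0) {
                // A blank line (LF or CRLF) ends the current record.
                break;
            }
            if (newSize < maxLineLength) {
                // A non-blank line: the record continues. Separate lines with a space.
                record.append(line.toString()).append(' ');
            } else {
                // Line too long: skip it and keep reading.
                LOG.info("Skipped line of size " + newSize + " at line " + (pos - 1));
            }
        }
        value.set(record.toString());
        return true;
    }

    @Override
    public IntWritable getCurrentKey() {
        return key;
    }

    @Override
    public Text getCurrentValue() {
        return value;
    }

    /**
     * Get the progress within the split, as a fraction between 0 and 1.
     * Only "not finished" and "finished" are tracked here.
     */
    @Override
    public float getProgress() {
        return more ? 0.0f : 1.0f;
    }

    @Override
    public synchronized void close() throws IOException {
        if (in != null) {
            in.close();
        }
    }
}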

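With the class above, the consecutive lines of each alert (five in the samples above) are joined into a single record, and a blank line marks the end of a record. Because the reader ignores the split's start offset and length, it should be returned from the createRecordReader() method of a custom InputFormat whose isSplitable() returns false; otherwise a file spanning several splits would be read in full by each of them.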
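The grouping rule itself can be tried out without a Hadoop cluster. The following is a minimal standalone sketch, not part of the original post: the class name BlankLineRecords and the method splitRecords are hypothetical, and it works on an in-memory string rather than on a LineReader. It assumes the same rule as the reader above (a blank line ends a record, other lines are joined with a space), except that it does not leave a trailing space at the end of a record.

```java
import java.util.ArrayList;
import java.util.List;

/** Standalone sketch (no Hadoop): groups lines into blank-line-delimited records. */
class BlankLineRecords {

    /** Splits the text into records; the lines of one record are joined with a space. */
    static List<String> splitRecords(String text) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\r?\n")) {
            if (line.isEmpty()) {
                // Blank line: the current record, if any, is complete.
                if (current.length() > 0) {
                    records.add(current.toString());
                    current.setLength(0);
                }
            } else {
                // Non-blank line: the record continues.
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(line);
            }
        }
        // The last record may not be followed by a blank line.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "[**] [1:538:15] NETBIOS SMB IPC$ unicode share access [**]",
            "[Classification: Generic Protocol Command Decode] [Priority: 3]",
            "09/04-17:53:56.363811 168.150.177.165:1051 -> 168.150.177.166:139",
            "TCP TTL:128 TOS:0x0 ID:4000 IpLen:20 DgmLen:138 DF",
            "***AP*** Seq: 0x2E589B8 Ack: 0x642D47F9 Win: 0x4241 TcpLen: 20",
            "",
            "[**] [1:1917:6] SCAN UPnP service discover attempt [**]",
            "[Classification: Detection of a Network Scan] [Priority: 3]",
            "09/04-17:53:56.385573 168.150.177.164:1032 -> 239.255.255.250:1900",
            "UDP TTL:1 TOS:0x0 ID:80 IpLen:20 DgmLen:161",
            "Len: 133");
        // Print each reconstructed record on its own line.
        for (String record : splitRecords(sample)) {
            System.out.println(record);
        }
    }
}
```

Run on the two sample alerts, the sketch yields two records, each on a single line. Unlike the reader above, it silently skips runs of consecutive blank lines instead of emitting empty records; whether that is desirable depends on the input data.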
