Implementing an Inverted Index with MapReduce
The name "inverted index" is easy to misread as meaning the entries are simply reversed, say from A-Z into Z-A, but that is not what it means.
Normally we start from a file and look up its contents; an inverted index goes the other way, using the content to locate the documents. In other words, given a word, it tells you which files that word appears in.
Once you understand that, the rest is straightforward:
Prepare some source data files.
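For illustration, suppose the input is two small text files of space-separated words. The file names a.txt and b.txt and their contents below are only an assumed example, not the author's original data:

a.txt:
hello world hello hadoop

b.txt:
hello hadoop world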
We will process the data with two MapReduce jobs.
First pass over the data
package com.invalid;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvalidDriver {
    public static void main(String[] args) throws Exception {
        // Set up the job
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // Jar to run
        job.setJarByClass(InvalidDriver.class);

        // Mapper and reducer classes
        job.setMapperClass(InvalidMapper.class);
        job.setReducerClass(InvalidReducer.class);

        // Map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Reduce output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Run the job and report the result
        boolean result = job.waitForCompletion(true);
        System.out.println(result);
    }
}

class InvalidMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    String name;

    // setup() runs only once per split: read the file name here
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        FileSplit split = (FileSplit) context.getInputSplit();
        name = split.getPath().getName();
    }

    // Emit <word--filename, 1> for every word in the line
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] split = value.toString().split(" ");
        for (String word : split) {
            context.write(new Text(word + "--" + name), new IntWritable(1));
        }
    }
}

class InvalidReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum the counts for each <word--filename> key
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        for (IntWritable value : values) {
            count += value.get();
        }
        context.write(new Text(key), new IntWritable(count));
    }
}
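The driver reads the input and output paths from the command line, so the job can be submitted in the usual way. The jar name and paths below are placeholders, not taken from the original post:

hadoop jar invalid.jar com.invalid.InvalidDriver /input /output1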
The first pass produces one line per word-and-file pair: the word and the file name joined by "--", then a tab, then the count.
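With the example files assumed above, the first-pass output would look roughly like this (keys are sorted by MapReduce):

hadoop--a.txt	1
hadoop--b.txt	1
hello--a.txt	2
hello--b.txt	1
world--a.txt	1
world--b.txt	1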
Second pass
package com.invalid;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvalidDriver2 {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(InvalidDriver2.class);

        job.setMapperClass(InvalidMapper2.class);
        job.setReducerClass(InvalidReducer2.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        boolean result = job.waitForCompletion(true);
        System.out.println(result);
    }
}

class InvalidMapper2 extends Mapper<LongWritable, Text, Text, Text> {
    // Each input line looks like "word--filename<TAB>count"; split on "--"
    // so the key becomes the word and the value "filename<TAB>count"
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] split = value.toString().split("--");
        String keys = split[0];
        String values = split[1];
        context.write(new Text(keys), new Text(values));
    }
}

class InvalidReducer2 extends Reducer<Text, Text, Text, Text> {
    // Concatenate all "filename<TAB>count" values for a word, rewriting the
    // tab as "-->" so each entry reads "filename-->count"
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder sb = new StringBuilder();
        for (Text string : values) {
            sb.append(string.toString().replaceAll("\t", "-->")).append(" ");
        }
        context.write(new Text(key), new Text(sb.toString()));
    }
}
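With the same assumed example data, the final inverted index would look roughly like this; the order of the file entries after each word is not guaranteed, since it depends on the order in which the reducer receives the values:

hadoop	a.txt-->1 b.txt-->1
hello	a.txt-->2 b.txt-->1
world	a.txt-->1 b.txt-->1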