A Distributed Implementation of an Inverted Index (MapReduce Program)

阿新 • Published: 2018-12-22

package aturbo.index.inverted;


import java.io.IOException;
import java.util.HashSet;


import org.apache.commons.lang3.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;


public class InvertedIndex {


	// Mapper: emits one (token, file name) pair for every whitespace-separated token.
	public static class Map extends Mapper<LongWritable, Text, Text, Text>{
		private Text documentId;
		private Text word = new Text();
		
		@Override
		protected void setup(Context context){
			// Each input split belongs to a single file, so the file name serves as the document id.
			String filename = ((FileSplit)context.getInputSplit()).getPath().getName();
			documentId = new Text(filename);
		}
		
		@Override
		protected void map(LongWritable key,Text value,Context context)throws IOException,InterruptedException{
			// StringUtils.split with no separator argument splits on whitespace.
			for(String token:StringUtils.split(value.toString())){
				word.set(token);
				context.write(word, documentId);
			}
		}
	}
	
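	// Reducer (also used as the combiner): joins the distinct document ids of each token.
	public static class Reduce extends Reducer<Text, Text, Text, Text>{
		
		private Text docIds = new Text();
		@Override
		public void reduce(Text key,Iterable<Text> values,Context context)throws IOException,InterruptedException{
			// Hadoop reuses one Text instance while iterating, so store String copies instead.
			HashSet<String> uniqueDocIds = new HashSet<String>();
			for(Text docId:values){
				// A value is already a comma-joined list if the combiner has run, so split it again.
				for(String id:StringUtils.split(docId.toString(), ',')){
					uniqueDocIds.add(id);
				}
			}
			docIds.set(StringUtils.join(uniqueDocIds, ","));
			context.write(key, docIds);
		}
	}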
	
	public static void main(String[] args)throws Exception{
		Configuration conf = new Configuration();
		
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
		if(otherArgs.length!=2){
			System.err.println("Usage: InvertedIndex <in> <out>");
			System.exit(2);
		}
		
		// Job.getInstance replaces the deprecated Job constructor.
		Job job = Job.getInstance(conf,"inverted index");
		job.setJarByClass(InvertedIndex.class);
		job.setMapperClass(Map.class);
		job.setCombinerClass(Reduce.class);
		job.setReducerClass(Reduce.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);
		FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
		FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
		System.exit(job.waitForCompletion(true)?0:1);
	}
}
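
To check what the job should produce without a cluster, the same map and reduce logic can be mimicked in memory. The following is only a minimal sketch, not part of the original program: the class name `LocalInvertedIndex`, the method `build`, and the sample file names and contents are all hypothetical. It uses a `TreeSet` so that the document ids come out in a stable order, whereas the MapReduce job above uses a `HashSet` and makes no ordering guarantee.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper class for local experiments; it does not depend on Hadoop.
public class LocalInvertedIndex {

    // Builds token -> sorted set of document names from (document name -> content).
    public static Map<String, TreeSet<String>> build(Map<String, String> docs) {
        Map<String, TreeSet<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // "Map" step: split the content on whitespace, one (token, docName) pair per token.
            for (String token : doc.getValue().trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty document yields a single empty token
                }
                // "Reduce" step: group by token; the set drops duplicate document names.
                index.computeIfAbsent(token, k -> new TreeSet<>()).add(doc.getKey());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Sample input, assumed purely for illustration.
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("a.txt", "hello hadoop hello");
        docs.put("b.txt", "hello mapreduce");
        // Print in the same "token<TAB>doc1,doc2" shape the job writes.
        for (Map.Entry<String, TreeSet<String>> e : build(docs).entrySet()) {
            System.out.println(e.getKey() + "\t" + String.join(",", e.getValue()));
        }
    }
}
```

With the sample input, `hello` maps to both files, while `hadoop` and `mapreduce` each map to a single file. Running the real job on the same two files should give the same groupings, though the order of document ids within a line may differ.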
