
MapReduce Series (6) --- Building an Inverted Index

1. Overview

Suppose we have three files, a.txt, b.txt and c.txt, with the following contents:

a.txt:

tian jun
li lei
han meimei
li lei
han meimei

b.txt:

li lei
han meimei
tian jun
gege
jiejie
tian jun
gege
jiejie

c.txt:

gege
jiejie
han meimei
tian jun
han meimei
tian jun

Counting how many times each word appears in each file gives us the inverted index. The expected result looks like this:

gege    b.txt-->2,c.txt-->1
han     a.txt-->2,b.txt-->1,c.txt-->2
jiejie  b.txt-->2,c.txt-->1
jun     c.txt-->2,b.txt-->2,a.txt-->1
lei     b.txt-->1,a.txt-->2
li      a.txt-->2,b.txt-->1
meimei  a.txt-->2,b.txt-->1,c.txt-->2
tian    b.txt-->2,c.txt-->2,a.txt-->1

Approach:

An MR job groups records that share the same key. Exploiting this, we can combine each word with the name of the file it appears in to form the key (for example "tian--a.txt"), which already gets us most of the way to the result above. A single MR job cannot easily produce the final per-word layout, though, so we add a second job: it swaps key and value, reformats the old key, and aggregates the postings for each word. Two chained MR jobs are enough to meet the requirement.
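To make the data flow concrete, here is roughly what happens to one word (the numbers come from the expected output above). The first job emits "word--filename" keys with a count; the second job's mapper splits that key on "--" and replaces the tab with "-->"; the second job's reducer then concatenates the postings for each word:

job 1 output:    gege--b.txt  2        gege--c.txt  1
job 2 map:       gege  b.txt-->2       gege  c.txt-->1
job 2 reduce:    gege  b.txt-->2,c.txt-->1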

2. Code Implementation

inverIndexStepOne.java

package inverIndex;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * Created by tianjun on 2017/3/20.
 */
public class inverIndexStepOne {

    static class InverIndexStepOneMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        Text k = new Text();
        IntWritable v = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(" ");
            // the key combines the word with the name of the file this split comes from
            FileSplit inputSplit = (FileSplit) context.getInputSplit();
            String filename = inputSplit.getPath().getName();
            for (String word : words) {
                k.set(word + "--" + filename);
                context.write(k, v);
            }
        }
    }

    static class InverIndexStepOneReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            int count = 0;
            for (IntWritable value : values) {
                count += value.get();
            }
            context.write(key, new IntWritable(count));
        }
    }

    public static void main(String[] args) throws IOException, URISyntaxException, ClassNotFoundException, InterruptedException {

        String os = System.getProperty("os.name").toLowerCase();
        if (os.contains("windows")) {
            System.setProperty("HADOOP_USER_NAME", "root");
        }

        Configuration conf = new Configuration();

        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.hostname", "mini01");
        conf.set("fs.defaultFS", "hdfs://mini01:9000/");

        // local mode is the default
        // conf.set("mapreduce.framework.name","local");
        // conf.set("mapreduce.jobtracker.address","local");
        // conf.set("fs.defaultFS","file:///");

        Job wcjob = Job.getInstance(conf);

        wcjob.setJar("F:/myWorkPlace/java/dubbo/demo/dubbo-demo/mr-demo1/target/mr.demo-1.0-SNAPSHOT.jar");
        // setJarByClass does not work when submitting from this local client, so setJar is used instead
        // wcjob.setJarByClass(Rjoin.class);

        wcjob.setMapperClass(InverIndexStepOneMapper.class);
        wcjob.setReducerClass(InverIndexStepOneReducer.class);

        // output key/value types of our Mapper
        wcjob.setMapOutputKeyClass(Text.class);
        wcjob.setMapOutputValueClass(IntWritable.class);

        // output key/value types of our Reducer
        wcjob.setOutputKeyClass(Text.class);
        wcjob.setOutputValueClass(IntWritable.class);

        // if no InputFormat is set, TextInputFormat is used by default
        // wcjob.setInputFormatClass(CombineFileInputFormat.class);
        // CombineFileInputFormat.setMaxInputSplitSize(wcjob,4194304);
        // CombineFileInputFormat.setMinInputSplitSize(wcjob,2097152);

        FileSystem fs = FileSystem.get(new URI("hdfs://mini01:9000"), new Configuration(), "root");
        Path path = new Path("hdfs://mini01:9000/wc/index/stepone");
        if (fs.exists(path)) {
            fs.delete(path, true);
        }

        // where the input data lives
        FileInputFormat.setInputPaths(wcjob, new Path("hdfs://mini01:9000/input/index"));
        // where the results are stored
        FileOutputFormat.setOutputPath(wcjob, new Path("hdfs://mini01:9000/wc/index/stepone"));

        boolean res = wcjob.waitForCompletion(true);
        System.exit(res ? 0 : 1);
    }
}
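Because the step-one reducer only sums integer counts (an associative, commutative operation), it could also be registered as a combiner so that partial sums are computed map-side and less data is shuffled. This is an optional tweak that is not part of the listing above; a minimal sketch against the same wcjob object:

// optional: reuse the summing reducer as a combiner to cut shuffle volume
wcjob.setCombinerClass(InverIndexStepOneReducer.class);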

inverIndexStepTwo.java

package inverIndex;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.net.URI;
import java.net.URISyntaxException;

/**
 * Created by tianjun on 2017/3/20.
 */
public class inverIndexStepTwo {

    static class inverIndexStepTwoMapper extends Mapper<LongWritable,Text,Text,Text> {

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
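            // each input line produced by step one has the form: word--filename<TAB>count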
            String line = value.toString();
            String[] word_file = line.split("--");
            String temp = word_file[1].replace("\t","-->");
            context.write(new Text(word_file[0]),new Text(temp));
        }
    }

    static class inverIndexStepTwoReducer extends Reducer<Text,Text,Text,Text>{
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            StringBuffer sb = new StringBuffer();
            for(Text value : values){
                if(sb.length()!=0){
                    sb.append(",");
                }
                sb.append(value.toString());

            }
            context.write(key,new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws IOException, URISyntaxException, ClassNotFoundException, InterruptedException {

        String os = System.getProperty("os.name").toLowerCase();
        if (os.contains("windows")) {
            System.setProperty("HADOOP_USER_NAME", "root");
        }

        Configuration conf = new Configuration();

        conf.set("mapreduce.framework.name","yarn");
        conf.set("yarn.resourcemanager.hostname","mini01");
        conf.set("fs.defaultFS","hdfs://mini01:9000/");

//            local mode is the default
//        conf.set("mapreduce.framework.name","local");
//        conf.set("mapreduce.jobtracker.address","local");
//        conf.set("fs.defaultFS","file:///");


        Job wcjob = Job.getInstance(conf);

        wcjob.setJar("F:/myWorkPlace/java/dubbo/demo/dubbo-demo/mr-demo1/target/mr.demo-1.0-SNAPSHOT.jar");

        //setJarByClass does not work when submitting from this local client, so setJar is used instead
//        wcjob.setJarByClass(Rjoin.class);

        wcjob.setMapperClass(inverIndexStepTwoMapper.class);
        wcjob.setReducerClass(inverIndexStepTwoReducer.class);

        //output key/value types of our Mapper
        wcjob.setMapOutputKeyClass(Text.class);
        wcjob.setMapOutputValueClass(Text.class);


        //output key/value types of our Reducer
        wcjob.setOutputKeyClass(Text.class);
        wcjob.setOutputValueClass(Text.class);


        //if no InputFormat is set, TextInputFormat is used by default
//        wcjob.setInputFormatClass(CombineFileInputFormat.class);
//        CombineFileInputFormat.setMaxInputSplitSize(wcjob,4194304);
//        CombineFileInputFormat.setMinInputSplitSize(wcjob,2097152);


        FileSystem fs = FileSystem.get(new URI("hdfs://mini01:9000"), new Configuration(), "root");
        Path path = new Path("hdfs://mini01:9000/wc/index/steptwo");
        if (fs.exists(path)) {
            fs.delete(path, true);
        }

        //where the input data lives (the output of step one)
//        FileInputFormat.setInputPaths(wcjob, new Path("hdfs://mini01:9000/input/index"));
        FileInputFormat.setInputPaths(wcjob, new Path("hdfs://mini01:9000/wc/index/stepone"));
        //where the final results are stored
//        FileOutputFormat.setOutputPath(wcjob, new Path("hdfs://mini01:9000/wc/index/stepone"));
        FileOutputFormat.setOutputPath(wcjob, new Path("hdfs://mini01:9000/wc/index/steptwo"));

        boolean res = wcjob.waitForCompletion(true);
        System.exit(res ? 0 : 1);

    }
}

Running these two jobs in sequence produces the inverted index described above.
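To check the result without leaving the IDE, one option is to stream the output file from HDFS with the same FileSystem API used in the drivers. The class below is a hypothetical helper, not part of the original code, and the part-r-00000 file name assumes the default single reducer:

package inverIndex;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

import java.net.URI;

/**
 * Hypothetical helper: prints the final inverted index from HDFS,
 * assuming the default single reducer wrote it to part-r-00000.
 */
public class printStepTwoResult {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new URI("hdfs://mini01:9000"), new Configuration(), "root");
        try (FSDataInputStream in = fs.open(new Path("/wc/index/steptwo/part-r-00000"))) {
            // copy the file contents to stdout without closing System.out
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}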