
The File Inverted Index Algorithm and Its Hadoop Implementation

What is an inverted index of files?

Simply put, it is an algorithm used by search engines. Through an inverted index, we can quickly retrieve the list of documents that contain a given word. An inverted index consists of two main parts: the "words" and the corresponding "inverted files" (the documents in which each word appears).
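As a minimal sketch in plain Java (the document names and contents are made up for illustration), an inverted index can be thought of as a map from each word to the set of documents containing it:

import java.util.*;

public class TinyInvertedIndex
{
    public static void main(String[] args)
    {
        // hypothetical corpus: document name -> content
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("doc1", "the quick brown fox");
        docs.put("doc2", "the lazy dog");

        // inverted index: word -> documents containing that word
        Map<String, Set<String>> index = new TreeMap<>();
        for (Map.Entry<String, String> e : docs.entrySet())
        {
            for (String word : e.getValue().split("\\s+"))
            {
                index.computeIfAbsent(word, w -> new TreeSet<>()).add(e.getKey());
            }
        }
        System.out.println(index); // {brown=[doc1], dog=[doc2], fox=[doc1], lazy=[doc2], quick=[doc1], the=[doc1, doc2]}
    }
}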

1. The MapReduce Design

The whole process consists of three stages: map, combiner, and reduce. The key/value types of each stage are shown in the table below:

             InputKey    InputValue    OutputKey    OutputValue
Map          Object      Text          Text         Text
Combiner     Text        Text          Text         Text
Reduce       Text        Text          Text         Text

The input files are read with the default TextInputFormat. The three stages do the following:

Map: tokenize the content of each line and output the key "word:document" with the occurrence count as the value, here the Text string "1";

Combiner: for each input key, parse the values as int and sum them, move the document name from the key into the value, and output the key "word" with the value "document:count;…";

Reduce: for each input key, split every value at the colon, sum the extracted occurrence counts while counting the number of documents, and compute the average number of occurrences; the output key is the word followed by a tab and the average count, and the output value is "document:count;…".
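As a concrete illustration (a hypothetical input, not from the original article), suppose there are two files, a.txt containing "江湖 英雄 江湖" and b.txt containing "江湖". Since the file extension is stripped, the records flowing through the three stages would look roughly like this:

Map output:       (江湖:a, 1)   (英雄:a, 1)   (江湖:a, 1)   (江湖:b, 1)
Combiner output:  (江湖, a:2)   (英雄, a:1)   (江湖, b:1)
Reduce output:    (江湖\t1.50, a:2;b:1;)   (英雄\t1.00, a:1;)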

2. MapReduce Code Snippets

The Map code is as follows:
public static class Map extends Mapper<Object,Text,Text,Text>
{
    private Text valueInfo = new Text();
    private Text keyInfo = new Text();
    private FileSplit split;
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException
    {
        split = (FileSplit) context.getInputSplit();
        StringTokenizer stk = new StringTokenizer(value.toString()); // tokenize the line
        while (stk.hasMoreElements())                                // while there are tokens left
        {
            String name = split.getPath().getName();                 // name of the input file
            int splitIndex = name.indexOf(".");                      // position of the dot in the file name
            keyInfo.set(stk.nextToken() + ":" + name.substring(0, splitIndex)); // key = "word:filename" (extension stripped)
            valueInfo.set("1");                                      // value = "1"
            context.write(keyInfo, valueInfo);                       // emit the record
        }
    }
}
The Combiner code is as follows:
public static class Combiner extends Reducer<Text,Text,Text,Text>
{
    Text info = new Text();
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException
    {
        int sum = 0;
        for (Text value : values)
        {
            sum += Integer.parseInt(value.toString()); // sum the occurrences of this word in this document
        }
        int splitIndex = key.toString().indexOf(":");  // position of the colon in the key
        info.set(key.toString().substring(splitIndex + 1) + ":" + sum); // value = "document:count"
        key.set(key.toString().substring(0, splitIndex));               // key = "word"
        context.write(key, info);                                       // emit the record
    }
}

The Reduce code is as follows:
public static class Reduce extends Reducer<Text,Text,Text,Text>
{
    private Text result = new Text();
    public void reduce(Text key, Iterable<Text> values, Context contex)
            throws IOException, InterruptedException
    {
        String fileList = new String();
        double sum = 0, cnt = 0;
        for (Text value : values)
        {
            cnt++;                                          // count the documents the word appears in
            fileList += value.toString() + ";";             // separate "document:count" pairs with semicolons
            int splitIndex = value.toString().indexOf(":");
            sum += Integer.parseInt(value.toString().substring(splitIndex + 1)); // total number of occurrences
        }
        sum /= cnt;                                         // average occurrences per document
        result.set(fileList);                               // value = document list
        key.set(key.toString() + '\t' + String.format("%.2f", sum)); // key = word, tab, average count
        contex.write(key, result);                          // emit the record
    }
}
The final output key is therefore the word followed by a tab and its average occurrence count, and the value is "document:count;…".

Development environment: IntelliJ IDEA + Maven + Java 1.8
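For reference, a sketch of the Maven dependencies this project needs (the version numbers are assumptions, not from the original article; they must match your cluster's Hadoop/HBase release, and since the HTable constructor and Put.add API used below were removed in HBase 2.0, an older hbase-client is required):

<dependencies>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <version>2.7.3</version>  <!-- assumed version -->
    </dependency>
    <dependency>
        <groupId>org.apache.hbase</groupId>
        <artifactId>hbase-client</artifactId>
        <version>0.98.24-hadoop2</version>  <!-- assumed version -->
    </dependency>
</dependencies>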

The inverted index was built over a collection of wuxia novels; a screenshot of the "江湖" entry in the output file is shown below.
The complete code is as follows:
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import org.apache.hadoop.hbase.client.HTable;


public class InvertedIndex
{
    private static Configuration conf2 = null;
    static
    {
        conf2 = HBaseConfiguration.create();
    }

    public static void addData(String tableName, String rowKey, String family,
                               String qualifier, String value )throws Exception
    {
        try
        {
            HTable table = new HTable(conf2, tableName);
            Put put = new Put(Bytes.toBytes(rowKey));
            put.add(Bytes.toBytes(family), Bytes.toBytes(qualifier), Bytes.toBytes(value));
            table.put(put);
            System.out.println("insert success!");
        }
        catch (IOException e)
        {
            e.printStackTrace();
        }
    }

    public static class Map extends Mapper<Object,Text,Text,Text>
    {
        private Text valueInfo = new Text();
        private Text keyInfo = new Text();
        private FileSplit split;
        public void map(Object key, Text value,Context context) throws IOException, InterruptedException
        {
            split = (FileSplit) context.getInputSplit();
            StringTokenizer stk = new StringTokenizer(value.toString());
            while (stk.hasMoreElements())
            {
                String name = split.getPath().getName();
                int splitIndex = name.indexOf(".");
                keyInfo.set(stk.nextToken() + ":" + name.substring(0, splitIndex));
                valueInfo.set("1");
                context.write(keyInfo, valueInfo);
            }
        }
    }

    public static class Combiner extends Reducer<Text,Text,Text,Text>
    {
        Text info = new Text();
        public void reduce(Text key, Iterable<Text> values,Context context) throws IOException, InterruptedException
        {
            int sum = 0;
            for (Text value : values)
            {
                sum += Integer.parseInt(value.toString());
            }
            int splitIndex = key.toString().indexOf(":");
            info.set(key.toString().substring(splitIndex+1) + ":" + sum);
            key.set(key.toString().substring(0,splitIndex));
            context.write(key, info);
        }
    }

    public static class Reduce extends Reducer<Text,Text,Text,Text>
    {
        private Text result = new Text();
        public void reduce(Text key, Iterable<Text> values,Context contex) throws IOException, InterruptedException
        {
            // build the document list
            String fileList = new String();
            double sum = 0 , cnt = 0;
            for (Text value : values)
            {
                cnt++;
                fileList += value.toString() + ";";
                int splitIndex = value.toString().indexOf(":");
                sum += Integer.parseInt(value.toString().substring(splitIndex+1));
            }
            sum /= cnt;

            result.set(fileList);
            //key.set(key.toString() + '\t' + String.format("%.2f", sum));
            try
            {
                addData("test", key.toString(), "BigData", "aveNum", String.format("%.2f", sum));
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
            contex.write(key, result);
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException
    {
        Configuration conf = new Configuration();  // configuration object
        Job job = new Job(conf,"InvertedIndex");   // create the job
        job.setJarByClass(InvertedIndex.class);    // main class of the job

        job.setMapperClass(Map.class);  // mapper settings
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);

        job.setCombinerClass(Combiner.class);  // combiner setting

        job.setReducerClass(Reduce.class);  // reducer settings
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        //FileInputFormat.addInputPath(job, new Path("/data/wuxia_novels/"));//路徑設定
        //FileOutputFormat.setOutputPath(job, new Path("/user/2016st28/exp2/"));
        FileInputFormat.addInputPath(job, new Path("/input/exp2/"));  // input/output paths
        FileOutputFormat.setOutputPath(job, new Path("/output/test/"));

        System.exit(job.waitForCompletion(true)?0:1);
    }
}
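The job can be built and submitted along the following lines (a sketch, not from the original article: the jar name and the local input directory are hypothetical, and the HBase table written by addData() must already exist):

# create the HBase table used by addData() first, in the HBase shell:
#   create 'test', 'BigData'

mvn clean package
hdfs dfs -mkdir -p /input/exp2
hdfs dfs -put novels/*.txt /input/exp2/          # hypothetical local directory of input texts
hadoop jar target/inverted-index-1.0.jar InvertedIndex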