Implementing an Inverted Index with MapReduce
阿新 · Published 2018-12-22
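This post implements a simple inverted index on Hadoop MapReduce: for every word, the job outputs the list of files it appears in, together with a per-file occurrence count. The mapper tags each token with the path of its source file and emits ("word@path", "1"); the combiner sums the counts per (word, path) pair and re-keys the record to the bare word with value "path*count"; the reducer concatenates those per-file records into a single comma-separated line per word. Two source files are involved: the job itself and a small GBK-to-UTF-8 transcoding helper. First, the job driver with its mapper, combiner, and reducer: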
package cheryl.dhcc.mapreduce;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import cheryl.hadooputil.TransformtoUtf8;
// Inverted index: for each word, find the files it appears in
public class InvertedIndex {
    // Mapper: tokenizes each line and tags every token with the path of its
    // source file, emitting ("word@path", "1"). Output value type is Text.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, Text> {
        private static final Text one = new Text("1");
        private Text word = new Text();
        private FileSplit split; // used to obtain the path of the current input file
        private String dirName;

        // key: byte offset of the line; value: the line itself; context: output context
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            // The input files are GBK-encoded, so re-decode them first.
            Text next = TransformtoUtf8.transformTextToUTF8(value, "GBK");
            split = (FileSplit) context.getInputSplit();
            dirName = split.getPath().toString();
            String line = next.toString();
            // Keep only digits and CJK characters, then collapse runs of whitespace.
            line = line.replaceAll("[^0-9\\u4e00-\\u9fa5]", " ");
            line = line.replaceAll("\\s{2,}", " ");
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken() + "@" + dirName);
                context.write(word, one);
            }
        }
    }
    // Combiner: sums the counts per (word, file) pair, then re-keys the record
    // to the bare word with value "path*count". Note that it changes the key,
    // which relies on the combiner actually running; Hadoop does not strictly
    // guarantee that, so this is a shortcut rather than a contract-safe combiner.
    public static class Combiner extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (Text val : values) {
                sum += Integer.parseInt(val.toString());
            }
            // Split "word@path" back into the word and its file path.
            String[] str = key.toString().split("@");
            key.set(str[0]);
            result.set(str[1] + "*" + sum);
            context.write(key, result);
        }
    }
    // Reducer: concatenates the per-file "path*count" records into one
    // comma-separated line per word (despite its name, it does no summing).
    public static class IntSumReducer extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
            StringBuilder value = new StringBuilder();
            for (Text val : values) {
                value.append(val.toString()).append(",");
            }
            result.set(value.toString());
            context.write(key, result);
        }
    }
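    // A hypothetical end-to-end trace for the word 中國 appearing twice in
    // a.txt and once in b.txt (the HDFS paths below are illustrative):
    //   map:     中國@hdfs://ns/in/a.txt -> 1   (emitted twice)
    //            中國@hdfs://ns/in/b.txt -> 1
    //   combine: 中國 -> hdfs://ns/in/a.txt*2
    //            中國 -> hdfs://ns/in/b.txt*1
    //   reduce:  中國 -> hdfs://ns/in/a.txt*2,hdfs://ns/in/b.txt*1,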
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "inverted index");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(Combiner.class);
        job.setReducerClass(IntSumReducer.class);
        // GbkOutputFormat writes the results back out in GBK; see the sketch below.
        job.setOutputFormatClass(GbkOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
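The driver above references a GbkOutputFormat that the post does not include. The following is a minimal sketch of what such a class could look like, assuming it lives in the same package: it plays the role of TextOutputFormat but re-encodes each output line in GBK instead of UTF-8 (separator configuration and compression support are omitted).

package cheryl.dhcc.mapreduce;

import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class GbkOutputFormat extends FileOutputFormat<Text, Text> {
    @Override
    public RecordWriter<Text, Text> getRecordWriter(TaskAttemptContext job)
            throws IOException, InterruptedException {
        Path file = getDefaultWorkFile(job, "");
        final FSDataOutputStream out =
                file.getFileSystem(job.getConfiguration()).create(file, false);
        return new RecordWriter<Text, Text>() {
            @Override
            public void write(Text key, Text value) throws IOException {
                // Re-encode the UTF-8 Text payload as GBK before writing.
                out.write(key.toString().getBytes("GBK"));
                out.write('\t');
                out.write(value.toString().getBytes("GBK"));
                out.write('\n');
            }

            @Override
            public void close(TaskAttemptContext context) throws IOException {
                out.close();
            }
        };
    }
}

The second file is the small transcoding helper used by the mapper: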
package cheryl.hadooputil;
import java.io.UnsupportedEncodingException;
import org.apache.hadoop.io.Text;
public class TransformtoUtf8 {
    // Hadoop assumes UTF-8 by default, so input files in another encoding
    // (here GBK) must be transcoded before processing.
    public static Text transformTextToUTF8(Text text, String encoding) {
        String value = null;
        try {
            // Decode the raw bytes using the given source encoding (e.g. "GBK").
            value = new String(text.getBytes(), 0, text.getLength(), encoding);
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
            return text; // fall back to the original Text if the encoding is unknown
        }
        return new Text(value);
    }
}
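To run the job, package the classes into a jar and submit it with hadoop jar, passing the input directory and a not-yet-existing output directory as the two arguments; each output line then has the form of a word, a tab, and its comma-separated "path*count" list.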