
MapReduce Key Concepts by Example: Multi-Table Join

Contents of the first table:

login:
uid	sexid	logindate
1	1	2017-04-17 08:16:20
2	2	2017-04-15 06:18:20
3	1	2017-04-16 05:16:24
4	2	2017-04-14 03:18:20
5	1	2017-04-13 02:16:25
6	2	2017-04-13 01:15:20
7	1	2017-04-12 08:16:34
8	2	2017-04-11 09:16:20
9	0	2017-04-10 05:16:50

Contents of the second table:

sex:
0	不知道
1	男
2	女

Contents of the third table:

user:
uid	uname
1	小紅
2	小行
3	小通
4	小閃
5	小鎮
6	小振
7	小秀
8	小微
9	小懂
10	小明
11	小剛
12	小舉
13	小黑
14	小白
15	小鵬
16	小習

Expected final output:

loginuid	sex	uname	logindate
1	男	小紅	2017-04-17 08:16:20
2	女	小行	2017-04-15 06:18:20
3	男	小通	2017-04-16 05:16:24
4	女	小閃	2017-04-14 03:18:20
5	男	小鎮	2017-04-13 02:16:25
6	女	小振	2017-04-13 01:15:20
7	男	小秀	2017-04-12 08:16:34
8	女	小微	2017-04-11 09:16:20
9	不知道	小懂	2017-04-10 05:16:50

Approach:

Map-side join

Core idea: place the small table files in the distributed cache, then perform the join on the map side (a minimal sketch of this pattern with the current Hadoop API appears after this list).

When to use: one or more small tables joined against one or more large table files.

Advantages: the small tables are held in memory on the map side, so lookups are fast; the volume of data transferred from map to reduce drops sharply; the time spent in the shuffle drops sharply.

Disadvantage: it only applies when the workload actually includes a table small enough to cache.
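
The full example below uses the older DistributedCache helper class. For reference, here is a minimal, hypothetical sketch of the same two-step pattern with the non-deprecated Hadoop 2.x API: Job.addCacheFile in the driver, and reading the file in setup() through the symlink it creates in the task's working directory. The class name and the /cache/sex path are placeholders, not part of the original example.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheFileSketch {

	static class JoinMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

		private final Map<String, String> sexMap = new HashMap<String, String>();

		@Override
		protected void setup(Context context) throws IOException, InterruptedException {
			// the cached file is symlinked into the task's working directory
			// under the name given after '#', so it can be opened by that name
			BufferedReader br = new BufferedReader(new FileReader("sex"));
			String line;
			while ((line = br.readLine()) != null) {
				String[] fields = line.split("\t");
				sexMap.put(fields[0], fields[1]);
			}
			br.close();
		}

		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			// ... look up sexMap here, exactly as in the full example below ...
		}
	}

	public static void main(String[] args) throws Exception {
		Job job = Job.getInstance();
		job.setJarByClass(CacheFileSketch.class);
		job.setMapperClass(JoinMapper.class);
		// '#sex' names the symlink created in each task's working directory
		job.addCacheFile(new URI("/cache/sex#sex"));
		// input/output paths, output key/value classes, etc. omitted for brevity
	}
}

The '#sex' fragment controls the symlink name, so the mapper never has to inspect cache-file paths at all.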

Semi join

This addresses the weakness of the pure map-side join: when only large files are involved, but the useful data extracted from one of them (for example, just the distinct join keys) would itself be a small file, that extract can be produced separately, placed in the distributed cache, and then joined with an ordinary map-side join. A rough sketch of the extraction step follows.
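
The extraction step itself is not implemented in this article; as a hypothetical sketch (class and field names are illustrative, and the driver is omitted), pulling the distinct join keys out of a large login file could be a trivial MapReduce job of its own:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DistinctUidExtractor {

	// emit the join key (uid, the first tab-separated field) of every record
	static class UidMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
		@Override
		protected void map(LongWritable key, Text value, Context context)
				throws IOException, InterruptedException {
			String[] fields = value.toString().split("\t");
			context.write(new Text(fields[0]), NullWritable.get());
		}
	}

	// the shuffle groups identical uids together; writing each key once de-duplicates them
	static class UidReducer extends Reducer<Text, NullWritable, Text, NullWritable> {
		@Override
		protected void reduce(Text key, Iterable<NullWritable> values, Context context)
				throws IOException, InterruptedException {
			context.write(key, NullWritable.get());
		}
	}
}

Its small output file can then be added to the distributed cache and used exactly like the sex and user tables in the example below.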

Custom Writable class: User

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

/**
 * User info bean
 * @author lyd
 *
 */
public class User implements Writable{

	public String uid;
	public String uname;
	public String gender;
	public String ldt;
	
	public User(){
		
	}
	
	public User(String uid, String uname, String gender, String ldt) {
		this.uid = uid;
		this.uname = uname;
		this.gender = gender;
		this.ldt = ldt;
	}

	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(uid);
		out.writeUTF(uname);
		out.writeUTF(gender);
		out.writeUTF(ldt);
	}

	@Override
	public void readFields(DataInput in) throws IOException {
		this.uid = in.readUTF();
		this.uname = in.readUTF();
		this.gender = in.readUTF();
		this.ldt = in.readUTF();
	}

	/**
	 * @return the uid
	 */
	public String getUid() {
		return uid;
	}

	/**
	 * @param uid the uid to set
	 */
	public void setUid(String uid) {
		this.uid = uid;
	}

	/**
	 * @return the uname
	 */
	public String getUname() {
		return uname;
	}

	/**
	 * @param uname the uname to set
	 */
	public void setUname(String uname) {
		this.uname = uname;
	}

	/**
	 * @return the gender
	 */
	public String getGender() {
		return gender;
	}

	/**
	 * @param gender the gender to set
	 */
	public void setGender(String gender) {
		this.gender = gender;
	}

	/**
	 * @return the ldt
	 */
	public String getLdt() {
		return ldt;
	}

	/**
	 * @param ldt the ldt to set
	 */
	public void setLdt(String ldt) {
		this.ldt = ldt;
	}

	/* (non-Javadoc)
	 * @see java.lang.Object#toString()
	 */
	@Override
	public String toString() {
		return uid + "\t" + uname + "\t" + gender + "\t" + ldt;
	}
}
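
A caveat about using User as the map output key: it implements only Writable, which is enough here because the job runs map-only (see setNumReduceTasks(0) in the driver below). If a reduce phase were added, the map output key type would have to implement WritableComparable so the framework can sort it during the shuffle. A minimal sketch of the extra pieces that would require, assuming ordering by uid is acceptable (the fields and serialization methods stay exactly as above):

import org.apache.hadoop.io.WritableComparable;

public class User implements WritableComparable<User> {

	// uid/uname/gender/ldt fields plus write()/readFields() exactly as above ...

	@Override
	public int compareTo(User other) {
		// any total ordering works; sorting by uid is only an assumption here
		return this.uid.compareTo(other.uid);
	}

	@Override
	public int hashCode() {
		// keep hashCode/equals consistent with compareTo so partitioning
		// and grouping behave predictably
		return uid.hashCode();
	}

	@Override
	public boolean equals(Object obj) {
		return (obj instanceof User) && this.uid.equals(((User) obj).uid);
	}
}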

The MapReduce driver class: MultipleTableJoin

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.filecache.DistributedCache;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
public class MultipleTableJoin extends ToolRunner implements Tool{

	/**
	 * Custom mapper: joins each login record with the cached sex and user tables
	 * @author lyd
	 *
	 */
	static class MyMapper extends Mapper<LongWritable, Text, User, NullWritable>{

		Map<String,String> sexMap = new ConcurrentHashMap<String, String>();
		Map<String,String> userMap = new ConcurrentHashMap<String, String>();
		
		// Read the cached small tables into in-memory maps
		@Override
		protected void setup(Context context)throws IOException, InterruptedException {
			Path [] paths = DistributedCache.getLocalCacheFiles(context.getConfiguration());
			for (Path p : paths) {
				String fileName = p.getName();
				if(fileName.equals("sex")){//讀取 “性別表”
					BufferedReader sb = new BufferedReader(new FileReader(new File(p.toString())));
					String str = null;
					while((str = sb.readLine()) != null){
						String []  strs = str.split("\t");
						sexMap.put(strs[0], strs[1]);
					}
					sb.close();
				} else if(fileName.equals("user")){//讀取“使用者表”
					BufferedReader sb = new BufferedReader(new FileReader(new File(p.toString())));
					String str = null;
					while((str = sb.readLine()) != null){
						String []  strs = str.split("\t");
						userMap.put(strs[0], strs[1]);
					}
					sb.close();
				}
			}
		}

		@Override
		protected void map(LongWritable key, Text value,Context context)
				throws IOException, InterruptedException {
			
			String line = value.toString();
			String lines [] = line.split("\t");
			String uid = lines[0];
			String sexid = lines[1];
			String logindate = lines[2];
			
			// join: only emit records whose sexid and uid both have matches in the cached tables
			if(sexMap.containsKey(sexid) && userMap.containsKey(uid)){
				String uname = userMap.get(uid);
				String gender = sexMap.get(sexid);
				//User user = new User(uid, uname, gender, logindate);
				//context.write(new Text(uid+"\t"+uname+"\t"+gender+"\t"+logindate), NullWritable.get());
				User user = new User(uid, uname, gender, logindate);
				context.write(user, NullWritable.get());
			}	
		}

		@Override
		protected void cleanup(Context context)throws IOException, InterruptedException {
		}
	}
	
	/**
	 * Custom reducer (unused: the join is completed entirely on the map side)
	 * @author lyd
	 *
	 */
	/*static class MyReducer extends Reducer<Text, Text, Text, Text>{

		@Override
		protected void setup(Context context)throws IOException, InterruptedException {
		}
		
		@Override
		protected void reduce(Text key, Iterable<Text> value,Context context)
				throws IOException, InterruptedException {
		}
		
		@Override
		protected void cleanup(Context context)throws IOException, InterruptedException {
		}
	}*/
	
	
	private Configuration conf;

	@Override
	public void setConf(Configuration conf) {
		conf.set("fs.defaultFS", "hdfs://hadoop01:9000");
		// keep a reference to the configuration ToolRunner hands us;
		// returning a fresh Configuration from getConf() would silently drop this setting
		this.conf = conf;
	}

	@Override
	public Configuration getConf() {
		return conf == null ? new Configuration() : conf;
	}
	
	/**
	 * 驅動方法
	 */
	@Override
	public int run(String[] args) throws Exception {
		// 1. Get the Configuration object
		Configuration conf = getConf();
		// 2. Create the job
		Job job = Job.getInstance(conf, "model01");
		// 3. Set the class whose jar should be shipped with the job
		job.setJarByClass(MultipleTableJoin.class);
		// 4. Set map-related properties
		job.setMapperClass(MyMapper.class);
		job.setMapOutputKeyClass(User.class);
		job.setMapOutputValueClass(NullWritable.class);
		// map-only join: with reduce tasks, the User key would have to implement WritableComparable
		job.setNumReduceTasks(0);
		FileInputFormat.addInputPath(job, new Path(args[0]));
		
		// Set the cache files (the two small tables)
		job.addCacheFile(new URI(args[2]));
		job.addCacheFile(new URI(args[3]));
		
//		URI [] uris = {new URI(args[2]),new URI(args[3])};
//		job.setCacheFiles(uris);
		
	/*	DistributedCache.addCacheFile(new URI(args[2]), conf);
		DistributedCache.addCacheFile(new URI(args[3]), conf);*/
		
		/*// 5. Set reduce-related properties (not needed for this map-only join)
		job.setReducerClass(MyReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(Text.class);*/
		// Delete the output directory if it already exists
		FileSystem fs = FileSystem.get(conf);
		if(fs.exists(new Path(args[1]))){
			fs.delete(new Path(args[1]), true);
		}
		FileOutputFormat.setOutputPath(job, new Path(args[1]));
		
		// 6. Submit the job and wait for it to finish
		int isok = job.waitForCompletion(true) ? 0 : 1;
		return isok;
	}
	
	/**
	 * Main entry point of the job
	 * @param args
	 */
	public static void main(String[] args) {
		try {
			// Parse generic Hadoop options out of the command-line arguments
			String [] argss = new GenericOptionsParser(new Configuration(), args).getRemainingArgs();
			System.exit(ToolRunner.run(new MultipleTableJoin(), argss));
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}
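
How the job is invoked: args[0] is the login input path, args[1] the output path, and args[2]/args[3] the URIs of the two cached tables. Because setup() matches on the cached file's name, those URIs must point to files literally named sex and user. A hypothetical invocation (all paths are placeholders) could look like:

hadoop jar multitablejoin.jar MultipleTableJoin /join/login /join/out hdfs://hadoop01:9000/join/sex hdfs://hadoop01:9000/join/user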