Big Data Primer, Day 7: MapReduce Explained

阿新 • Published: 2018-01-30

I. Overview

1. What is MapReduce?

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.
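
The flow in this paragraph (split, map in parallel, sort, reduce) can be mimicked in plain Python. This is only a single-process sketch of the idea, not Hadoop code: the names `map_chunk`, `shuffle_sort`, `reduce_group`, and `run_job` are made up for illustration, and a thread pool stands in for the cluster's map tasks.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import groupby
from operator import itemgetter


def map_chunk(chunk):
    """Map task: emit a (word, 1) pair for every word in one input chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle_sort(map_outputs):
    """Merge all map outputs and sort them by key before the reduce step."""
    pairs = [pair for output in map_outputs for pair in output]
    return sorted(pairs, key=itemgetter(0))


def reduce_group(key, values):
    """Reduce task: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines, chunk_size=2):
    # Split the input into independent chunks.
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    # Process the chunks in parallel (threads stand in for map tasks).
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(map_chunk, chunks))
    # Sort, group by key, and reduce each group.
    sorted_pairs = shuffle_sort(map_outputs)
    return dict(
        reduce_group(key, [count for _, count in group])
        for key, group in groupby(sorted_pairs, key=itemgetter(0))
    )


if __name__ == "__main__":
    data = ["hello hadoop", "hello mapreduce", "hadoop streaming"]
    print(run_job(data))
```

In a real cluster the chunks are file-system blocks and a failed map task is simply re-run on its chunk, which is possible because the chunks are independent of each other.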

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.
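
To make "schedule tasks on the nodes where data is already present" concrete, here is a toy sketch of locality-aware assignment. It is an assumption-laden simplification, not the actual YARN scheduler: the function `assign_tasks`, the block and node names, and the slot model are all hypothetical.

```python
def assign_tasks(block_locations, free_slots):
    """Assign each block's map task to a node, preferring nodes that store the block.

    block_locations: dict mapping block id -> list of nodes holding a replica
    free_slots: dict mapping node -> number of tasks it can still accept
    Returns a dict mapping block id -> (node, is_local).
    """
    slots = dict(free_slots)  # work on a copy so the caller's dict is untouched
    assignment = {}
    for block, replicas in block_locations.items():
        # First choice: a node that already holds the data (no network transfer).
        local = next((n for n in replicas if slots.get(n, 0) > 0), None)
        if local is not None:
            node, is_local = local, True
        else:
            # Fallback: any node with a free slot; the block is then read remotely.
            node = next((n for n, free in slots.items() if free > 0), None)
            if node is None:
                raise RuntimeError("no free slots left in the cluster")
            is_local = False
        slots[node] -= 1
        assignment[block] = (node, is_local)
    return assignment


if __name__ == "__main__":
    blocks = {"b1": ["n1", "n2"], "b2": ["n1"], "b3": ["n1"]}
    print(assign_tasks(blocks, {"n1": 2, "n2": 1, "n3": 1}))
```

Every task that lands on a node holding its block reads from local disk instead of the network, which is where the high aggregate bandwidth comes from.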

The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster-node, and MRAppMaster per application (see YARN Architecture Guide).

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration.

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager which then assumes the responsibility of distributing the software/configuration to the workers, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java.

Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer.
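
As a rough illustration of what such an executable can look like, the sketch below is a word-count mapper and reducer in one Python script. Hadoop Streaming passes records as lines on stdin and stdout, with the key and value separated by a tab, and the reducer receives its input sorted by key. The script name `wordcount.py` and the `map`/`reduce` argument are assumptions made for this example.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Yield 'word<TAB>1' for every word; the text before the first tab is the key."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word. Assumes the input lines arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) == 2:
    # Run as `wordcount.py map` or `wordcount.py reduce`, reading stdin and writing stdout.
    stage = {"map": mapper, "reduce": reducer}[sys.argv[1]]
    sys.stdout.writelines(out + "\n" for out in stage(sys.stdin))
```

Because both stages only read stdin and write stdout, the pair can be tried locally without a cluster, for example with `cat input.txt | python wordcount.py map | sort | python wordcount.py reduce`, where `sort` plays the role of the framework's sort step.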

Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non JNI™ based).

(Original text from the official Hadoop site)

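Restated in plain terms:

Overview

Hadoop MapReduce is a software framework that makes it easy to write applications which process huge data sets (multiple terabytes) in parallel on large clusters (thousands of nodes) of commodity hardware, reliably and with fault tolerance.
A MapReduce job usually splits the input data set into independent chunks, which the map tasks process fully in parallel. The framework sorts the map outputs and feeds them to the reduce tasks. The job's input and output are normally both stored in a file system. The framework handles task scheduling and monitoring, and re-runs failed tasks.
The compute nodes and the storage nodes are usually the same machines: the MapReduce framework and the Hadoop Distributed File System (see the HDFS Architecture Guide) run on the same set of nodes. This lets the framework schedule tasks on the nodes that already hold the data, which gives very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster node, and one MRAppMaster per application (see the YARN Architecture Guide).
At a minimum, an application specifies the input and output locations and supplies the map and reduce functions by implementing the appropriate interfaces and/or abstract classes. Together with the other job parameters, these make up the job configuration.
The Hadoop job client then submits the job (jar, executable, etc.) and its configuration to the ResourceManager, which distributes the software and configuration to the worker nodes, schedules and monitors the tasks, and reports status and diagnostic information back to the job client.
Although the Hadoop framework itself is implemented in Java™, MapReduce applications do not have to be written in Java.
Hadoop Streaming is a utility that lets users create and run jobs with any executable (for example, shell utilities) as the mapper and/or the reducer.
Hadoop Pipes is a SWIG-compatible C++ API for implementing MapReduce applications (not based on JNI™).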

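To borrow another user's summary:

MapReduce processing has two steps: map and reduce. The input and output of each phase are key-value pairs, and you choose the key and value types yourself. The map phase processes the already-split data in parallel and passes its results to reduce, where the reduce function performs the final aggregation.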
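
The point about key-value types can be shown with a small typed sketch. This is not Hadoop's API; `map_reduce`, `parse_reading`, and `max_temperature` are hypothetical names, and the "year,temperature" record format is invented for the example. The input pairs are (line offset, text line), the intermediate pairs are (year, temperature), and the output pairs are (year, maximum temperature).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple, TypeVar

K1 = TypeVar("K1")  # map input key type
V1 = TypeVar("V1")  # map input value type
K2 = TypeVar("K2")  # intermediate key type
V2 = TypeVar("V2")  # intermediate value type
V3 = TypeVar("V3")  # final value type


def map_reduce(
    records: Iterable[Tuple[K1, V1]],
    map_fn: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reduce_fn: Callable[[K2, List[V2]], V3],
) -> Dict[K2, V3]:
    # Map phase: every input pair may emit any number of intermediate pairs.
    grouped: Dict[K2, List[V2]] = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    # Reduce phase: the values of each key (visited in sorted key order)
    # are aggregated into one final result.
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}


def parse_reading(offset: int, line: str) -> Iterator[Tuple[str, int]]:
    """Map: (offset, 'year,temperature') -> (year, temperature)."""
    year, temp = line.split(",")
    yield year, int(temp)


def max_temperature(year: str, temps: List[int]) -> int:
    """Reduce: (year, [temperatures]) -> maximum temperature."""
    return max(temps)


if __name__ == "__main__":
    records = [(0, "1950,22"), (8, "1950,-11"), (17, "1949,78")]
    print(map_reduce(records, parse_reading, max_temperature))
```

Swapping in a different `map_fn` and `reduce_fn` changes the key and value types without touching `map_reduce` itself, which is the sense in which the types "can be chosen freely".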
