
Big Data Primer, Day 7: MapReduce in Detail


I. Overview

  1. What is MapReduce?

Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which are then input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them and re-executes the failed tasks.

Typically the compute nodes and the storage nodes are the same, that is, the MapReduce framework and the Hadoop Distributed File System (see HDFS Architecture Guide) are running on the same set of nodes. This configuration allows the framework to effectively schedule tasks on the nodes where data is already present, resulting in very high aggregate bandwidth across the cluster.

The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster-node, and MRAppMaster per application (see YARN Architecture Guide).

Minimally, applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract-classes. These, and other job parameters, comprise the job configuration.

The Hadoop job client then submits the job (jar/executable etc.) and configuration to the ResourceManager which then assumes the responsibility of distributing the software/configuration to the workers, scheduling tasks and monitoring them, providing status and diagnostic information to the job-client.

Although the Hadoop framework is implemented in Java™, MapReduce applications need not be written in Java. Hadoop Streaming is a utility which allows users to create and run jobs with any executables (e.g. shell utilities) as the mapper and/or the reducer. Hadoop Pipes is a SWIG-compatible C++ API to implement MapReduce applications (non JNI™ based).
(Original text from the official documentation)

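  In plain terms:

Overview

  Hadoop MapReduce is a software framework that makes it easy to write applications which process very large data sets (multiple terabytes) in parallel on large clusters (thousands of nodes) of commodity hardware, reliably and with fault tolerance.
A MapReduce job usually splits the input data set into independent chunks, which the map tasks process fully in parallel. The framework sorts the map outputs and feeds them to the reduce tasks. Both the input and the output of a job are normally stored in a file system. The framework handles task scheduling and monitoring, and re-executes failed tasks.
Compute nodes and storage nodes are usually the same machines: the MapReduce framework and the Hadoop Distributed File System (see the HDFS Architecture Guide) run on the same set of nodes. This lets the framework schedule tasks on the nodes that already hold the data, which yields very high aggregate bandwidth across the cluster.
The MapReduce framework consists of a single master ResourceManager, one worker NodeManager per cluster node, and an MRAppMaster per application (see the YARN Architecture Guide).
At a minimum, an application specifies the input and output locations and supplies map and reduce functions by implementing the appropriate interfaces and/or abstract classes. Together with other job parameters, these make up the job configuration.
The Hadoop job client then submits the job (jar, executable, etc.) and its configuration to the ResourceManager, which distributes the software and configuration to the workers, schedules and monitors the tasks, and reports status and diagnostic information back to the job client.
Although the Hadoop framework itself is implemented in Java™, MapReduce applications need not be written in Java.
  Hadoop Streaming is a utility that lets users create and run jobs with any executable (for example, shell utilities) as the mapper and/or the reducer.
Hadoop Pipes is a SWIG-compatible C++ API for implementing MapReduce applications (not JNI™ based).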
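Since Hadoop Streaming accepts any executable as mapper or reducer, a word count can be sketched in Python. This is only an illustrative sketch, not code from the official guide: the function names `mapper`, `reducer`, and `local_run` are made up for this example, and it assumes the default Streaming convention of one record per line with the key and value separated by a tab, and reducer input arriving sorted by key. In a real job each function would read `sys.stdin` and print its output lines.

```python
from itertools import groupby
from typing import Iterable, Iterator, List


def mapper(lines: Iterable[str]) -> Iterator[str]:
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines: Iterable[str]) -> Iterator[str]:
    """Sum the counts per word; the input must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


def local_run(lines: Iterable[str]) -> List[str]:
    """Emulate the job locally: map, sort by key, then reduce.

    sorted() stands in for the framework's sort between the two phases.
    """
    return list(reducer(sorted(mapper(lines))))
```

Calling `local_run(["a b a", "b c"])` runs the whole pipeline in one process, which is a convenient way to check the logic before submitting anything to a cluster.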

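    To borrow one user's summary:

  MapReduce processing consists of two steps: map and reduce. The input and output of each phase take the form of key-value pairs, and the key and value types can be chosen freely. The map phase processes the already-split data in parallel and passes its results to reduce, where the reduce function performs the final aggregation.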
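The two-step model can be imitated in a few lines of plain Python to show how the key-value pairs flow. This is a single-process sketch under stated assumptions, not Hadoop code: `map_phase`, `shuffle`, `reduce_phase`, and `run_job` are hypothetical names, the keys are words (`str`) and the values are counts (`int`), and the explicit `shuffle` step stands in for the grouping and sorting that the framework performs between map and reduce.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: turn one input record into zero or more (key, value) pairs."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group values by key and order the keys, as the framework would."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values of one key into a final result."""
    return key, sum(values)


def run_job(splits: List[List[str]]) -> Dict[str, int]:
    """Run map over every split, then shuffle, then reduce.

    On a real cluster each split would go to its own map task in parallel;
    here they are simply processed one after another.
    """
    mapped = [pair for split in splits for line in split
              for pair in map_phase(line)]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())
```

For instance, `run_job([["hello world", "hello hadoop"], ["world of data"]])` treats the two inner lists as two input splits and returns the word counts keyed by word.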
