
Simulating Hadoop's MapReduce Process with Python

2012-06-25

This post uses Python to simulate Hadoop's MapReduce process in a simple way, which makes it easier to understand how Hadoop works.

In short, the MapReduce process is:
given a list of (key, value) pairs, list1, produce another (key, value) list containing the desired result:
list1(k1,v1) -> map -> list2(k2,v2) -> sort, combine, shuffle -> list3(k3,list(v3)) -> reduce -> list4(k4,v4)
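To make this data flow concrete, here is a minimal in-memory sketch of the same four stages for word counting. It is not part of the original scripts; the function name simulate_wordcount and the variable names are only illustrative.

import itertools

def simulate_wordcount(lines):
    # map: each input line becomes (word, 1) pairs -> list2(k2,v2)
    list2 = [(w, 1) for line in lines for w in line.split()]
    # sort: bring identical keys next to each other (stands in for shuffle)
    list2.sort(key=lambda kv: kv[0])
    # group: (k2, v2) pairs become (k3, list(v3)) -> list3
    list3 = [(k, [v for _, v in g])
             for k, g in itertools.groupby(list2, key=lambda kv: kv[0])]
    # reduce: sum each value list to get (k4, v4) -> list4
    return [(k, sum(vs)) for k, vs in list3]

print(simulate_wordcount(["a b c d", "a b c d", "foo foo quux labs foo bar quux"]))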

Below, Python code simulates the whole process. Since there is no distributed environment, the combine and shuffle steps are omitted. The example implements word counting.
The programs run in a plain Python environment without Hadoop and consist of the following files:
mymapred.py mymap.py mysort.py myreduce.py test.txt

1. The main program:

$ cat mymapred.py
#!/usr/bin/env python
# author zhouhh
# http://abloz.com
# date:2012.6.25

#$ cat test.txt
#a b c d
#a b c d
#aa bb cc dd
#ee ff gg hh
#foo foo quux labs foo bar quux

import sys

filename = 'test.txt'

if len(sys.argv) == 2:
    filename= sys.argv[1]


# stream the input file to stdout line by line,
# playing the role of Hadoop feeding input records to the mapper
lines = open(filename,'r').readlines()
for line in lines:
    print line.strip()

2. The map program:

$ cat mymap.py
#!/usr/bin/env python
# author zhouhh
# http://abloz.com
# date:2012.6.25
import sys

# emit one tab-separated (word, 1) pair per word
def mapword(w):
    print "%s\t%d" % (w, 1)

for line in sys.stdin:
    line = line.strip()

    words = line.split()
    # Python 2's map() is eager, so this prints one pair per word
    m = map(mapword, words)
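Note that the script relies on Python 2 behavior: print statements and an eager map(). Under Python 3, map() is lazy and nothing would be printed. A rough Python 3-compatible equivalent of the mapper, offered only as a sketch and not part of the original post, could look like this:

#!/usr/bin/env python3
import sys

# emit one tab-separated (word, 1) pair per word read from stdin
for line in sys.stdin:
    for w in line.split():
        print("%s\t%d" % (w, 1))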

3. The sort program (this stands in for Hadoop's sort/shuffle step, which brings identical keys together):

$ cat mysort.py
#!/usr/bin/env python
# author zhouhh
# http://abloz.com
# date:2012.6.25
import sys

# unused here; left over from mymap.py
def mapword(w):
    print "%s\t%d" % (w, 1)

# read the mapper's tab-separated (word, count) pairs,
# sort them by key and print them back out
m = []
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t')
    m.append((word, count))

m = sorted(m)
for i in m:
    print "%s\t%s" % i

4. The reduce program:

$ cat myreduce.py
#!/usr/bin/env python
# author zhouhh
# http://abloz.com
# date:2012.6.25
import sys

current_word = None
current_count = 0
word = None

# the input arrives sorted by key, so all counts for a word are adjacent;
# accumulate until the key changes, then emit the total for the previous key
for line in sys.stdin:
    line = line.strip()

    word, count = line.split('\t', 1)

    try:
        count = int(count)
    except ValueError:
        # skip lines whose count is not a number
        continue

    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# emit the last key
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
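The reducer above only works because the sort step has already grouped identical keys together. For comparison, here is a sketch (not part of the original post) of a reducer that does not need sorted input, using a plain dictionary:

#!/usr/bin/env python
import sys

# count words with a dict instead of relying on sorted input
counts = {}
for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    try:
        counts[word] = counts.get(word, 0) + int(count)
    except ValueError:
        continue

for word in sorted(counts):
    print("%s\t%d" % (word, counts[word]))

The streaming-style reducer in myreduce.py is closer to how Hadoop actually works, since it can process an arbitrarily large sorted stream without holding every key in memory at once.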

5. The test file (you can supply your own):

$ ./mymapred.py test.txt
a b c d
a b c d
aa bb cc dd
ee ff gg hh
foo foo quux labs foo bar quux

6. Running the full pipeline:

$ ./mymapred.py test.txt | ./mymap.py | ./mysort.py | ./myreduce.py
a       2
aa      1
b       2
bar     1
bb      1
c       2
cc      1
d       2
dd      1
ee      1
ff      1
foo     3
gg      1
hh      1
labs    1
quux    2
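The same pipeline can also be driven from Python itself. A minimal sketch using the standard subprocess module, assuming the four scripts are executable in the current directory:

import subprocess

# chain the four scripts with pipes, mirroring the shell command above
p1 = subprocess.Popen(['./mymapred.py', 'test.txt'], stdout=subprocess.PIPE)
p2 = subprocess.Popen(['./mymap.py'], stdin=p1.stdout, stdout=subprocess.PIPE)
p3 = subprocess.Popen(['./mysort.py'], stdin=p2.stdout, stdout=subprocess.PIPE)
p4 = subprocess.Popen(['./myreduce.py'], stdin=p3.stdout)
p4.wait()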

7. Running it step by step:

$ ./mymapred.py test.txt
a b c d
a b c d
aa bb cc dd
ee ff gg hh
foo foo quux labs foo bar quux

Run the map step:

$ ./mymapred.py test.txt | ./mymap.py
a       1
b       1
c       1
d       1
a       1
b       1
c       1
d       1
aa      1
bb      1
cc      1
dd      1
ee      1
ff      1
gg      1
hh      1
foo     1
foo     1
quux    1
labs    1
foo     1
bar     1
quux    1

Run the sort step:

$ ./mymapred.py test.txt | ./mymap.py | ./mysort.py
a       1
a       1
aa      1
b       1
b       1
bar     1
bb      1
c       1
c       1
cc      1
d       1
d       1
dd      1
ee      1
ff      1
foo     1
foo     1
foo     1
gg      1
hh      1
labs    1
quux    1
quux    1

Finally, run the reduce step:

$ ./mymapred.py test.txt | ./mymap.py | ./mysort.py | ./myreduce.py

The output is identical to the full pipeline run shown in step 6.

Unless marked as a repost, all content here is original. This site follows the Creative Commons (CC) license; please credit the source when reposting.