
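[Translation] Trends in Big Data Processing: An Introduction to Five Open Source Technologies

Author: 楊鑫奇

This post is a translation. The original article gives a forward-looking introduction to technologies in the Big Data field. I think it is well written and the technologies it recommends are fairly representative, so I translated it for anyone who is interested.

I have not been working with Big Data processing for long myself, and my first real project is still under development, but the field has drawn me in, which is what prompted me to write this post.

Original article: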


Big Data is on every CIO’s mind this quarter, and for good reason. Companies will have spent $4.3 billion on Big Data technologies by the end of 2012.

But here’s where it gets interesting. Those initial investments will in turn trigger a domino effect of upgrades and new initiatives that are valued at $34 billion for 2013, per Gartner. Over a five-year period, spending is estimated at $232 billion.

What you’re seeing right now is only the tip of a gigantic iceberg.

Big Data is presently synonymous with technologies like Hadoop, and the “NoSQL” class of databases including Mongo (document stores) and Cassandra (key-values). Today it’s possible to stream real-time analytics with ease. Spinning clusters up and down is a (relative) cinch, accomplished in 20 minutes or less. We have table stakes.

But there are new, untapped advantages and non-trivially large opportunities beyond these usual suspects.

Did you know that there are over 250K viable open source technologies on the market today? Innovation is all around us. The increasing complexity of systems, in fact, looks something like this:

We have a lot of… choices, to say the least.

What’s on our own radar, and what’s coming down the pipe for Fortune 2000 companies? What new projects are the most viable candidates for production-grade usage? Which deserve your undivided attention?

We did all the research and testing so you don’t have to. Let’s look at five new technologies that are shaking things up in Big Data. Here is the newest class of tools that you can’t afford to overlook, coming soon to an enterprise near you.

STORM AND KAFKA
Storm and Kafka are the future of stream processing, and they are already in use at a number of high-profile companies including Groupon, Alibaba, and The Weather Channel.

Born inside of Twitter, Storm is a “distributed real-time computation system”. Storm does for real-time processing what Hadoop did for batch processing. Kafka, for its part, is a messaging system developed at LinkedIn to serve as the foundation for their activity stream and the data processing pipeline behind it.

When paired together, you get the stream, you get it in real time, and you get it at linear scale.

Why should you care?
With Storm and Kafka, you can conduct stream processing at linear scale, assured that every message gets processed in real time, reliably. In tandem, Storm and Kafka can handle data velocities of tens of thousands of messages every second.
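
The pairing described above, a messaging layer feeding a processor that handles each message as it arrives, can be sketched in a few lines. This is only an illustration under loud assumptions: it uses Python's standard `queue` and `threading` modules as stand-ins, not the Kafka or Storm APIs, and the names `produce`, `consume`, and `run_pipeline` are hypothetical.

```python
import queue
import threading
from collections import Counter

SENTINEL = None  # marks the end of the stream in this toy example


def produce(messages, log):
    """Stand-in for the messaging layer: put each message on a shared queue."""
    for message in messages:
        log.put(message)
    log.put(SENTINEL)


def consume(log, counts):
    """Stand-in for the stream processor: count words message by message."""
    while True:
        message = log.get()
        if message is SENTINEL:
            break
        for word in message.split():
            counts[word] += 1


def run_pipeline(messages):
    """Wire producer and consumer together and return the final word counts."""
    log = queue.Queue()
    counts = Counter()
    worker = threading.Thread(target=consume, args=(log, counts))
    worker.start()
    produce(messages, log)
    worker.join()
    return counts


if __name__ == "__main__":
    print(run_pipeline(["storm kafka", "kafka stream", "storm storm"]))
```

The point of the sketch is the shape of the pipeline: the consumer never waits for a complete batch, it processes whatever the queue delivers next. A real deployment would add partitioning, persistence, and delivery guarantees, none of which are modeled here.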

Stream processing solutions like Storm and Kafka have caught the attention of many enterprises due to their superior approach to ETL (extract, transform, load) and data integration.

Storm and Kafka are also great at in-memory analytics and real-time decision support. Companies are quickly realizing that batch processing in Hadoop does not support real-time business needs. Real-time streaming analytics is a must-have component in any enterprise Big Data solution or stack, because of how elegantly it handles the “three V’s”: volume, velocity and variety.

Storm and Kafka are the two technologies on the list that we’re most committed to at Infochimps, and it is reasonable to expect that they’ll be a formal part of our platform soon.

DRILL AND DREMEL
Drill and Dremel make large-scale, ad hoc querying of data possible, with radically lower latencies that are especially apt for data exploration. They make it possible to scan over petabytes of data in seconds, to answer ad hoc queries and, presumably, power compelling visualizations.

Drill and Dremel put power in the hands of business analysts, and not just data engineers. The business side of the house will love Drill and Dremel.

Drill is the open source version of what Google is doing with Dremel (Google also offers Dremel-as-a-Service with its BigQuery offering). Companies are going to want to make the tool their own, which is why Drill is the thing to watch most closely. Although it’s not quite there yet, strong interest by the development community is helping the tool mature rapidly.

Why should you care?
Drill and Dremel compare favorably to Hadoop for anything ad hoc. Hadoop is all about batch processing workflows, which creates certain disadvantages.

The Hadoop ecosystem worked very hard to make MapReduce an approachable tool for ad hoc analyses. From Sawzall to Pig and Hive, many interface layers have been built on top of Hadoop to make it more friendly and business-accessible. Yet, for all of the SQL-like familiarity, these abstraction layers ignore one fundamental reality: MapReduce (and thereby Hadoop) is purpose-built for organized data processing (read: running jobs, or “workflows”).

What if you’re not worried about running jobs? What if you’re more concerned with asking questions and getting answers: slicing and dicing, looking for insights?

That’s “ad hoc exploration” in a nutshell. If you assume data that’s been processed already, how can you optimize for speed? You shouldn’t have to run a new job and wait, sometimes for considerable lengths of time, every time you want to ask a new question.
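
To make the idea of ad hoc exploration concrete, here is a minimal sketch in which data that has already been processed is loaded once and then questioned repeatedly, with each new question answered immediately. It assumes Python's built-in `sqlite3` module and an in-memory table purely for illustration. SQLite is not Drill or Dremel and says nothing about their scale or internals, and the `events` table, its columns, and the helper names are hypothetical.

```python
import sqlite3


def load_events(rows):
    """Load already-processed rows into an in-memory table once."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (region TEXT, product TEXT, revenue REAL)")
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    return conn


def ask(conn, sql, params=()):
    """Run one ad hoc question against the loaded data and return all rows."""
    return conn.execute(sql, params).fetchall()


# Made-up sample data for the illustration.
ROWS = [
    ("east", "widget", 120.0),
    ("east", "gadget", 80.0),
    ("west", "widget", 200.0),
    ("west", "gadget", 50.0),
]

BY_REGION = "SELECT region, SUM(revenue) FROM events GROUP BY region ORDER BY region"
BY_PRODUCT = (
    "SELECT product, SUM(revenue) FROM events "
    "WHERE region = ? GROUP BY product ORDER BY product"
)

if __name__ == "__main__":
    conn = load_events(ROWS)
    # First question: revenue per region.
    print(ask(conn, BY_REGION))
    # Follow-up question, asked right away against the same loaded data.
    print(ask(conn, BY_PRODUCT, ("west",)))
```

The follow-up question reuses the loaded data rather than launching a fresh job, which is the interactive loop the paragraph above describes.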

In stark contrast to workflow-based methodology, most business-driven BI and analytics queries are fundamentally ad hoc, interactive, low-latency analyses. Writing MapReduce workflows is prohibitive for many business analysts. Waiting minutes for jobs to start and hours for workflows to complete is not conducive to an interactive experience of data: the comparing and contrasting, and the zooming in and out, that ultimately creates fundamentally new insights.

Some data scientists even speculate that Drill and Dremel may actually be better than Hadoop in the wider sense, and even a potential replacement. That’s a little too edgy a stance to embrace right now, but there is merit in an approach to analytics that is more query-oriented and low latency.

At Infochimps we like the Elasticsearch full-text search engine and database for doing high-level data exploration, but for truly capable Big Data querying at the (relative) seat level, we think that Drill will become the de facto solution.

R
R is an open source statistical programming language. It is incredibly powerful. Over two million (and counting) analysts use R. It’s been around since 1997, if you can believe it. It is a modern version of the S language for statistical computing that originally came out of Bell Labs. Today, R is quickly becoming the new standard for statistics.

R performs complex data science at a much smaller price (both literally and figuratively). R is making serious headway in ousting SAS and SPSS from their thrones, and has become the tool of choice for the world’s best statisticians (and data scientists, and analysts too).

Why should you care?
Because it has an unusually strong community around it, you can find R libraries for almost anything under the sun, making virtually any kind of data science capability accessible without new code. R is exciting because of who is working on it, and how much net-new innovation is happening on a daily basis. The R community is one of the most thrilling places to be in Big Data right now.

R is also a wonderful way to future-proof your Big Data program. In the last few months, literally thousands of new features have been introduced, replete with publicly available knowledge bases for every analysis type you’d want to do as an organization.

Also, R works very well with Hadoop, making it an ideal part of an integrated Big Data approach.

To keep an eye on: Julia is an interesting and growing alternative to R, because it combats R’s notoriously slow language interpreter problem. The community around Julia isn’t nearly as strong right now, but if you have a need for speed…

GREMLIN AND GIRAPH
Gremlin and Giraph help empower graph analysis, and are often used coupled with graph databases like Neo4j or InfiniteGraph, or in the case of Giraph, working with Hadoop. Golden Orb is another high-profile example of a graph-based project picking up steam.

Graph databases are pretty cutting edge. They have interesting differences with relational databases, which mean that sometimes you might want to take a graph approach rather than a relational approach from the very beginning.

The common analogue for graph-based approaches is Google’s Pregel, of which Gremlin and Giraph are open source alternatives. In fact, here’s a great read on how mimicry of Google technologies is a cottage industry unto itself.

Why should you care?
Graphs do a great job of modeling computer networks, and social networks, too: anything that links data together. Another common use is mapping and geographic pathways, for example calculating shortest routes from place A to place B (or, to return to the social case, tracing the proximity of stated relationships from person A to person B).

Graphs are also popular for bioscience and physics use cases for this reason. They can chart molecular structures unusually well, for example.
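
The shortest-route use case mentioned above, getting from place A to place B, can be sketched with Dijkstra's algorithm over a small weighted graph. This is a generic textbook technique written in plain Python with the standard `heapq` module, not the Gremlin, Giraph, or Pregel API. The `ROADS` map and the function name `shortest_route` are made up for the illustration.

```python
import heapq


def shortest_route(graph, start, goal):
    """Return (total_distance, path), or (inf, []) if the goal is unreachable."""
    # Each frontier entry is (distance so far, current node, path taken).
    frontier = [(0, start, [start])]
    visited = set()
    while frontier:
        distance, node, path = heapq.heappop(frontier)
        if node == goal:
            return distance, path
        if node in visited:
            continue  # a cheaper route to this node was already expanded
        visited.add(node)
        for neighbor, weight in graph.get(node, {}).items():
            if neighbor not in visited:
                heapq.heappush(
                    frontier, (distance + weight, neighbor, path + [neighbor])
                )
    return float("inf"), []


# Hypothetical one-way road map: node -> {neighbor: distance}.
ROADS = {
    "A": {"C": 2, "D": 5},
    "C": {"D": 1, "B": 7},
    "D": {"B": 3},
    "B": {},
}

if __name__ == "__main__":
    print(shortest_route(ROADS, "A", "B"))  # (6, ['A', 'C', 'D', 'B'])
```

The same function covers the social case if every edge weight is 1: the distance then counts how many stated relationships separate person A from person B.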

Big picture, graph databases and analysis languages and frameworks are a great illustration of how the world is starting to realize that Big Data is not about having one database or one programming framework that accomplishes everything. Graph-based approaches are a killer app, so to speak, for anything that involves large networks with many nodes, and many linked pathways between those nodes.

The most innovative scientists and engineers know to apply the right tool for each job, making sure everything plays nice and can talk to each other (the glue in this sense becomes the core competence).

SAP HANA
SAP Hana is an in-memory analytics platform that includes an in-memory database and a suite of tools and software for creating analytical processes and moving data in and out, in the right formats.

Why should you care?
SAP is going against the grain of most entrenched enterprise mega-players by providing a very powerful product, free for development use. And it’s not only that: SAP is also creating meaningful incentives for startups to embrace Hana as well. They are authentically fostering community involvement, and there is uniformly positive sentiment around Hana as a result.

Hana highly benefits any applications with unusually fast processing needs, such as financial modeling and decision support, website personalization, and fraud detection, among many other use cases.

The biggest drawback of Hana is that “in-memory” means that it by definition leverages access to solid state memory, which has clear advantages, but is much more expensive than conventional disk storage.

For organizations that don’t mind the added operational cost, Hana means incredible speed for very low-latency Big Data processing.

HONORABLE MENTION: D3
D3 doesn’t make the list quite yet, but it’s close, and worth mentioning for that reason.

D3 is a JavaScript document visualization library that revolutionizes how powerfully and creatively we can visualize information, and make data truly interactive. It was created by Michael Bostock and came out of his work at the New York Times, where he is the Graphics Editor.

For example, you can use D3 to generate an HTML table from an array of numbers. Or, you can use the same data to create an interactive bar chart with smooth transitions and interaction.
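
The first example above, turning an array of numbers into an HTML table, comes down to mapping each data item to a piece of markup. The sketch below shows only that data-to-markup idea in plain Python. It is not D3 and does not use D3's API (D3 itself is a JavaScript library that works against the browser DOM), and the function name `numbers_to_table` is hypothetical.

```python
from html import escape


def numbers_to_table(numbers):
    """Render one single-cell table row per number in the input sequence."""
    rows = "".join(f"<tr><td>{escape(str(n))}</td></tr>" for n in numbers)
    return f"<table>{rows}</table>"


if __name__ == "__main__":
    print(numbers_to_table([4, 8, 15]))
```

Because the markup is derived from the data, changing the array changes the table, and the same array could just as well drive a different rendering such as a bar chart.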

Here’s an example of D3 in action, making President Obama’s 2013 budget proposal understandable, and navigable.

With D3, programmers can create dashboards galore. Organizations of all sizes are quickly embracing D3 as a superior visualization platform to the heads-up displays of yesteryear.

Editor’s note: Tim Gasper is the Product Manager at Infochimps, the #1 Big Data platform in the cloud. He leads product marketing, product development, and customer discovery. Previously, he was co-founder and CMO at Keepstream, a social media curation and analytics company that Infochimps acquired in August of 2010. You should follow him on Twitter here.


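Although this article is not long, it still took me a while to translate, and I welcome corrections wherever the translation falls short. When I first read it, I wanted to share it with people who would enjoy it. Thanks to its open environment, the United States keeps producing pleasant surprises in IT, and of course we need to keep up.

I have been using Hadoop in earnest for nearly a year now. During that time I left Baidu, moved to 初見, and then to my current company, BitWare, solving problems with different technologies at each. In essence, though, the problems I run into are always the same few. Many companies are now starting to try out Hadoop as well. That is simply the general climate, which is understandable.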

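Here are my own thoughts on the article:

Storm and Kafka: I have been following them since 2011. Storm has some second-tier applications at Alibaba. Overall, Storm, which has only just turned one year old, is becoming more and more stable under the care of nathanmarz, and some production applications now run on it. So on the whole I am quite optimistic about this technology. Hadoop cannot do real-time processing. For now we get by with HBase as our main database, but I would still like to try Storm. I have not paid much attention to Kafka, but I hear the two work very well together. I have not run them myself.

Drill is an Apache open source project. I read Google's Dremel paper earlier but did not understand much of it. I have not yet worked in an environment that calls for it, and the community is only just heating up, so I have not had much time to follow it and have set it aside for now.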

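R: when I was at Baidu, the colleagues sitting next to me used R for their work. This may be an area that only large companies have the capacity to really dig into. Our current business has basically never used it, so R is still unfamiliar to me. My personal view, though, is that different environments call for different technical means. It is like PhDs using acoustics, optics, and electronics to blow boxes away while we simply set up an electric fan: both get the job done.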

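As for graph databases, I really have not come across a concrete application, and I have not had the chance to work at a company that uses them, so I will leave them on the shelf for now.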

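SAP: I have heard of the company but have never worked with it directly. Selling solutions is probably not easy these days, so putting something out to raise their profile is necessary. The era of living off past success is over.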

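As for the last item, the JavaScript visualization library, I am not very interested. We are not doing front-end work at the moment, so it does not matter much to me.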