
Big Data Fundamentals: ORC (1) Introduction


https://orc.apache.org


Optimized Row Columnar (ORC) file

Hierarchy:

file -> stripes -> row groups (10,000 rows)

Hybrid row/columnar storage
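The hierarchy above can be sketched in a few lines of Python. The chunk sizes here are illustrative, not ORC's real defaults:

```python
# Toy illustration of the ORC layout hierarchy: a file is split into
# stripes, and each stripe is split into row groups. The row counts
# per stripe / per group below are made up for the example.

def chunk(rows, size):
    """Split a list of rows into consecutive chunks of at most `size` rows."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def to_hierarchy(rows, rows_per_stripe=30, rows_per_group=10):
    """file -> stripes -> row groups."""
    return [chunk(stripe, rows_per_group) for stripe in chunk(rows, rows_per_stripe)]

rows = list(range(65))            # 65 rows in the "file"
stripes = to_hierarchy(rows)
print(len(stripes))               # 3 stripes (30 + 30 + 5 rows)
print([len(s) for s in stripes])  # [3, 3, 1] row groups per stripe
```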

Background

Back in January 2013, we created ORC files as part of the initiative to massively speed up Apache Hive and improve the storage efficiency of data stored in Apache Hadoop. The focus was on enabling high speed processing and reducing file sizes.

ORC was created to speed up Hive queries and to save disk space in Hadoop.

ORC is a self-describing type-aware columnar file format designed for Hadoop workloads. It is optimized for large streaming reads, but with integrated support for finding required rows quickly. Storing data in a columnar format lets the reader read, decompress, and process only the values that are required for the current query. Because ORC files are type-aware, the writer chooses the most appropriate encoding for the type and builds an internal index as the file is written.

ORC is a self-describing columnar file format, specially optimized for large streaming reads while also supporting fast location of required rows; columnar storage lets the reader read, decompress, and process only the values it needs.

Predicate pushdown uses those indexes to determine which stripes in a file need to be read for a particular query and the row indexes can narrow the search to a particular set of 10,000 rows. ORC supports the complete set of types in Hive, including the complex types: structs, lists, maps, and unions.
(In ORC, the minimum and maximum values of each column are recorded per file, per stripe (~1M rows), and every 10,000 rows. Using this information, the reader should skip any segment that could not possibly match the query predicate.
Predicate pushdown is amazing when it works, but for a lot of data sets, it doesn't work at all. If the data has a large number of distinct values and is well-shuffled, the minimum and maximum stats will cover almost the entire range of values, rendering predicate pushdown ineffective.)

Predicate pushdown is a concept borrowed from RDBMSs: push filter conditions as early in the pipeline as possible without affecting the result, which can significantly reduce the amount of data flowing through the query.

ORC's predicate pushdown uses these indexes to decide which stripes in a file need to be read, and can narrow the search down to sets of 10,000 rows. The mechanism: in ORC, indexes record statistics such as each column's minimum and maximum per file, per stripe, and per 10,000 rows, so from the query predicate it is easy to decide whether a given file / stripe / 10,000-row group needs to be read at all.
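The min/max pruning described above can be sketched in plain Python. This is a simplification of ORC's row-index statistics, but the skipping logic is the same, and it also shows why pushdown degrades on well-shuffled data:

```python
# Sketch of min/max-based predicate pushdown: keep per-row-group
# statistics for one column, and skip any group whose [min, max] range
# cannot contain the value we look for. Group size 10,000 matches
# ORC's default row-group size.
import random

def build_stats(values, group_size=10_000):
    """Record (min, max) for each row group of a single column."""
    return [(min(values[i:i + group_size]), max(values[i:i + group_size]))
            for i in range(0, len(values), group_size)]

def groups_to_read(stats, predicate_value):
    """Indexes of the row groups that might contain predicate_value."""
    return [i for i, (lo, hi) in enumerate(stats)
            if lo <= predicate_value <= hi]

# Sorted data: each group covers a narrow range, so pushdown is effective.
sorted_col = list(range(50_000))
print(groups_to_read(build_stats(sorted_col), 12_345))  # [1] -> 1 of 5 groups read

# Well-shuffled data: every group's min/max spans almost the whole value
# range, so (with near certainty) no group can be skipped.
random.seed(0)
shuffled = sorted_col[:]
random.shuffle(shuffled)
print(groups_to_read(build_stats(shuffled), 12_345))
```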

ORC supports all of Hive's data types.

ORC files are divided into stripes that are roughly 64MB by default. The stripes in a file are independent of each other and form the natural unit of distributed work. Within each stripe, the columns are separated from each other so the reader can read just the columns that are required.

An ORC file is divided into multiple stripes, each roughly 64 MB (about one million rows); the stripes within a file are independent of each other; within a stripe, the columns are stored separately.

[Figure: ORC file structure]

ORC uses type specific readers and writers that provide light weight compression techniques such as dictionary encoding, bit packing, delta encoding, and run length encoding – resulting in dramatically smaller files. Additionally, ORC can apply generic compression using zlib, or Snappy on top of the lightweight compression for even smaller files. However, storage savings are only part of the gain. ORC supports projection, which selects subsets of the columns for reading, so that queries reading only one column read only the required bytes. Furthermore, ORC files include light weight indexes that include the minimum and maximum values for each column in each set of 10,000 rows and the entire file. Using pushdown filters from Hive, the file reader can skip entire sets of rows that aren’t important for this query.

ORC uses type-specific readers and writers to apply lightweight compression, which dramatically shrinks the files; on top of this built-in lightweight compression, ORC can also apply general-purpose compression formats such as zlib or Snappy.
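Two of the lightweight encodings named above can be demonstrated in miniature. Real ORC encodings are more elaborate (e.g. RLE v2 combines delta encoding and bit packing); this only shows the idea:

```python
# Toy versions of two type-specific lightweight encodings used by ORC:
# run length encoding for repeated values, and dictionary encoding for
# low-cardinality string columns (store each distinct value once).

def rle_encode(values):
    """[(value, run_length), ...] — compact when values repeat."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([v, 1])       # start a new run
    return [tuple(r) for r in runs]

def dict_encode(values):
    """(dictionary, indexes): each row stores a small integer index."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]

print(rle_encode([7, 7, 7, 7, 3, 3]))         # [(7, 4), (3, 2)]
print(dict_encode(["us", "cn", "us", "us"]))  # (['cn', 'us'], [1, 0, 1, 1])
```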

ORC stores the top level index at the end of the file. The overall structure of the file is given in the figure above. The file's tail consists of three parts: the file metadata, the file footer, and the postscript.
The metadata for ORC is stored using Protocol Buffers, which provides the ability to add new fields without breaking readers.

ORC stores the top-level index at the end of the file; the file tail consists of three parts: metadata, footer, and postscript; the metadata is stored in Protocol Buffers format (so new fields can be added without breaking readers of old data).
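A minimal sketch of why the tail-at-the-end layout works: a reader seeks to the end, reads the last byte to get the postscript length, then walks backwards to locate the footer, never scanning the data. The byte layout below mimics ORC's tail ([footer][postscript][1-byte postscript length]) but uses plain bytes instead of Protocol Buffers:

```python
# Toy file with an ORC-style tail. The postscript records the footer
# length (here as a 4-byte big-endian int; real ORC uses protobuf),
# and the file's final byte is the postscript length.

def write_file(data: bytes, footer: bytes, postscript_extra: bytes) -> bytes:
    postscript = len(footer).to_bytes(4, "big") + postscript_extra
    assert len(postscript) < 256          # must fit in the 1-byte length
    return data + footer + postscript + bytes([len(postscript)])

def read_tail(blob: bytes):
    ps_len = blob[-1]                     # last byte: postscript length
    postscript = blob[-1 - ps_len:-1]
    footer_len = int.from_bytes(postscript[:4], "big")
    footer = blob[-1 - ps_len - footer_len:-1 - ps_len]
    return footer, postscript[4:]

blob = write_file(b"row data....", b"FOOTER", b"zlib")
print(read_tail(blob))                    # (b'FOOTER', b'zlib')
```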

Stripes

The body of ORC files consists of a series of stripes. Stripes are large (typically ~200MB) and independent of each other and are often processed by different tasks. The defining characteristic for columnar storage formats is that the data for each column is stored separately and that reading data out of the file should be proportional to the number of columns read.

Each ORC file consists of multiple stripes; stripes are independent of one another and can be processed in parallel by different tasks.

In ORC files, each column is stored in several streams that are stored next to each other in the file.

The stripe footer contains the encoding of each column and the directory of the streams including their location.

Stripes have three sections: a set of indexes for the rows within the stripe, the data itself, and a stripe footer. Both the indexes and the data sections are divided by columns so that only the data for the required columns needs to be read.

A stripe has three sections: a set of indexes, the data, and a footer; in both the index and data sections, columns are stored separately from one another.
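The three stripe sections, and the column projection they enable, can be modeled with a small dictionary-based sketch (the in-memory layout here is invented for illustration; real stripes are byte streams described by a protobuf stripe footer):

```python
# Toy stripe with ORC's three sections: per-column indexes (min/max),
# per-column data streams, and a footer standing in for the stream
# directory. Because columns are stored separately, a reader can touch
# only the columns a query asks for (projection).

def build_stripe(rows, columns):
    data = {col: [row[i] for row in rows] for i, col in enumerate(columns)}
    index = {col: (min(vals), max(vals)) for col, vals in data.items()}
    footer = {"columns": columns, "rows": len(rows)}
    return {"index": index, "data": data, "footer": footer}

def project(stripe, wanted):
    """Read just the requested columns."""
    return {col: stripe["data"][col] for col in wanted}

stripe = build_stripe([(1, "a"), (2, "b"), (3, "c")], ["id", "name"])
print(project(stripe, ["id"]))        # {'id': [1, 2, 3]}
print(stripe["index"]["id"])          # (1, 3)
```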

The row group indexes consist of a ROW_INDEX stream for each primitive column that has an entry for each row group. Row groups are controlled by the writer and default to 10,000 rows. Each RowIndexEntry gives the position of each stream for the column and the statistics for that row group.

Row groups default to 10,000 rows.

Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards. Predicate pushdown can make use of bloom filters to better prune the row groups that do not satisfy the filter condition. A BLOOM_FILTER stream records a bloom filter entry for each row group (default to 10,000 rows) in a column. Only the row groups that satisfy min/max row index evaluation will be evaluated against the bloom filter index.

Bloom filters were introduced in Hive 1.2.0 to better support predicate pushdown.
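The idea behind the BLOOM_FILTER stream can be sketched with a pure-Python bloom filter (the bit size and hash construction below are arbitrary choices for the example, not ORC's): membership tests may return false positives but never false negatives, so a "no" answer safely lets the reader skip the whole row group.

```python
# Minimal bloom filter: a bitset plus k hash functions derived from
# SHA-256. In ORC, one such filter is kept per row group per column,
# and is consulted only after the min/max row-index check passes.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                      # integer used as a bitset

    def _positions(self, value):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all(self.bits >> pos & 1 for pos in self._positions(value))

bf = BloomFilter()
for v in ["apple", "pear"]:
    bf.add(v)
print(bf.might_contain("apple"))   # True (inserted values always hit)
print(bf.might_contain("grape"))   # False with very high probability
```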
