
[Original] Big Data Fundamentals of Presto (1): Introduction, Installation, and Usage


Presto 0.217


Official site: http://prestodb.github.io/

I. Introduction

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Presto was designed and written from the ground up for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of organizations like Facebook.


Presto allows querying data where it lives, including Hive, Cassandra, relational databases or even proprietary data stores. A single Presto query can combine data from multiple sources, allowing for analytics across your entire organization.

Presto is targeted at analysts who expect response times ranging from sub-second to minutes. Presto breaks the false choice between having fast analytics using an expensive commercial solution or using a slow "free" solution that requires excessive hardware.


Facebook uses Presto for interactive queries against several internal data stores, including their 300PB data warehouse. Over 1,000 Facebook employees use Presto daily to run more than 30,000 queries that in total scan over a petabyte each per day.



Leading internet companies including Airbnb and Dropbox are using Presto.

Below is Airbnb's assessment:

Presto is amazing. Lead engineer Andy Kramolisch got it into production in just a few days. It's an order of magnitude faster than Hive in most our use cases. It reads directly from HDFS, so unlike Redshift, there isn't a lot of ETL before you can use it. It just works.

--Christopher Gutierrez, Manager of Online Analytics, Airbnb


Architecture

Presto is a distributed system that runs on a cluster of machines. A full installation includes a coordinator and multiple workers. Queries are submitted from a client such as the Presto CLI to the coordinator. The coordinator parses, analyzes and plans the query execution, then distributes the processing to the workers.

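The flow described above (a client submits a query to the coordinator, which parses and plans it and farms the work out to workers) can be sketched as a toy model. Everything here is illustrative: the function names and the fixed per-shard row count are invented for the sketch and do not come from Presto itself.

```python
from concurrent.futures import ThreadPoolExecutor

def plan(query: str, num_workers: int):
    """Toy 'coordinator' step: pretend to parse/analyze the query and
    produce one split (unit of work) per worker."""
    return [(query, split_id) for split_id in range(num_workers)]

def execute_split(split):
    """Toy 'worker' step: each split scans its shard of the data and
    returns a partial result (here, a fake fixed count of 10 rows)."""
    query, split_id = split
    return 10

def run_query(query: str, num_workers: int = 4):
    splits = plan(query, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(execute_split, splits))
    # The coordinator merges the partial results into the final answer.
    return sum(partials)

print(run_query("SELECT count(*) FROM hive.web.clicks"))  # 40
```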

[Figure: Presto architecture diagram: client, coordinator, and workers]

Presto supports pluggable connectors that provide data for queries. The requirements vary by connector.

Presto provides pluggable connectors for querying external data sources; built-in connectors include Hive, Cassandra, Elasticsearch, Kafka, Kudu, MongoDB, MySQL, Redis, and many more.

For details, see: https://prestodb.github.io/docs/current/connector.html

II. Installation

1 Download

# wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.217/presto-server-0.217.tar.gz
# tar xvf presto-server-0.217.tar.gz
# cd presto-server-0.217

2 Prepare the data directory

Presto needs a data directory for storing logs, etc. We recommend creating a data directory outside of the installation directory, which allows it to be easily preserved when upgrading Presto.

3 Prepare the configuration directory

Create an etc directory inside the installation directory. This will hold the following configuration:

  • Node Properties: environmental configuration specific to each node
  • JVM Config: command line options for the Java Virtual Machine
  • Config Properties: configuration for the Presto server
  • Catalog Properties: configuration for Connectors (data sources)
# mkdir etc

# cat etc/node.properties
node.environment=production
node.id=ffffffff-ffff-ffff-ffff-ffffffffffff
node.data-dir=/var/presto/data

# cat etc/jvm.config
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:+ExitOnOutOfMemoryError

# cat etc/config.properties  (on the coordinator)
coordinator=true
node-scheduler.include-coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery-server.enabled=true
discovery.uri=http://example.net:8080

# cat etc/config.properties  (on each worker)
coordinator=false
http-server.http.port=8080
query.max-memory=50GB
query.max-memory-per-node=1GB
query.max-total-memory-per-node=2GB
discovery.uri=http://example.net:8080

# cat etc/log.properties
com.facebook.presto=INFO

Notes:
1) config.properties differs between the coordinator and the workers; the main difference is that the coordinator also runs the discovery service.

discovery-server.enabled: Presto uses the Discovery service to find all the nodes in the cluster. Every Presto instance will register itself with the Discovery service on startup. In order to simplify deployment and avoid running an additional service, the Presto coordinator can run an embedded version of the Discovery service. It shares the HTTP server with Presto and thus uses the same port.
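Because the embedded discovery service shares the coordinator's HTTP server, discovery.uri must point at the coordinator's http-server.http.port. A small sanity check on a config fragment (the parsing below is my own sketch, not a Presto tool; the property names match the files above):

```python
from urllib.parse import urlparse

config = """\
coordinator=true
http-server.http.port=8080
discovery-server.enabled=true
discovery.uri=http://example.net:8080
"""

# Parse the simple key=value lines of config.properties
props = dict(line.split("=", 1) for line in config.splitlines() if "=" in line)

discovery_port = urlparse(props["discovery.uri"]).port
http_port = int(props["http-server.http.port"])
assert discovery_port == http_port, "discovery.uri must use the coordinator's HTTP port"
print("discovery.uri port matches http-server.http.port")
```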

2) If the coordinator and the workers run on different machines, set

node-scheduler.include-coordinator=false

If a single machine acts as both coordinator and worker (e.g. a test install), set

node-scheduler.include-coordinator=true

node-scheduler.include-coordinator: Allow scheduling work on the coordinator. For larger clusters, processing work on the coordinator can impact query performance because the machine’s resources are not available for the critical task of scheduling, managing and monitoring query execution.
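One more note on the memory settings in config.properties: query.max-memory caps a single query's distributed memory across the whole cluster, while query.max-memory-per-node caps its share on any one worker. A consequence (rough arithmetic, ignoring Presto's finer distinction between user and total memory) is that a query can only approach the cluster-wide cap if enough workers participate:

```python
import math

query_max_memory_gb = 50          # query.max-memory (cluster-wide, per query)
query_max_memory_per_node_gb = 1  # query.max-memory-per-node (per worker)

# Minimum number of workers a query needs before the cluster-wide cap,
# rather than the per-node cap, becomes the binding limit
min_workers = math.ceil(query_max_memory_gb / query_max_memory_per_node_gb)
print(min_workers)  # 50
```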

4 Configure connectors

Presto accesses data via connectors, which are mounted in catalogs. The connector provides all of the schemas and tables inside of the catalog. For example, the Hive connector maps each Hive database to a schema, so if the Hive connector is mounted as the hive catalog, and Hive contains a table clicks in database web, that table would be accessed in Presto as hive.web.clicks.
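The three-level catalog.schema.table naming described above can be shown with a tiny sketch (the example name hive.web.clicks is the one from the paragraph; the helper function is my own):

```python
def split_qualified_name(name: str):
    """Split a fully qualified Presto table name into (catalog, schema, table)."""
    catalog, schema, table = name.split(".")
    return catalog, schema, table

# The Hive table `clicks` in database `web`, reached via the `hive` catalog:
print(split_qualified_name("hive.web.clicks"))  # ('hive', 'web', 'clicks')
```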

Take Hive as an example:

# mkdir etc/catalog

# cat etc/catalog/hive.properties
connector.name=hive-hadoop2
hive.metastore.uri=thrift://example.net:9083
#hive.config.resources=/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml

For details, see: https://prestodb.github.io/docs/current/connector.html

5 Start

Start in the foreground first for debugging:

# bin/launcher run --verbose

Presto requires Java 8u151+ (found 1.8.0_141)

Presto requires JDK 8u151 or later; the message above shows the launcher rejecting an older JDK.

Once it starts cleanly, run it in the background:

# bin/launcher start

III. Usage

1 Download the CLI

# wget https://repo1.maven.org/maven2/com/facebook/presto/presto-cli/0.217/presto-cli-0.217-executable.jar
# mv presto-cli-0.217-executable.jar presto
# chmod +x presto
# ./presto --server localhost:8080 --catalog hive --schema default

2 JDBC

# wget https://repo1.maven.org/maven2/com/facebook/presto/presto-jdbc/0.217/presto-jdbc-0.217.jar
# export HIVE_AUX_JARS_PATH=/path/to/presto-jdbc-0.217.jar
# beeline -u jdbc:presto://example.net:8080/hive/sales

Reference: https://prestodb.github.io/overview.html
