
presto-0.147+postgresql-9.5.3+mysql-5.0.7+hadoop-2.5.2+hive-1.2.1 environment setup and testing

Background

Every database that supports SQL has a powerful SQL engine behind it.

These engines are broadly similar: they handle SQL parsing, semantic analysis, building the query plan tree, optimizing it, executing it, and finally returning the results to the client.

Presto is no different in this respect.

Its architecture is as follows:


Preparation

1.postgresql-9.5.3

2.mysql-5.0.7

3.hadoop-2.5.2

4.hive-1.2.1

5.presto-server-0.147

6.presto-cli-0.147-executable.jar

Also note the system requirements:

Mac OS X or Linux
Java 8 Update 60 or higher (8u60+), 64-bit
Maven 3.3.9+ (for building)
Python 2.4+ (for running with the launcher script)
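
A quick way to confirm that the host meets these requirements (assuming java, mvn and python are already on the PATH):

java -version
mvn -version
python -V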


Environment Setup (1)

mysql and postgresql were both set up on the Windows side and can be used as-is.

The setup steps for hadoop-2.5.2 were covered in an earlier post, so they are not repeated here.

For hive, the environment is ready to use right after extracting the archive.

Two of Hive's CLI tools deserve a mention here:

1.hive shell

2.beeline

The official site now marks beeline as the new CLI and the hive shell as the older one. Beeline is recommended; its result display is more intuitive and readable, much like the mysql client.

To start beeline when hive uses the default embedded Derby database, use:

./bin/beeline -u jdbc:hive2://
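
If HiveServer2 is running as a separate service rather than in the embedded mode above, beeline can also connect over the network. A sketch, assuming HiveServer2 on the same host listening on its default port 10000:

./bin/hiveserver2 &
./bin/beeline -u jdbc:hive2://localhost:10000 -n your-user
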
Note that Derby allows only a single concurrent user; otherwise you will get an error like the following:
Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /home/myProject/apache-hive-1.2.1-bin/metastore_db.
        at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
        at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.privGetJBMSLockOnDB(Unknown Source)
        at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.run(Unknown Source)
        at java.security.AccessController.doPrivileged(Native Method)
        at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.getJBMSLockOnDB(Unknown Source)
        at org.apache.derby.impl.store.raw.data.BaseDataFileFactory.boot(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
        at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
        at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
        at org.apache.derby.impl.store.raw.RawStore.boot(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
        at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
        at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
        at org.apache.derby.impl.store.access.RAMAccessManager.boot(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
        at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.startModule(Unknown Source)
        at org.apache.derby.iapi.services.monitor.Monitor.bootServiceModule(Unknown Source)
        at org.apache.derby.impl.db.BasicDatabase.bootStore(Unknown Source)
        at org.apache.derby.impl.db.BasicDatabase.boot(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.boot(Unknown Source)
        at org.apache.derby.impl.services.monitor.TopService.bootModule(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.bootService(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.startProviderService(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.findProviderAndStartService(Unknown Source)
        at org.apache.derby.impl.services.monitor.BaseMonitor.startPersistentService(Unknown Source)
        at org.apache.derby.iapi.services.monitor.Monitor.startPersistentService(Unknown Source)
        ... 83 more
Error applying authorization policy on hive configuration: java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient

The fix is simple:
rm -rf derby.log metastore_db
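
For reference, a longer-term alternative to the Derby single-user limitation (not used in this post) is to back the Hive metastore with MySQL. A minimal hive-site.xml sketch, assuming a metastore database named hive_meta and the MySQL JDBC driver placed in Hive's lib directory:

<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://your-mysql-location-IP:3306/hive_meta?createDatabaseIfNotExist=true</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>your-mysql-username</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>your-mysql-password</value>
  </property>
</configuration>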

Environment Setup (2)

presto

  1. Extract presto-server-0.147.tar.gz
  2. mkdir presto-server-0.147/etc
  3. mkdir presto-server-0.147/etc/catalog
  4. vim etc/node.properties
    node.environment=production
    node.id=1
    node.data-dir=/home/myProject/presto-server-0.147/data
    

  5. etc/jvm.config
    -server
    -Xmx16G
    -XX:+UseG1GC
    -XX:G1HeapRegionSize=32M
    -XX:+UseGCOverheadLimit
    -XX:+ExplicitGCInvokesConcurrent
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:OnOutOfMemoryError=kill -9 %p
    

  6. etc/config.properties
    coordinator=true
    node-scheduler.include-coordinator=true
    http-server.http.port=8080
    query.max-memory=5GB
    query.max-memory-per-node=1GB
    discovery-server.enabled=true
    discovery.uri=http://your-presto-coordinator-IP:8080
    

  7. etc/log.properties
    com.facebook.presto=INFO
    
  8. Download presto-cli-0.147-executable.jar
  9. Place it under the bin directory and rename it to presto (so it can be run as ./presto below)
  10. Make it executable:
    chmod +x presto

  11. etc/catalog/mysql.properties
    connector.name=mysql
    connection-url=jdbc:mysql://your-mysql-location-IP:3306
    connection-user=your-mysql-username
    connection-password=your-mysql-password
    
  12. etc/catalog/postgresql.properties
    connector.name=postgresql
    connection-url=jdbc:postgresql://your-postgresql-location-ip/postgres
    connection-user=your-postgres-username
    connection-password=your-postgresql-password
    
  13. etc/catalog/hive.properties
    connector.name=hive-hadoop2
    hive.metastore.uri=thrift://your-hive-ip:9083
    hive.config.resources=/etc/hadoop/core-site.xml,/etc/hadoop/hdfs-site.xml
    
    Choose connector.name according to the following (the metastore service referenced by hive.metastore.uri must also be running; see the note after this list):
    hive-hadoop1: Apache Hadoop 1.x
    hive-hadoop2: Apache Hadoop 2.x
    hive-cdh4: Cloudera CDH 4
    hive-cdh5: Cloudera CDH 5
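
Note that the hive connector talks to the Hive metastore Thrift service (port 9083 above), not to HiveServer2, so the metastore must be running before Presto can query Hive. One way to start it (assuming $HIVE_HOME/bin is on the PATH):

nohup hive --service metastore &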

Startup

  1. bin/launcher start
  2. ./presto --server localhost:8080 --catalog hive --schema default
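
Once the CLI is connected, a quick sanity check is to list the catalogs and schemas Presto has actually loaded (the output depends on your configuration):

SHOW CATALOGS;
SHOW SCHEMAS FROM hive;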

Results

MySQL

presto:test_hive> select * from mysql.sqoop.t1;
 id | int_col | char_col 
----+---------+----------
  1 |       1 | a        
  2 |       2 | b        
  4 |       4 | d        
  3 |       3 | c        
  5 |       5 | e        
(5 rows)

Query 20160520_101400_00009_k46dt, FINISHED, 1 node
http://localhost:8080/query.html?20160520_101400_00009_k46dt
Splits: 2 total, 0 done (0.00%)
CPU Time: 0.0s total,     0 rows/s,     0B/s, 100% active
Per Node: 0.0 parallelism,     0 rows/s,     0B/s
Parallelism: 0.0
0:29 [0 rows, 0B] [0 rows/s, 0B/s]


postgresql

presto:test_hive> select * from postgresql.public.test;
 id | name 
----+------
  1 | lily 
  2 | Tom  
  3 | Jim  
(3 rows)

Query 20160520_101503_00010_k46dt, FINISHED, 1 node
http://localhost:8080/query.html?20160520_101503_00010_k46dt
Splits: 2 total, 0 done (0.00%)
CPU Time: 0.0s total,     0 rows/s,     0B/s, 0% active
Per Node: 0.0 parallelism,     0 rows/s,     0B/s
Parallelism: 0.0
0:02 [0 rows, 0B] [0 rows/s, 0B/s]


mysql&postgresql

presto:test_hive> select id,char_col from mysql.sqoop.t1 union select id,name from postgresql.public.test;
 id | char_col 
----+----------
  1 | lily     
  2 | Tom      
  3 | Jim      
  1 | a        
  2 | b        
  4 | d        
  3 | c        
  5 | e        
(8 rows)

Query 20160520_101532_00011_k46dt, FINISHED, 1 node
http://localhost:8080/query.html?20160520_101532_00011_k46dt
Splits: 6 total, 2 done (33.33%)
CPU Time: 0.0s total,   107 rows/s,     0B/s, 17% active
Per Node: 0.0 parallelism,     0 rows/s,     0B/s
Parallelism: 0.0
0:28 [3 rows, 0B] [0 rows/s, 0B/s]


hive

presto:test_hive> select count(*) from stream;
  _col0   
----------
 10353632 
(1 row)

Query 20160524_054416_00010_ceya5, FINISHED, 1 node
http://localhost:8080/query.html?20160524_054416_00010_ceya5
Splits: 42 total, 40 done (95.24%)
CPU Time: 6.0s total, 1.67M rows/s,  210MB/s, 18% active
Per Node: 0.7 parallelism, 1.16M rows/s,  146MB/s
Parallelism: 0.7
0:09 [10.1M rows, 1.24GB] [1.16M rows/s, 146MB/s]


mysql + postgresql + hive cross-DB query

presto:test_hive> select char_col from mysql.mysqldb.test union select name from postgresql.public.test union select userid from stream limit 10;
  char_col   
-------------
 lily        
 Tom         
 Jim         
 user_000087 
 user_000031 
 user_000062 
 user_000063 
 user_000088 
 user_000089 
 user_000064 
(10 rows)

Query 20160524_054314_00009_ceya5, FINISHED, 1 node
http://localhost:8080/query.html?20160524_054314_00009_ceya5
Splits: 44 total, 1 done (2.27%)
CPU Time: 0.5s total,  651K rows/s, 80.4MB/s, 16% active
Per Node: 0.0 parallelism, 16.6K rows/s, 2.05MB/s
Parallelism: 0.0
0:21 [352K rows, 43.5MB] [16.6K rows/s, 2.05MB/s]

Querying directly across databases like this is one of Presto's signature features. In a production environment, massive data rarely comes from a single source, so this capability is especially convenient for real-time data analysis.
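
The same mechanism also supports joins across catalogs, not only unions. A sketch against the mysql and postgresql tables used above (the shared id column as join key is an assumption):

select m.id, m.char_col, p.name
from mysql.sqoop.t1 m
join postgresql.public.test p on m.id = p.id;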

Judging from the official documentation, however, Presto currently supports only the following 13 data sources:

1. Black Hole Connector
2. Cassandra Connector
3. Hive Connector
4. JMX Connector
5. Kafka Connector
6. Kafka Connector Tutorial
7. MongoDB Connector
8. MySQL Connector
9. PostgreSQL Connector
10. Redis Connector
11. System Connector
12. TPCH Connector
13. Local File Connector

In addition, presto-jdbc-0.147.jar is a standard JDBC driver, and JDBC URLs of the following forms are all supported:
jdbc:presto://host:port
jdbc:presto://host:port/catalog
jdbc:presto://host:port/catalog/schema
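
A minimal Java sketch of querying through this driver (the class name, user name and query here are illustrative; presto-jdbc-0.147.jar must be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Properties;

public class PrestoJdbcExample {
    public static void main(String[] args) throws Exception {
        // The Presto JDBC driver requires a user name; no password is needed
        // when the server runs without authentication, as in this setup.
        Properties props = new Properties();
        props.setProperty("user", "test");

        // catalog = hive, schema = default, matching the CLI session above
        String url = "jdbc:presto://localhost:8080/hive/default";

        try (Connection conn = DriverManager.getConnection(url, props);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select count(*) from stream")) {
            while (rs.next()) {
                System.out.println(rs.getLong(1));
            }
        }
    }
}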

----over----