
Repost: Efficient data loading in Greenplum with gpfdist and external tables

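As an OLAP analytics database, Greenplum inevitably has to load large volumes of data from external databases. The traditional ETL approach (select => insert) pushes every row through Greenplum's single master node, so it is very inefficient. This article shows how to load data quickly with external tables and gpfdist.

Differences between ordinary (readable) and writable external tables:
1. An ordinary external table supports only SELECT; a writable external table supports only INSERT.
2. A writable external table has no error table.
3. A writable external table can specify a distribution key, and defaults to random distribution if none is given; an ordinary external table is always randomly distributed.

Advantages of gpfdist:
1. The segments load the data directly and in parallel.
2. Data files are loaded directly and can be read or written (depending on the type of external table).
3. Data is distributed randomly by default, so the load is balanced across nodes (depending on the type of external table).

Example:

1. Start gpfdist

gpfdist ships with Greenplum, so once GP is installed you can start the service just by specifying a directory, a port, and so on. If you need a dedicated file server, download gpfdist separately and run it on that server.

    $ nohup /disk/GP/bin/gpfdist -p 8081 -d /disk/upload &

Running the command with nohup and & keeps it alive as a background process; otherwise the process is killed when the client session that started it closes. -p 8081 sets the port and -d /disk/upload sets the directory that gpfdist serves.

    $ ps -ef|grep gpfdist
    gpadmin    816 32606  0 17:08 pts/4    00:00:00 grep gpfdist
    gpadmin  13036     1  0 Oct21 ?        00:00:44 /disk/GP/bin/gpfdist -p 8081 -d /disk/upload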
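Besides looking for the process with ps, it can help to confirm that the port actually accepts connections before creating any external table. The following is a minimal sketch, not part of the original article: the binary path, directory and port are the ones used above, while the log file path and the helper names start_gpfdist and port_open are assumptions made for illustration. The port check relies on bash's /dev/tcp feature.

```shell
#!/usr/bin/env bash
# Sketch: start gpfdist in the background and check that its port is reachable.
# The binary path comes from the article; override GPFDIST_BIN if yours differs.
GPFDIST_BIN=${GPFDIST_BIN:-/disk/GP/bin/gpfdist}

# Start gpfdist detached from the terminal and print its PID.
start_gpfdist() {
    local dir=$1 port=$2 log=$3
    # -d: directory to serve, -p: listening port, -l: log file
    nohup "$GPFDIST_BIN" -d "$dir" -p "$port" -l "$log" >/dev/null 2>&1 &
    echo $!
}

# Succeed if a TCP connection to host:port can be opened (bash /dev/tcp).
port_open() {
    local host=$1 port=$2
    (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null
}
```

A possible use: run start_gpfdist /disk/upload 8081 /tmp/gpfdist.log, wait a moment, then run port_open localhost 8081 && echo "gpfdist is listening". The check only proves that something is listening on the port, so the ps output above remains the way to confirm that it is gpfdist.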
2. Create an ordinary external table

    CREATE EXTERNAL TABLE "ods"."order" (
        "id" varchar(64),
        "create_by" varchar(64),
        "create_date" timestamp,
        "update_by" varchar(64),
        "update_date" timestamp,
        "del_flag" char(1),
        "user_id" varchar(64),
        "user_name" varchar(64),
        "account_id" varchar(64),
        "equ_def_id" varchar(64),
        "amount_in" numeric(20,2),
        "amount_in_money" numeric(20,2),
        "amount_out" numeric(20,2),
        "amount_out_money" numeric(20,2),
        "fee" numeric(20,2),
        "fee_discount" numeric(20,2),
        "money" numeric(20,2),
        "actual_pay_money" numeric(20,2),
        "discount_money" numeric(20,2),
        "pay_time" timestamp,
        "order_type" varchar(16),
        "status" varchar(64),
        "rel_biz_type" varchar(64),
        "rel_biz_id" varchar(64),
        "equ_agreement" varchar(64),
        "remarks" varchar(100),
        "transaction_type" varchar(1),
        "pay_stop_date" timestamp,
        "stop_time" timestamp,
        "bid_method" varchar(64),
        "profit_fee_rate" numeric(20,2),
        "quit_charge_rate" numeric(20,2),
        "return_tb" numeric(20,2),
        "last_order_id" varchar(64),
        "root_order_id" varchar(64),
        "invest_name_for_me" varchar(64),
        "invest_name_for_buyer" varchar(64),
        "invest_start_time" timestamp,
        "available_amount" numeric(20,2),
        "surplus_days" int4,
        "transferable_flag" varchar(64),
        "trc_order_id" varchar(64)
    )
    LOCATION ('gpfdist://gp-master:8081/order.csv')
    format 'csv' (DELIMITER ';');

There are a few pitfalls to watch out for:
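1. The sample file order.csv is test data taken from a production MySQL database, and the table contains time-typed columns. If you export the live data to a txt file, the time values become varchar, and loading them into Greenplum's time-typed columns then fails with a format error. Export to CSV whenever possible.
2. The gpfdist location is specified as: LOCATION ('gpfdist://<file server hostname or IP>:<gpfdist port>/<file to load>') format '<file format>' (DELIMITER '<delimiter>')
3. When exporting the live data to text, leave out the header row and the enclosing (quote) character; only the delimiter is needed.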
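The third pitfall can be checked before loading. The sketch below is an illustration rather than part of the original article: the helper name bad_lines is hypothetical, and it assumes that no value contains the delimiter, which holds when the export uses no enclosing character. It prints the number of every line whose field count differs from the expected one.

```shell
#!/usr/bin/env bash
# Sketch: report lines of an export file whose field count is not the expected one.
# Assumes values never contain the delimiter (no enclosing quotes in the export).
bad_lines() {
    local file=$1 expected=$2 delim=${3:-";"}
    # NF is the number of delimiter-separated fields; NR is the line number.
    awk -F"$delim" -v n="$expected" 'NF != n { print NR }' "$file"
}
```

For the table above, which has 42 columns, bad_lines /disk/upload/order.csv 42 should print nothing when every line is well formed. A header row with the right number of columns would still pass this check, so it does not replace a quick look at the first line of the file.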
With a test set of 900,000 records, gpfdist loads the data in a matter of seconds, which is very efficient.
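Defining the external table does not move any data by itself; the load happens when its rows are selected into a regular table. A minimal sketch of that step follows, with loudly stated assumptions: the target table ods.order_local, the database name mydb and the helper name load_sql are hypothetical and do not appear in the original article, and the target table is assumed to exist already with matching columns.

```shell
#!/usr/bin/env bash
# Sketch: emit the statement that copies every row of an external table
# into a regular table. Table names are passed in already quoted as needed.
load_sql() {
    local ext_table=$1 target_table=$2
    printf 'INSERT INTO %s SELECT * FROM %s;\n' "$target_table" "$ext_table"
}
```

It could be run as load_sql 'ods."order"' 'ods.order_local' | psql -v ON_ERROR_STOP=1 -d mydb. Because the source is a gpfdist external table, the segments fetch the file from gpfdist themselves instead of receiving the rows through the master.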
3. Create a writable external table

    CREATE WRITABLE EXTERNAL TABLE "ods"."order1" (
        "id" varchar(64),
        "create_by" varchar(64),
        "create_date" timestamp,
        "update_by" varchar(64),
        "update_date" timestamp,
        "del_flag" char(1),
        "user_id" varchar(64),
        "user_name" varchar(64),
        "account_id" varchar(64),
        "equ_def_id" varchar(64),
        "amount_in" numeric(20,2),
        "amount_in_money" numeric(20,2),
        "amount_out" numeric(20,2),
        "amount_out_money" numeric(20,2),
        "fee" numeric(20,2),
        "fee_discount" numeric(20,2),
        "money" numeric(20,2),
        "actual_pay_money" numeric(20,2),
        "discount_money" numeric(20,2),
        "pay_time" timestamp,
        "order_type" varchar(16),
        "status" varchar(64),
        "rel_biz_type" varchar(64),
        "rel_biz_id" varchar(64),
        "equ_agreement" varchar(64),
        "remarks" varchar(100),
        "transaction_type" varchar(1),
        "pay_stop_date" timestamp,
        "stop_time" timestamp,
        "bid_method" varchar(64),
        "profit_fee_rate" numeric(20,2),
        "quit_charge_rate" numeric(20,2),
        "return_tb" numeric(20,2),
        "last_order_id" varchar(64),
        "root_order_id" varchar(64),
        "invest_name_for_me" varchar(64),
        "invest_name_for_buyer" varchar(64),
        "invest_start_time" timestamp,
        "available_amount" numeric(20,2),
        "surplus_days" int4,
        "transferable_flag" varchar(64),
        "trc_order_id" varchar(64)
    )
    LOCATION ('gpfdist://gp-master:8081/order1.csv')
    format 'CSV' (DELIMITER ';')
    DISTRIBUTED BY (id);
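A writable external table works in the opposite direction: rows inserted into it are sent to gpfdist, which writes them to order1.csv in its served directory. The sketch below shows one way to run such an export and then sanity-check the result. The database name, the source table and the helper names export_orders and exported_lines are hypothetical; the line count equals the row count only if no value contains an embedded newline.

```shell
#!/usr/bin/env bash
# Sketch: write rows out through the writable external table "ods"."order1".
# db and source_table are supplied by the caller (both hypothetical here).
export_orders() {
    local db=$1 source_table=$2
    psql -v ON_ERROR_STOP=1 -d "$db" \
         -c "INSERT INTO \"ods\".\"order1\" SELECT * FROM $source_table;"
}

# Count the lines in the file that gpfdist produced.
exported_lines() {
    wc -l < "$1" | tr -d ' '
}
```

For example, export_orders mydb ods.order_local followed by exported_lines /disk/upload/order1.csv, comparing the result with the row count of the source table.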