Running a Python script from a shell script to connect to Hive and write data into a table

Usage

1. cd /opt/zy
Run the commands below in this directory with root privileges.
2. When querying in SAP:
Tcode: ZMMR0005
Purchase Org: *
PO Creating: 2017/3/1 (start date) to 2017/6/30 (end date)
Vendor: 1000341
Plant: *

A query with these settings returns all records whose shipping date falls between 20170301 and 20170630, no matter which month the arrival date falls in.

Export the data table from SAP and save it as a txt file delimited by "\t".
Upload the exported file to the /opt/zy directory with the rz command.
3. Run the command. Note: the argument must strictly follow the format XXXXXXXXtoYYYYYYYY, meaning startdate to enddate (a validation sketch follows this list).
example:
[root@<host> zy]# bash try2.sh 20170301to20170630
4. Go to Hue and run this query on the analysis result:
SELECT * FROM saplifttime WHERE querypocredatestart='XXXXXXXX' [AND querypocredateend='YYYYYYYY'];
5. To see the raw data, query the pcg.sap table as follows:
SELECT * FROM sap WHERE querypocredatestart='20170301';
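
For illustration, a minimal sketch of how the XXXXXXXXtoYYYYYYYY argument could be validated up front (the check_daterange helper is hypothetical; the scripts below simply split on "to" without any validation):

import sys
from datetime import datetime

def check_daterange(arg):
    # Hypothetical helper; the real scripts just split the argument on "to".
    startdate, enddate = arg.split("to")   # e.g. "20170301to20170630"
    for d in (startdate, enddate):
        datetime.strptime(d, "%Y%m%d")     # raises ValueError for an invalid date
    return startdate, enddate

print(check_daterange(sys.argv[1]))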

Screenshot of the run result (image not shown).

Technical implementation notes

A shell script is used to call the Python script.

Shell script try2.sh

#!/bin/sh
#echo $1
daterange=$1   # assigned to a variable because the substring extraction below needs one
python3 /opt/zy/runtask.py $1   # run the Python script
startdate=${daterange:0:8}   # extract the query start date
#echo $startdate
enddate=${daterange:10:8}   # extract the query end date
#echo $enddate
sed -i '1,3d' /opt/zy/$1.txt   # delete the first three lines, which are blank
sed 's/.\{1\}//' $1.txt > $1regular.txt   # delete the first column, which is empty (strip the first character of each line)
hdfs dfs -put -f /opt/zy/$1regular.txt /user/hive/pcg-data/zhouyi6_files   # upload the local server file to the Hadoop cluster
hive -e "LOAD DATA INPATH '/user/hive/pcg-data/zhouyi6_files/$1regular.txt' INTO TABLE pcg.sap partition(querypocredatestart=$startdate,querypocredateend=$enddate)"   # load the file's data into the table
rm $1.txt   # delete the original local file, keeping only the reformatted one
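
The two sed lines are plain text cleanup: drop the first three (blank) lines, then strip the leading character (the empty first column) from every remaining line. A minimal Python equivalent of that cleanup, for illustration only (the file names are placeholders):

# Sketch of what the two sed commands do; "input.txt" and "input_regular.txt" are placeholder names.
with open("input.txt", encoding="utf-8") as f:
    lines = f.readlines()
lines = lines[3:]                      # sed -i '1,3d': delete the first three lines
lines = [line[1:] for line in lines]   # sed 's/.\{1\}//': delete the first character of each line
with open("input_regular.txt", "w", encoding="utf-8") as f:
    f.writelines(lines)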

Notes:
1. The plain sed command does not modify the file itself, so the result of stripping the first column has to be saved into a new file, the one with the regular suffix.
2. With sed -i, the -i option edits the file in place, so the result of deleting the first three lines is written back to the file instead of being printed on the command line.
3. For hdfs dfs -put -f, the -f option will overwrite the destination if it already exists.
4. The prerequisite for running this script is that the pcg.sap table has already been created; the DDL is as follows:

CREATE TABLE SAP(`PO Cre Date` string,
`Vendor` string, 
`WW Partner` string, 
`Name of Vendor` string,
`PO Cre by` string, 
`Purch Doc Type` string,
`Purch Order` string,
`PO Item` string,
`Deletion Indicator in PO Item` string, 
`Request Shipment Day` string,
`Material` string,
`Short Text` string, 
`Plant` string, 
`Issuing Stor location` string,
`Receive Stor loaction` string, 
`PO item change date` string, 
`Delivery Priority` string,
`PO Qty` string,
`Total GR Qty` string,
`Still to be delivered` string,
`Delivery Note` string,
`Delivery Note Type (ASN or DN)` string, 
`Delivery Note item` string,
`Delivery Note qty` string, 
`Delivery Note Creation Date` string,
`Delivery Note ACK Date` string, 
`Incoterm` string, 
`Part Battery Indicator` string,
`BOL/AWBill` string, 
`Purchase order type` string, 
`Gr Date` string) 
partitioned by (`queryPoCreDateStart` string,`queryPoCreDateEnd` string)
row format delimited fields terminated by "\t" stored as textfile
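
To confirm this prerequisite before running the script, a minimal check via impyla (the host, port, and database are the ones used in the Python script below; adjust them for your cluster):

from impala.dbapi import connect

conn = connect(host='10.100.208.222', port=21050, database='pcg')
cur = conn.cursor()
cur.execute("SHOW TABLES IN pcg LIKE 'sap'")
print(cur.fetchall())   # a non-empty result means pcg.sap exists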

Python script runtask.py

import pandas as pd
import sys

data = pd.read_csv(sys.argv[1] + ".txt", sep="\t")
#print(data.columns)
data['Delivery Note Creation Date'] = pd.to_datetime(data['Delivery Note Creation Date'], format='%d.%m.%Y')
data['Gr Date'] = pd.to_datetime(data['Gr Date'], format='%d.%m.%Y')
data = data.drop(data[data['Delivery Note Creation Date'].isnull()].index.tolist())   # drop rows where this column is null
data = data.drop(data[data['Gr Date'].isnull()].index.tolist())   # drop rows where this column is null
data['delta'] = (data['Gr Date'] - data['Delivery Note Creation Date']).apply(lambda x: x.days)   # the in-transit time in days
print(data['delta'].describe())
#sql_content="insert into table saplifttime values(%,%s,%s,%s,%s,%s,%s,%s,%s,%s)"%\
import hdfs
from impala.dbapi import connect
filename=sys.argv[1]+".txt"
hdfspath='/user/hive/pcg-data/zhouyi6_files'
client = hdfs.Client("http://10.100.208.222:50070")   # 50070 is the HDFS NameNode web UI port
# 8888 is the port I use when logging in to the web UI
#print(client.status("/user/zhouyi", strict=True))   # view info about a path
#print(client.list("/user/zhouyi"))   # list the files under a directory
#client.upload(hdfs_path=hdfspath, local_path="/opt/zy/" + filename, overwrite=True)
# overwrite=True overwrites any existing file at the destination
conn = connect(host='10.100.208.222', port=21050,database='pcg')
cur = conn.cursor()
stdate,edate=sys.argv[1].split("to")
#print(sys.argv[1])
desc = data['delta'].describe()   # compute the summary statistics once and index them by label
cnt = str(desc['count'])
mean = str(desc['mean'])
std = str(desc['std'])
mini = str(desc['min'])
twentyfive = str(desc['25%'])
fifty = str(desc['50%'])
seventyfive = str(desc['75%'])
maxm = str(desc['max'])
args=[stdate,edate,cnt,mean,std,mini,twentyfive,fifty,seventyfive,maxm]
print(args)

# SQL that works:
#sql_content="insert into table saplifttime values("+str(5555)+",'20200607','22','4.2','9.88','1','2','5','10','9999999999999')"
sql_content = "insert into table saplifttime values(?,?,?,?,?,?,?,?,?,?)"
cur.execute(sql_content, args)   # insert the computed statistics into the pcg.saplifttime table

Notes:
1. The prerequisite for cur.execute is that the pcg.saplifttime table has already been created; the DDL is as follows:

CREATE TABLE SAPLifttime(querypocredatestart STRING,
querypocredateend STRING,
cnt STRING,
mean STRING,
std STRING,
minimum STRING,
`25percent` STRING,
`50percent` STRING,
`75percent` STRING,
maxmum STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t" STORED AS TEXTFILE

2. Computation logic (a runnable sketch follows this list):
Step 1: treat the "Delivery Note Creation Date" field as the date the goods were shipped; if it is empty, drop the row.
Step 2: treat the "Gr Date" field as the date the goods arrived; if it is empty, drop the row.
Step 3: in-transit time = Gr Date - Delivery Note Creation Date.
Step 4: compute cnt, mean, std, minimum, 25%, 50%, 75%, maxmum over the in-transit times.
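
A self-contained sketch of these four steps on made-up sample rows (the dates are invented, in the same dd.mm.yyyy format the SAP export uses):

import pandas as pd

data = pd.DataFrame({
    'Delivery Note Creation Date': ['01.03.2017', '05.03.2017', None],
    'Gr Date': ['08.03.2017', None, '10.03.2017'],
})
for col in ['Delivery Note Creation Date', 'Gr Date']:
    data[col] = pd.to_datetime(data[col], format='%d.%m.%Y')
data = data.dropna(subset=['Delivery Note Creation Date', 'Gr Date'])   # steps 1 and 2: drop rows with empty dates
data['delta'] = (data['Gr Date'] - data['Delivery Note Creation Date']).dt.days   # step 3: in-transit days
print(data['delta'].describe())   # step 4: count, mean, std, min, 25%, 50%, 75%, max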

Pitfalls I hit:
1. All of my table fields are STRING. For the VALUES placeholders I first tried %s and %d, but they never matched the format of the corresponding Python values. Using ? as the placeholder fixed it.
2. Writing cur.execute(sql, args) this way has the advantage of being clear to read: there is no need to concatenate an extremely long SQL string, which is very easy to get wrong. A comparison follows below.
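
To make pitfall 2 concrete, a minimal comparison that reuses the cursor cur and the ten-element args list from the script above:

# Error-prone: concatenating every value into one long SQL string by hand.
sql_concat = "insert into table saplifttime values('" + "','".join(args) + "')"
# Clear and safe: ? placeholders, with the driver substituting the values.
sql_param = "insert into table saplifttime values(?,?,?,?,?,?,?,?,?,?)"
cur.execute(sql_param, args)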