Given a table my_latlng of latitude/longitude points and a grid table my_grid, how can we backfill the grid id into my_latlng?
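- Scenario:
Suppose we have a table my_latlng(lat string,lng string) that holds a series of latitude/longitude points, and a grid table my_grid(gridid bigint,centerlng double,centerlat double,gridx int,gridy int,minlng double,maxlng double,minlat double,maxlat double) in which each grid cell is a square with 5 m sides, where:
gridid: grid cell id
centerlng: longitude of the cell's center point
centerlat: latitude of the cell's center point
gridx: position of the cell along the x axis
gridy: position of the cell along the y axis
Requirement: for each row of my_latlng, find the id of the grid cell it falls in. Points that lie outside the overall grid are excluded from the computation.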
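- Solution 1:
Since every grid cell has a minimum and maximum longitude and latitude, we can use the cell's coordinate range directly to backfill the grid id into my_latlng: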
select t11.gridid,t10.lat,t10.lng from my_latlng t10 inner join my_grid t11 where t10.lat>=t11.minlat and t10.lat<=t11.maxlat and t10.lng>=t11.minlng and t10.lng<t11.maxlng;
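Drawback: the inner join here has no ON condition. In Hive, the comparison operators >=, >, <, <= cannot be used in the ON condition of an inner join; this is a limitation of Hive's join syntax.
As a result, the statement above actually runs as a cross join. If my_latlng has 10 million rows and my_grid has 100 million rows, this cross join plus the WHERE filter cannot produce a result within 5 hours, even on a cluster with 1000 Spark resource units (assuming one unit is 1 vcore CPU + 12 GB memory + 500 GB disk).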
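To make the scale concrete, here is a small back-of-the-envelope sketch. The row counts are the ones from the scenario above; the object and member names are made up for illustration.

```scala
object CrossJoinCost {
  // Row counts taken from the scenario in the text.
  val latLngRows: BigInt = BigInt(10000000L)  // 10 million points
  val gridRows: BigInt = BigInt(100000000L)   // 100 million grid cells

  // A cross join evaluates the WHERE predicate once per (point, cell) pair.
  val pairs: BigInt = latLngRows * gridRows

  // Pairs each core must evaluate if the work is spread evenly over `cores` cores.
  def pairsPerCore(cores: Int): BigInt = pairs / cores
}
```

`CrossJoinCost.pairs` is 10^15, so even when the work is split evenly over 1000 vcores each core still has to evaluate about 10^12 pairs.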
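- Solution 2:
The fifth decimal place of a latitude or longitude value corresponds roughly to one meter: 0.00001° of latitude is about 1.1 m, and at latitude 30° a change of 0.00001° in longitude is about 0.96 m. So we can roughly treat a 5 m grid cell as a change of 0.00005 in both longitude and latitude.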
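As a rough check of these magnitudes, the sketch below converts degrees to meters. It assumes a spherical Earth with a mean radius of 6,371,000 m; that assumption and the names used are illustrative and not from the original text.

```scala
object DegreeToMeters {
  // Assumption: spherical Earth with mean radius 6,371,000 m.
  val EarthRadiusM: Double = 6371000.0

  // Meters covered by one degree of latitude (the same at every latitude on a sphere).
  val metersPerDegreeLat: Double = math.Pi * EarthRadiusM / 180.0

  // Meters covered by one degree of longitude at the given latitude (in degrees).
  def metersPerDegreeLng(latDeg: Double): Double =
    metersPerDegreeLat * math.cos(math.toRadians(latDeg))
}
```

Under these assumptions, 0.00005° is about 5.6 m of latitude and about 4.8 m of longitude at latitude 30°, so equating a 5 m cell with a 0.00005° step is only a coarse approximation.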
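Using this property, we can proceed as follows:
Step 1: for each point, find the roughly 8+1 surrounding grid cells whose center latitude and longitude differ from the point's by about 5 m or less;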
( rpad(t10.lat+0.00005,7,'0')=rpad(t11.centerlat,7,'0') or rpad(t10.lat,7,'0')=rpad(t11.centerlat+0.00005,7,'0') or rpad(t10.lat,7,'0')=rpad(t11.centerlat,7,'0') ) and ( rpad(t10.lng+0.00005,8,'0')=rpad(t11.centerlng,8,'0') or rpad(t10.lng,8,'0')=rpad(t11.centerlng+0.00005,8,'0') or rpad(t10.lng,8,'0')=rpad(t11.centerlng,8,'0') )
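Note: the range we handle is longitude 100.0 to 180.0 and latitude 10.0 to 90.0. Within this range the integer part always has three digits for longitude and two for latitude, so rpad to 8 and 7 characters keeps exactly four decimal places.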
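The matching keys in step 1 can be illustrated with a small emulation of rpad. This is a sketch only: it assumes a single-character pad and assumes that Hive renders a double as a string the way Scala's `toString` does. `GridKey`, `latKey` and `lngKey` are hypothetical names.

```scala
object GridKey {
  // Emulates Hive's rpad(str, len, pad) for a one-character pad:
  // pads on the right up to len, and truncates when str is longer than len.
  def rpad(s: String, len: Int, pad: Char): String =
    if (s.length >= len) s.substring(0, len)
    else s + pad.toString * (len - s.length)

  // Latitude in [10, 90): two integer digits + '.' + four decimals = 7 characters.
  def latKey(lat: Double): String = rpad(lat.toString, 7, '0')

  // Longitude in [100, 180): three integer digits + '.' + four decimals = 8 characters.
  def lngKey(lng: Double): String = rpad(lng.toString, 8, '0')
}
```

For example, `GridKey.latKey(30.12346)` gives "30.1234" while `GridKey.latKey(30.12346 + 0.00005)` gives "30.1235". Shifting by 0.00005 before truncating is what lets a point match a cell in the neighbouring four-decimal bucket.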
Step 2: from the cells found in step 1, pick the one closest to the point as the cell it belongs to.
(
(cast(t10.lng as double)-t11.centerlng)*(cast(t10.lng as double)-t11.centerlng)
+(cast(t10.lat as double)-t11.centerlat)*(cast(t10.lat as double)-t11.centerlat)
) distans
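As a minimal sketch of step 2 outside SQL, the same squared-distance expression can rank the candidate cells. `Cell` is a hypothetical, cut-down view of a my_grid row, and `nearest` is an illustrative helper rather than part of the original solution.

```scala
object NearestCell {
  // Hypothetical minimal view of a my_grid row: only the columns used here.
  final case class Cell(gridid: Long, centerlng: Double, centerlat: Double)

  // Squared distance in degrees, the same expression as in the text.
  // No square root is needed because only the ordering matters.
  def distans(lat: Double, lng: Double, c: Cell): Double = {
    val dLng = lng - c.centerlng
    val dLat = lat - c.centerlat
    dLng * dLng + dLat * dLat
  }

  // Returns the candidate with the smallest squared distance, or None if there are none.
  def nearest(lat: Double, lng: Double, candidates: Seq[Cell]): Option[Cell] =
    if (candidates.isEmpty) None
    else Some(candidates.minBy(c => distans(lat, lng, c)))
}
```

Returning `None` for an empty candidate list mirrors the requirement that points outside the grid do not take part in the result.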
Put together as a single Hive statement, the logic above would be written like this:
select t11.gridid,t10.lat,t10.lng,(
(cast(t10.lng as double)-t11.centerlng)*(cast(t10.lng as double)-t11.centerlng)
+(cast(t10.lat as double)-t11.centerlat)*(cast(t10.lat as double)-t11.centerlat)) distans from my_latlng t10 inner join my_grid t11 where ( rpad(t10.lat+0.00005,7,'0')=rpad(t11.centerlat,7,'0') or rpad(t10.lat,7,'0')=rpad(t11.centerlat+0.00005,7,'0') or rpad(t10.lat,7,'0')=rpad(t11.centerlat,7,'0') ) and (
rpad(t10.lng+0.00005,8,'0')=rpad(t11.centerlng,8,'0') or rpad(t10.lng,8,'0')=rpad(t11.centerlng+0.00005,8,'0') or rpad(t10.lng,8,'0')=rpad(t11.centerlng,8,'0') );
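However, this statement has two problems:
Problem 1) the inner join has no ON condition, because Hive does not allow the conditions in this WHERE clause to be written in ON; this is again a Hive syntax limitation;
Problem 2) it therefore also runs as a cross join, so it is naturally very slow as well.
The good news is:
1) the statement can be split into 9 separate statements, and each of those 9 statements can have an ON condition;
2) we then union all the results of the 9 statements and group by my_latlng.lat, my_latlng.lng to get the minimum distance for each point;
3) finally we inner join "the points of my_latlng plus their minimum distance" with "the union all result" to get the grid id for each point.
The code: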
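The three steps above can be sketched on in-memory rows before turning to Spark. This is only an illustration under assumed names: `Candidate` stands for one row of the union all result, and `resolve` is a hypothetical helper.

```scala
object BackfillSketch {
  // One row of the "union all" result: a candidate cell for a point, with its distance.
  final case class Candidate(gridid: Long, lat: String, lng: String, distans: BigDecimal)

  def resolve(unionAll: Seq[Candidate]): Set[(Long, String, String)] = {
    // Step 2: minimum distance per (lat, lng) point.
    val minDistans: Map[(String, String), BigDecimal] =
      unionAll
        .groupBy(c => (c.lat, c.lng))
        .map { case (point, rows) => point -> rows.map(_.distans).min }

    // Step 3: keep the rows whose distance equals the minimum for their point.
    // Collecting into a Set drops duplicate (gridid, lat, lng) rows.
    unionAll
      .filter(c => minDistans((c.lat, c.lng)) == c.distans)
      .map(c => (c.gridid, c.lat, c.lng))
      .toSet
  }
}
```

The same candidate row can arrive from more than one of the 9 statements, which is why the result is deduplicated at the end.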
// Staging tables. The distance column is named distans so that it matches the queries below.
hiveContext.sql("create table my_latlng_gridid_distance(gridid bigint,lat string,lng string,distans decimal(38,5))")
hiveContext.sql("create table my_latlng_mindistance(lat string,lng string,min_distans decimal(38,5))")
hiveContext.sql("create table my_latlng_gridid_result(gridid bigint,lat string,lng string)")

// temp00: lat+0.00005 matches centerlat, lng+0.00005 matches centerlng
hiveContext.sql("""select t11.gridid,t10.lat,t10.lng,cast(((cast(t10.lng as double)-t11.centerlng)*(cast(t10.lng as double)-t11.centerlng)
+(cast(t10.lat as double)-t11.centerlat)*(cast(t10.lat as double)-t11.centerlat)) *10000000000000 as decimal(38,5)) distans
from my_latlng t10 inner join my_grid t11
on rpad(t10.lat+0.00005,7,'0')=rpad(t11.centerlat,7,'0') and rpad(t10.lng+0.00005,8,'0')=rpad(t11.centerlng,8,'0')""").registerTempTable("temp00")
hiveContext.sql("insert into my_latlng_gridid_distance select * from temp00")

// temp01: lat+0.00005 matches centerlat, lng matches centerlng+0.00005
hiveContext.sql("""select t11.gridid,t10.lat,t10.lng,cast(((cast(t10.lng as double)-t11.centerlng)*(cast(t10.lng as double)-t11.centerlng)
+(cast(t10.lat as double)-t11.centerlat)*(cast(t10.lat as double)-t11.centerlat)) *10000000000000 as decimal(38,5)) distans
from my_latlng t10 inner join my_grid t11
on rpad(t10.lat+0.00005,7,'0')=rpad(t11.centerlat,7,'0') and rpad(t10.lng,8,'0')=rpad(t11.centerlng+0.00005,8,'0')""").registerTempTable("temp01")
hiveContext.sql("insert into my_latlng_gridid_distance select * from temp01")

// temp02: lat+0.00005 matches centerlat, lng matches centerlng
hiveContext.sql("""select t11.gridid,t10.lat,t10.lng,cast(((cast(t10.lng as double)-t11.centerlng)*(cast(t10.lng as double)-t11.centerlng)
+(cast(t10.lat as double)-t11.centerlat)*(cast(t10.lat as double)-t11.centerlat)) *10000000000000 as decimal(38,5)) distans
from my_latlng t10 inner join my_grid t11
on rpad(t10.lat+0.00005,7,'0')=rpad(t11.centerlat,7,'0') and rpad(t10.lng,8,'0')=rpad(t11.centerlng,8,'0')""").registerTempTable("temp02")
hiveContext.sql("insert into my_latlng_gridid_distance select * from temp02")

// temp10: lat matches centerlat+0.00005, lng+0.00005 matches centerlng
hiveContext.sql("""select t11.gridid,t10.lat,t10.lng,cast(((cast(t10.lng as double)-t11.centerlng)*(cast(t10.lng as double)-t11.centerlng)
+(cast(t10.lat as double)-t11.centerlat)*(cast(t10.lat as double)-t11.centerlat)) *10000000000000 as decimal(38,5)) distans
from my_latlng t10 inner join my_grid t11
on rpad(t10.lat,7,'0')=rpad(t11.centerlat+0.00005,7,'0') and rpad(t10.lng+0.00005,8,'0')=rpad(t11.centerlng,8,'0')""").registerTempTable("temp10")
hiveContext.sql("insert into my_latlng_gridid_distance select * from temp10")

// temp11: lat matches centerlat+0.00005, lng matches centerlng+0.00005
hiveContext.sql("""select t11.gridid,t10.lat,t10.lng,cast(((cast(t10.lng as double)-t11.centerlng)*(cast(t10.lng as double)-t11.centerlng)
+(cast(t10.lat as double)-t11.centerlat)*(cast(t10.lat as double)-t11.centerlat)) *10000000000000 as decimal(38,5)) distans
from my_latlng t10 inner join my_grid t11
on rpad(t10.lat,7,'0')=rpad(t11.centerlat+0.00005,7,'0') and rpad(t10.lng,8,'0')=rpad(t11.centerlng+0.00005,8,'0')""").registerTempTable("temp11")
hiveContext.sql("insert into my_latlng_gridid_distance select * from temp11")

// temp12: lat matches centerlat+0.00005, lng matches centerlng
hiveContext.sql("""select t11.gridid,t10.lat,t10.lng,cast(((cast(t10.lng as double)-t11.centerlng)*(cast(t10.lng as double)-t11.centerlng)
+(cast(t10.lat as double)-t11.centerlat)*(cast(t10.lat as double)-t11.centerlat)) *10000000000000 as decimal(38,5)) distans
from my_latlng t10 inner join my_grid t11
on rpad(t10.lat,7,'0')=rpad(t11.centerlat+0.00005,7,'0') and rpad(t10.lng,8,'0')=rpad(t11.centerlng,8,'0')""").registerTempTable("temp12")
hiveContext.sql("insert into my_latlng_gridid_distance select * from temp12")

// temp20: lat matches centerlat, lng+0.00005 matches centerlng
hiveContext.sql("""select t11.gridid,t10.lat,t10.lng,cast(((cast(t10.lng as double)-t11.centerlng)*(cast(t10.lng as double)-t11.centerlng)
+(cast(t10.lat as double)-t11.centerlat)*(cast(t10.lat as double)-t11.centerlat)) *10000000000000 as decimal(38,5)) distans
from my_latlng t10 inner join my_grid t11
on rpad(t10.lat,7,'0')=rpad(t11.centerlat,7,'0') and rpad(t10.lng+0.00005,8,'0')=rpad(t11.centerlng,8,'0')""").registerTempTable("temp20")
hiveContext.sql("insert into my_latlng_gridid_distance select * from temp20")

// temp21: lat matches centerlat, lng matches centerlng+0.00005
hiveContext.sql("""select t11.gridid,t10.lat,t10.lng,cast(((cast(t10.lng as double)-t11.centerlng)*(cast(t10.lng as double)-t11.centerlng)
+(cast(t10.lat as double)-t11.centerlat)*(cast(t10.lat as double)-t11.centerlat)) *10000000000000 as decimal(38,5)) distans
from my_latlng t10 inner join my_grid t11
on rpad(t10.lat,7,'0')=rpad(t11.centerlat,7,'0') and rpad(t10.lng,8,'0')=rpad(t11.centerlng+0.00005,8,'0')""").registerTempTable("temp21")
hiveContext.sql("insert into my_latlng_gridid_distance select * from temp21")

// temp22: lat matches centerlat, lng matches centerlng
hiveContext.sql("""select t11.gridid,t10.lat,t10.lng,cast(((cast(t10.lng as double)-t11.centerlng)*(cast(t10.lng as double)-t11.centerlng)
+(cast(t10.lat as double)-t11.centerlat)*(cast(t10.lat as double)-t11.centerlat)) *10000000000000 as decimal(38,5)) distans
from my_latlng t10 inner join my_grid t11
on rpad(t10.lat,7,'0')=rpad(t11.centerlat,7,'0') and rpad(t10.lng,8,'0')=rpad(t11.centerlng,8,'0')""").registerTempTable("temp22")
hiveContext.sql("insert into my_latlng_gridid_distance select * from temp22")

// Minimum distance for each point.
hiveContext.sql("select lat,lng,min(distans) as min_distans " +
  "from my_latlng_gridid_distance " +
  "group by lat,lng").repartition(200).persist().registerTempTable("temp_10000")
hiveContext.sql("insert into my_latlng_mindistance select * from temp_10000")

// Join back on point + minimum distance to recover the grid id.
hiveContext.sql("select t11.gridid,t11.lat,t11.lng " +
  "from my_latlng_mindistance as t10 " +
  "inner join my_latlng_gridid_distance as t11 " +
  "on t10.lat=t11.lat and t10.lng=t11.lng and t10.min_distans=t11.distans")
  .distinct() // must use distinct: the 9 joins can return the same (gridid, lat, lng) row more than once
  .repartition(200).persist().registerTempTable("temp_20000")
hiveContext.sql("insert into my_latlng_gridid_result select * from temp_20000")
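
The 9 statements differ only in their ON condition, so the query text could also be generated rather than written out by hand. The following is a sketch only, not the code used above: `NineJoins`, `axisVariants`, `onConditions` and `query` are hypothetical names, and it only builds strings without running anything against Spark.

```scala
object NineJoins {
  // The three ways one axis can match: shift the point, shift the cell center, or no shift.
  private def axisVariants(p: String, g: String, len: Int): Seq[String] = Seq(
    s"rpad($p+0.00005,$len,'0')=rpad($g,$len,'0')",
    s"rpad($p,$len,'0')=rpad($g+0.00005,$len,'0')",
    s"rpad($p,$len,'0')=rpad($g,$len,'0')"
  )

  // All 3 x 3 combinations of a latitude condition and a longitude condition.
  val onConditions: Seq[String] =
    for {
      latCond <- axisVariants("t10.lat", "t11.centerlat", 7)
      lngCond <- axisVariants("t10.lng", "t11.centerlng", 8)
    } yield s"$latCond and $lngCond"

  // Scaled squared distance, copied from the statements in the text.
  private val distansExpr: String =
    "cast(((cast(t10.lng as double)-t11.centerlng)*(cast(t10.lng as double)-t11.centerlng)" +
      "+(cast(t10.lat as double)-t11.centerlat)*(cast(t10.lat as double)-t11.centerlat))" +
      "*10000000000000 as decimal(38,5)) distans"

  // Full equi-join statement for one ON condition.
  def query(on: String): String =
    s"select t11.gridid,t10.lat,t10.lng,$distansExpr " +
      s"from my_latlng t10 inner join my_grid t11 on $on"
}
```

Each string returned by `NineJoins.query` could then be passed to `hiveContext.sql` and its result inserted into my_latlng_gridid_distance in the same way as the hand-written statements.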