
Handling Abnormal Values in Hive

NULL values

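COUNT(col_name) skips every row where col_name is NULL, so to count all log rows you need COUNT(1).

To operate only on the non-null values of a column, use col_name IS NOT NULL rather than LENGTH(col_name) > 1, because LENGTH(NULL) returns NULL instead of a number.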
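The difference between the two counts can be seen on a tiny inline data set. This is only a sketch: the three rows and the alias t are made up for illustration, and it assumes a Hive version that allows SELECT without FROM inside a subquery.

```sql
-- Hypothetical inline data: three log rows, one of them with a NULL col_name.
SELECT
   COUNT(1)        AS all_rows        -- counts every row, including the NULL one
,  COUNT(col_name) AS non_null_rows   -- skips the row where col_name IS NULL
,  SUM(CASE WHEN col_name IS NOT NULL THEN 1 ELSE 0 END) AS flagged_not_null
FROM (
    SELECT 'a' AS col_name
    UNION ALL
    SELECT 'bc' AS col_name
    UNION ALL
    SELECT CAST(NULL AS STRING) AS col_name
) t;
```

Under standard NULL semantics, all_rows should be 3 while the other two columns should be 2.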

---------------------------------------------------------Update 2017-06-08---------------------------------------------------------

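Besides the issues above, NULL values also cause trouble in logical (boolean) statistics. I ran into the following problem in practice.

Problem: given conditions A, B and C, I want to count the rows that satisfy all three, and the rows that fail any one of them.

The original query was: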

SELECT
   COUNT(1) AS all_user
,  SUM(CASE WHEN A AND B AND C THEN 1 ELSE 0 END) AS ok_user
,  SUM(CASE WHEN A AND B AND C THEN 0 ELSE 1 END) AS case_user
,  SUM(CASE WHEN !A THEN 1 ELSE 0 END) AS not_A
,  SUM(CASE WHEN !B THEN 1 ELSE 0 END) AS not_B
,  SUM(CASE WHEN !C THEN 1 ELSE 0 END) AS not_C
FROM tb
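
In principle, not_A + not_B + not_C should be no smaller than case_user, but the number I got was smaller. After some confusion, a more experienced colleague pointed out the cause: NULL values again. When A is NULL, A AND B AND C is not TRUE, so the row falls into the ELSE branch and is counted in case_user. But !A is also NULL, so the same row is not counted in not_A.

The revised query is:
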
SELECT
   COUNT(1) AS all_user
,  SUM(CASE WHEN A AND B AND C THEN 1 ELSE 0 END) AS ok_user
,  SUM(CASE WHEN A AND B AND C THEN 0 ELSE 1 END) AS case_user
,  SUM(CASE WHEN !A OR A IS NULL THEN 1 ELSE 0 END) AS not_A
,  SUM(CASE WHEN !B OR B IS NULL THEN 1 ELSE 0 END) AS not_B
,  SUM(CASE WHEN !C OR C IS NULL THEN 1 ELSE 0 END) AS not_C
FROM tb
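The three-valued logic behind this fix can be checked row by row. The sketch below uses a made-up one-column inline table (alias t) with a single condition A, writes NOT A instead of !A, and adds COALESCE(A, FALSE) as one possible alternative spelling. None of this is from the original query.

```sql
-- Hypothetical inline data: one row each for A = TRUE, FALSE and NULL.
SELECT
   A
,  CASE WHEN A THEN 0 ELSE 1 END                       AS in_case_user     -- NULL falls to ELSE: counted
,  CASE WHEN NOT A THEN 1 ELSE 0 END                   AS not_a_before_fix -- NOT NULL is NULL: not counted
,  CASE WHEN (NOT A) OR (A IS NULL) THEN 1 ELSE 0 END  AS not_a_after_fix  -- NULL handled explicitly
,  CASE WHEN NOT COALESCE(A, FALSE) THEN 1 ELSE 0 END  AS not_a_coalesce   -- treats NULL as FALSE
FROM (
    SELECT TRUE AS A
    UNION ALL
    SELECT FALSE AS A
    UNION ALL
    SELECT CAST(NULL AS BOOLEAN) AS A
) t;
```

For the NULL row, in_case_user is 1 while not_a_before_fix is 0, which is exactly the gap seen in the totals. The last two columns both give 1.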

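The queries above only give a global, row-level count and say nothing about individual users. If a user has several records, some satisfying A AND B AND C and some not, how do we count users under each condition? Use MIN and MAX: taking MIN over the "ok" flag yields 1 only when every record of that user satisfies all conditions, and taking MAX over a "not" flag yields 1 as soon as any record of that user fails the condition. The outer COUNT(1) then counts users rather than records.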

SELECT
  COUNT(1) AS all_user
, SUM(ok) AS all_ok
, SUM(not_A) AS all_not_A
, SUM(not_B) AS all_not_B
, SUM(not_C) AS all_not_C
FROM(
    SELECT
      userid
    , MIN(CASE WHEN A AND B AND C THEN 1 ELSE 0 END) AS ok
    , MAX(CASE WHEN !A OR A IS NULL THEN 1 ELSE 0 END) AS not_A
    , MAX(CASE WHEN !B OR B IS NULL THEN 1 ELSE 0 END) AS not_B
    , MAX(CASE WHEN !C OR C IS NULL THEN 1 ELSE 0 END) AS not_C
    FROM tb
    GROUP BY userid
) a
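To see the MIN/MAX behaviour on concrete rows, here is a reduced sketch with a single condition A and made-up inline data standing in for tb: user 1 has one passing record and one NULL record, user 2 has one passing record.

```sql
-- Hypothetical data: userid 1 has a TRUE and a NULL record, userid 2 has one TRUE record.
SELECT
  COUNT(1)   AS all_user
, SUM(ok)    AS all_ok
, SUM(not_A) AS all_not_A
FROM (
    SELECT
      userid
    , MIN(CASE WHEN A THEN 1 ELSE 0 END) AS ok                         -- 1 only if every record passes
    , MAX(CASE WHEN (NOT A) OR (A IS NULL) THEN 1 ELSE 0 END) AS not_A -- 1 if any record fails
    FROM (
        SELECT 1 AS userid, TRUE AS A
        UNION ALL
        SELECT 1 AS userid, CAST(NULL AS BOOLEAN) AS A
        UNION ALL
        SELECT 2 AS userid, TRUE AS A
    ) tb
    GROUP BY userid
) a;
```

User 1 gets ok = 0 and not_A = 1, user 2 gets ok = 1 and not_A = 0, so the outer query should return 2 users, 1 fully ok user and 1 user failing A.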

NaN values

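When processing data with SQL you sometimes get abnormal results. For example, when a denominator is 0, the query does not raise an error but the result can be NaN. Such values can be filtered out with the following query: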
select col_1, col_2, col_with_nan
from my_table
where some_conditions
  and cast(col_with_nan as String) <> 'NaN';
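Note that the filter above also drops rows where col_with_nan is NULL, because NULL <> 'NaN' evaluates to NULL rather than TRUE. An alternative is to guard the division itself so the bad value is never produced. The sketch below assumes hypothetical columns numerator and denominator on my_table, which do not appear in the original.

```sql
-- Hypothetical columns: numerator and denominator are assumed to exist on my_table.
SELECT
   col_1
,  col_2
,  CASE
     WHEN denominator = 0 THEN NULL      -- avoid dividing by zero; emit NULL instead
     ELSE numerator / denominator
   END AS ratio
FROM my_table;
```

With this approach the abnormal rows stay in the result with a NULL ratio, so they can still be counted or inspected with ratio IS NULL instead of disappearing silently.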