Basic Hive Usage (Processing Data)

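Start the Hive cluster built in the previous post:

sh hive-start.sh

Download the dataset into any directory, as long as you remember where it is:

wget https://raw.githubusercontent.com/ffzs/dataset/master/Questionnaire.csv

Start Hive and connect with beeline: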

schematool -dbType mysql -initSchema
nohup hiveserver2 1>/home/hadoop/hiveserver.log 2>/home/hadoop/hiveserver.err &
beeline -u jdbc:hive2://hadoop1:10000 -n root 

Create a database and switch to it:

create database my;
use my;

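About the data: there are nine columns in total, from left to right: gender, country, age, job, preferred language for data science work, education, major, time spent working in data science, and parents' education.

Create the table: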

create table qn(gender string,
country string,
age int,
job string,
language string,
education string,
major string, 
tenure string,
parentseducation string) 
row format delimited fields terminated by ','
stored as textfile;

Load the local data. A local file needs the local keyword; otherwise Hive looks for the path on HDFS:

load data local inpath '/home/data/Questionnaire.csv' overwrite into table qn;
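
For comparison, here is a sketch of the same load without local. It assumes the file has already been uploaded to HDFS at the hypothetical path /data/Questionnaire.csv (for example with hdfs dfs -put). In this form Hive moves the file into the table's directory rather than copying it.

```sql
-- Hypothetical HDFS path; adjust to wherever the file was uploaded.
-- Without "local", the path is resolved on HDFS and the file is moved.
load data inpath '/data/Questionnaire.csv' overwrite into table qn;
```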

Then take a look at the loaded data.
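
The query used for this check is not shown in the post; a quick look could be done with something like the following sketch (the row limit is arbitrary):

```sql
-- Show the first few rows of the freshly loaded table.
select * from qn limit 5;
```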

Distribution of respondents by country:

create table country as
select qn.country as country, count(qn.country) as count
from qn
where qn.country != 'Other' and qn.country != ''
group by qn.country
order by count desc;

See which countries make up the top ten:

select * from country limit 10;
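
The limit 10 query relies on the rows having been written to the table in sorted order. A more defensive variant, sketched here, sorts explicitly at read time; count is backtick-quoted only because it is also a function name.

```sql
-- Sort at read time instead of relying on the stored row order.
select country, `count`
from country
order by `count` desc
limit 10;
```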

The United States and India have the most respondents.

Median respondent age by country:

create table age as 
select qn.country as country,percentile(qn.age, 0.5) as median
from qn where qn.country!='Other' and qn.country!=''
group by qn.country  
order by median desc;
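
Hive's percentile(col, p) computes an exact percentile and only accepts integer columns, which fits the int age column; p = 0.5 gives the median. As a variation that is not part of the original analysis, the function also accepts an array of percentiles, for example to get the quartiles per country (the alias age_quartiles is illustrative):

```sql
-- Sketch: quartiles of respondent age per country.
-- With an array argument, percentile() returns an array<double>.
select qn.country as country,
       percentile(qn.age, array(0.25, 0.5, 0.75)) as age_quartiles
from qn
where qn.country != 'Other' and qn.country != ''
group by qn.country;
```

For non-integer columns, percentile_approx is the approximate counterpart.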

Take a look at the top ten:

select * from age limit 10;

New Zealand and some European countries appear to have slightly older data scientists.

Median respondent age for countries with more than 400 respondents:

select c.country as country ,c.count as count, a.median as age 
from country as c, age as a
where c.country= a.country and c.count > 400
order by count desc;

Data scientists in developing countries such as China and India tend to be younger.

Median respondent age for the ten countries with the most respondents:

select c.country as country ,c.count as count, a.median as age
from country c left join age a on c.country= a.country
order by count desc
limit 10;

Distribution of respondents by job:

create table job as 
select qn.job as job, count(qn.job) as count
from qn where qn.job!='Other' and qn.job!=''
group by qn.job  
order by count desc;

Top ten:

select * from job limit 10;

Data scientist is the most common job.

Distribution of programming languages:

create table language as 
select qn.language as language, count(qn.language) as count
from qn where qn.language!='Other' and qn.language!=''
group by qn.language  
order by count desc;

Python and R top the list thanks to their ease of use.

Countries where the median age of Python users is above 30:

select qn.country as country, count(qn.country) as count 
from qn 
where language='Python'
group by qn.country
having percentile(qn.age, 0.5) > 30
order by count desc
limit 10;

Most common education level in each country

First, count the respondents for each education level in each country:

create table result1 as
select country, education ,count(gender) as count, GROUPING__ID
from qn 
group by country,education 
grouping sets (( country, education)) 
order by count desc;
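
Because only a single grouping set (country, education) is listed, the aggregation is equivalent to a plain group by on those two columns, and GROUPING__ID has the same value on every row. A sketch of the simpler form, shown as a bare select rather than a new table:

```sql
-- Same per-country, per-education counts without grouping sets.
-- The only difference is that there is no GROUPING__ID column.
select country, education, count(gender) as count
from qn
group by country, education
order by count desc;
```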

Then rank the education levels within each country by count, so that the most common one gets 1:

create table result2 as
select country, education ,count, 
row_number() over (partition by country order by count desc) as num
from result1 
where country!='Other' and country!=''
order by num;

Select the rows with num = 1:

create table result3 as
select country, education, count 
from result2
where num=1
order by count desc;
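
The three steps (count, rank within each country, keep rank 1) can also be written as one query with nested subqueries, avoiding the intermediate tables. This is only a sketch of an alternative; the aliases cnt, t and r are illustrative, and the country filter is applied before ranking, which gives the same top row per country.

```sql
-- Sketch: most common education level per country in a single query.
select country, education, cnt
from (
  -- Rank education levels within each country, largest count first.
  select country, education, cnt,
         row_number() over (partition by country order by cnt desc) as num
  from (
    -- Count respondents per (country, education) pair.
    select country, education, count(gender) as cnt
    from qn
    where country != 'Other' and country != ''
    group by country, education
  ) t
) r
where num = 1
order by cnt desc;
```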

Result:

The most common level in each country turns out to be either a bachelor's or a master's degree.

Save the result to HDFS:

insert overwrite directory "/questionnaire/result3/" select * from result3;
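
Written this way, the output files use Hive's default field delimiter (the Ctrl-A character, \001), which is awkward to read outside Hive. If comma-separated output is preferred, a row format clause can be added; the sketch below assumes a hypothetical output directory /questionnaire/result3_csv/.

```sql
-- Hypothetical output directory; writes comma-separated text files.
insert overwrite directory '/questionnaire/result3_csv/'
row format delimited fields terminated by ','
select * from result3;
```

The written files can then be inspected with hdfs dfs -cat on that directory.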