经典HQL总结

Hive 执行逻辑

driver
compiler sql-》抽象语法树
optimizer 逻辑执行计划 物理执行计划
executor 

星型模型,数据有一定冗余,但是SQL的编写相对容易
雪花模型,数据冗余度很低,SQL的复杂度就很高

数据主题域
用户:user 
商品:info 
  属性:param
交易:trade 
  订单:order
  优惠券:coupon
  回收:recycle
  租赁:lease
  维修:repair
流量:traffic 
  用户行为:ub
  归因:attribute
  直播:live
供应链:scm 
  物流:logistics
  采购:pur
  仓储:store
营销:market 
  内容:content
  活动:activity
服务:service 
  客服:csc
  售后:afs
  鉴定:auth
  质检:qc
商业广告:biz 
  线索:clue
  客户:customer
  充值:recharge
  账务:account
主题域 子域
  活动:activity
  推广:promote
  广告:ad
  增值产品:valueadd
  用户:user
渠道:channel 
地址:address 
财务:finance 
风控:spam 
竞品:compete 

 

视图

view 视图 
CTAS    create table ... as select ....
    table: 物理表
        后面的select会执行,得到的结果就是table的数据
CVAS    create view  .... as select ....
    view:逻辑表
        后面的select不执行,只是记录了这一条SQL
        当你需要从view中拿出数据的时候,那么这个SQL就会执行

view: 特殊的表
      一条复杂SQL的别名
      类似于子查询中的虚表

创建,查看,删除,查询详细信息
create view .... as select ...
show views;
desc view view_name;
drop view_name:


怎么选择呢?
如果这个select的结果很多地方要用。那么就变成 table
写了一个非常复杂的SQL,为了可读性。用view

分桶的规则,就类似于 mapreduce 中的 默认分区规则: HashPartitioner
        
 1000T的数据,想求最大的3个?  
 分而治之:500个task 每个task取2T,取top3 ;最后order by [求全局有序]

经典SQL

1. 统计出每个用户的累积访问次数 -- sum (xx) over ()

考察点 

a.DATE_FORMAT(regexp_replace(visitDate,"/","-"),'yyyy-MM')

b. SUM(t2.xj) over(PARTITION BY t2.userId ORDER BY t2.visitDate) lj 

order by 时间就是累计  只是分区就是求和

输入

userId	visitDate	visitCount
u01		2017/1/21	5
u02		2017/1/23	6
u03		2017/1/22	8
u04		2017/1/20	3
u01		2017/1/23	6
u01		2017/2/21	8
u02		2017/1/23	6
u01		2017/2/22	4

输出

用户id	月份	 小计	累积
u01		2017-01	 11		11
u01		2017-02	 12		23
u02		2017-01	 12		12
u03		2017-01	 8		8
u04		2017-01	 3		3

解题思路

a. 先处理日期格式

SELECT userId,  
DATE_FORMAT(regexp_replace(visitDate,"/","-"),'yyyy-MM') visitDate , visitCount
FROM ACTION    ==>t1
 

b. 根据用户id和月份分组,求小计
SELECT t1.userId, t1.visitDate, SUM(visitCount) xj
FROM t1 
GROUP BY userId,visitDate   ==>t2

组合:
SELECT t1.userId, t1.visitDate, SUM(visitCount) xj
FROM (SELECT userId,  
DATE_FORMAT(regexp_replace(visitDate,"/","-"),'yyyy-MM') visitDate , visitCount
FROM ACTION)t1 
GROUP BY userId,visitDate

+------------+---------------+-----+--+
| t1.userid  | t1.visitdate  | xj  |
+------------+---------------+-----+--+
| u01        | 2017-01       | 11  |
| u01        | 2017-02       | 12  |
| u02        | 2017-01       | 12  |
| u03        | 2017-01       | 8   |
| u04        | 2017-01       | 3   |
+------------+---------------+-----+--+


c. 开窗,通过userId分区,visitData排序, 求累计

SELECT t2.userId,t2.visitDate, t2.xj ,  
SUM(t2.xj) over(PARTITION BY t2.userId ORDER BY t2.visitDate) lj
FROM t2 

SQL实现

SELECT t2.userId,t2.visitDate, t2.xj ,  
SUM(t2.xj) over(PARTITION BY t2.userId ORDER BY t2.visitDate) lj
FROM (SELECT t1.userId, t1.visitDate, SUM(visitCount) xj
FROM (SELECT userId,  
DATE_FORMAT(regexp_replace(visitDate,"/","-"),'yyyy-MM') visitDate , visitCount
FROM ACTION)t1 
GROUP BY userId,visitDate)t2 

2.  rank

1)每个店铺的UV(访客数)

2)每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数

输入

+----------------+-------------+--+
| visit.user_id  | visit.shop  |
+----------------+-------------+--+
| u1             | a           |
| u2             | b           |
| u1             | b           |
| u1             | a           |
| u3             | c           |
| u4             | b           |
| u1             | a           |
| u2             | c           |
| u5             | b           |
| u4             | b           |
| u6             | c           |
| u2             | c           |
| u1             | b           |
| u2             | a           |
| u2             | a           |
| u3             | a           |
| u5             | a           |
| u5             | a           |
| u5             | a           |
+----------------+-------------+--+

思路

a.分组统计次数
b.分组按照次数降序排序
c.取出前3

1)每个店铺的UV(访客数)

SELECT shop ,COUNT(DISTINCT user_id) uv 
FROM visit
GROUP BY shop 

结果

+-------+-----+--+
| shop  | uv  |
+-------+-----+--+
| a     | 4   |
| b     | 4   |
| c     | 3   |
+-------+-----+--+

2)每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数

解题思路

需求二:每个店铺访问次数top3的访客信息。输出店铺名称、访客id、访问次数

a. 通过user_id, shop分组,求每个店铺,每个访客访问的次数
SELECT user_id, shop ,COUNT(user_id) visit_count
FROM visit
GROUP BY user_id, shop   ==>t1

+----------+-------+--------------+--+
| user_id  | shop  | visit_count  |
+----------+-------+--------------+--+
| u1       | a     | 3            |
| u1       | b     | 2            |
| u2       | a     | 2            |
| u2       | b     | 1            |
| u2       | c     | 2            |
| u3       | a     | 1            |
| u3       | c     | 1            |
| u4       | b     | 2            |
| u5       | a     | 3            |
| u5       | b     | 1            |
| u6       | c     | 1            |
+----------+-------+--------------+--+

b.按照shop分区,访问次数排序,求排名

SELECT t1.user_id ,t1.shop ,t1.visit_count,
rank() over(distribute BY t1.shop sort BY t1.visit_count DESC ) rk
FROM t1   ==>t2

组合:
SELECT t1.user_id ,t1.shop ,t1.visit_count,
rank() over(distribute BY t1.shop sort BY t1.visit_count DESC ) rk
FROM (SELECT user_id, shop ,COUNT(user_id) visit_count
FROM visit
GROUP BY user_id, shop )t1  

+-------------+----------+-----------------+-----+--+
| t1.user_id  | t1.shop  | t1.visit_count  | rk  |
+-------------+----------+-----------------+-----+--+
| u5          | a        | 3               | 1   |
| u1          | a        | 3               | 1   |
| u2          | a        | 2               | 3   |
| u3          | a        | 1               | 4   |
| u4          | b        | 2               | 1   |
| u1          | b        | 2               | 1   |
| u5          | b        | 1               | 3   |
| u2          | b        | 1               | 3   |
| u2          | c        | 2               | 1   |
| u6          | c        | 1               | 2   |
| u3          | c        | 1               | 2   |
+-------------+----------+-----------------+-----+--+



c. 取top3
SELECT t2.user_id, t2.shop, t2.visit_count 
FROM t2 
WHERE t2.rk <=3

组合: 
SELECT t2.user_id, t2.shop, t2.visit_count 
FROM (SELECT t1.user_id ,t1.shop ,t1.visit_count,
rank() over(distribute BY t1.shop sort BY t1.visit_count DESC ) rk
FROM (SELECT user_id, shop ,COUNT(user_id) visit_count
FROM visit
GROUP BY user_id, shop )t1  )t2 
WHERE t2.rk <=3

+-------------+----------+-----------------+--+
| t2.user_id  | t2.shop  | t2.visit_count  |
+-------------+----------+-----------------+--+
| u5          | a        | 3               |
| u1          | a        | 3               |
| u2          | a        | 2               |
| u4          | b        | 2               |
| u1          | b        | 2               |
| u5          | b        | 1               |
| u2          | b        | 1               |
| u2          | c        | 2               |
| u6          | c        | 1               |
| u3          | c        | 1               |
+-------------+----------+-----------------+--+

 

3 .漏斗模型--转化率

输入

step, name, pv
1,广告,10000
2,菜单,3000
3,商品详情,2600
4,购物车,300
5,下单,200
6,支付,190
7,支付成功,189

输出

step     name       pv          pct         
1        广告       10000      	100.0            
2        菜单       3000       	30.0            
3        商品详情   2600       	86.66666666666667      
4        购物车     300      	11.538461538461538     
5        下单       200      	66.66666666666666      
6        支付       190      	95.0            
7        支付成功   189      	99.47368421052632  

a.相对于最初步骤的转化率
 

set hive.mapred.mode = nonstrict;
select aa.step as step, aa.name as name, aa.apv as pv, aa.apv/aa.bpv * 100 as
pct from
(select a.step as step, a.name as name, a.pv as apv, b.bpv as bpv
from t_order a, (select max(pv) as bpv from t_order) b) aa;

b.每一步骤相对于上一步骤的转化率

create table order_trans as
select a.step as step, a.name as name, a.pv as apv, b.pv as bpv
from t_order a join t_order b
on a.step = b.step + 1
where a.step != 1
union
select a.step as step, a.name as name, a.pv as apv, a.pv as bpv
from t_order a where a.step == 1;
select a.step as step, a.name as name, a.apv as pv, a.apv/a.bpv * 100 as pct
from order_trans a;

 

select a.step as step, a.name as name, a.pv as apv, b.pv as bpv
from ld a join ld b
on a.step = b.step + 1
where a.step != 1

select a.step as step, a.name as name, a.pv as apv, a.pv as bpv
from ld a where a.step = 1;

select a.step as step, a.name as name, a.pv as apv, b.pv as bpv
from ld a join ld b
on a.step = b.step + 1
where a.step != 1
union
select a.step as step, a.name as name, a.pv as apv, a.pv as bpv
from ld a where a.step = 1;

建表

DROP TABLE ld;
Create table If Not Exists ld (step int,name varchar(50), pv INT)
# 中文
alter table ld change name name varchar(255) 
character set utf8 collate utf8_unicode_ci not null DEFAULT ''; 

insert into ld (step,name,pv) VALUES ('1',"广告",'10000');
insert into ld (step,name,pv) values (2,"菜单",3000);
insert into ld (step,name,pv) values (3,"商品详情",2600);
insert into ld (step,name,pv) values (4,"购物车",300);
insert into ld (step,name,pv) values (5,"下单",200);
insert into ld (step,name,pv) values (6,"支付",190);
insert into ld (step,name,pv) values (7,"支付成功",189);

 

4.每个用户截止到每月为止的最大单月访问次数和累计到该月的总访问次数 --  sum (xx) over ()    max(xx) over ()

考察点 考察窗口函数

max(a.pv) over (partition by a.name order by a.name,a.month rows between unbounded preceding and current row) as maxpv,
sum(a.pv) over (partition by a.name order by a.name,a.month rows between unbounded preceding and current row)  as sumpv

输入

用户名,月份,访问次数
name,month,pv
A,2020-01,5
A,2020-01,15
B,2020-01,5
A,2020-01,8
B,2020-01,25
A,2020-01,5
A,2020-02,4
A,2020-02,6
B,2020-02,10
B,2020-02,5
A,2020-03,16
A,2020-03,22
B,2020-03,23
B,2020-03,10
B,2020-03,11

输出

用户	月份		当月访问次数	最大访问次数	总访问次数		
A       2020-01		33				33				33
A       2020-02		10				33				43
A       2020-03		38				38				81
B       2020-01		30				30				30
B       2020-02		15				30				45
B       2020-03		44				44				89

sql实现1

create table table2 as 
select a.name, a.month, a.pv, 
max(a.pv) over (partition by a.name order by a.name,a.month rows between unbounded preceding and current row) as maxpv,
sum(a.pv) over (partition by a.name order by a.name,a.month rows between unbounded preceding and current row)  as sumpv
from table1 a;

解释

如果不指定 ROWS BETWEEN,默认为从起点到当前行;
如果不指定 ORDER BY,则将分组内所有值累加;
关键是理解ROWS BETWEEN含义,也叫做WINDOW子句:
PRECEDING:往前
FOLLOWING:往后
CURRENT ROW:当前行
UNBOUNDED:起点,UNBOUNDED PRECEDING 表示从前面的起点, UNBOUNDED FOLLOWING:表示到后面的终点

sql实现2  -- 自连接

步骤a

自连接(table a join table b on a.name = b.name)
 

create table table2 as 
select a.name as aname, a.month as amonth, a.pv as apv, 
b.name as bname, b.month as bmonth, b.pv as bpv 
from table1 a join table1 b on a.name = b.name;

结果

aname	amonth		apv	bmonth		bname	bpv
A       2020-01		33	2020-01		A	33

A       2020-01		33	2020-02		A	10	
A       2020-02		10	2020-02		A	10	

A       2020-01		33	2020-03		A	38
A       2020-02		10	2020-03		A	38
A       2020-03		38	2020-03		A	38

步骤2

select aa.bname, aa.bmonth, aa.bpv, max(aa.apv) as maxpv,  sum(aa.apv) as sumpv 
from (select * from table2 where amonth <= bmonth) aa
group by aa.bname, aa.bmonth, aa.bpv;

 

5.  行列转换

需求1: 所有数学课程成绩 大于 语文课程成绩的学生的学号

输入

id,学号,课程, 成绩
id,sid,course,score 
1, 1,  Chinese,43
2, 1,  Math,   55
3, 2,  Chinese,77
4, 2,  Math,   88
5, 3,  Chinese,98
6, 3,  Math,   65
7, 3,  English,80

如果是如下结果,即可求解  --一行一个成绩  -->  一列一个成绩

sid Chinese Math 	English
1 	43 		55 		0
2 	77 		88 		0 
3 	98 		65 		60

行转列

select sid, 
max(case course when "Chinese" then score else 0 end) as Chinese,
max(case course when "Math" then score else 0 end) as Math,
max(case course when "English" then score else 0 end) as English 
from course group by sid;

最终sql

select aa.sid from (
select sid, 
max(case course when "Chinese" then score else 0 end) as Chinese,
max(case course when "Math" then score else 0 end) as Math,
max(case course when "English" then score else 0 end) as English 
from course group by sid
) aa where aa.Math > aa.Chinese;

 

需求2 求每个课程选修的学生情况

输入

表示有id为1,2,3的学生选修了课程a,b,c,d,e,f中其中几门:
学生id,选修课程
id course 
1,a
1,b
1,c
1,e
2,a
2,c
2,d
2,f
3,a
3,b
3,c
3,e

输出

表中的1表示选修,表中的0表示未选修

id	a	b	c	d	e	f
1	1	1	1	0	1	0
2	1	0	1	1	0	1
3	1	1	1	0	1	0

解题思路

1. 先求出所有课程的集合
select collect_set(course) as courses from exercise0202;
["a","b","c","e","d","f"

开启笛卡尔积开关,编写SQL构造中间表数据
set hive.strict.checks.cartesian.product=false;
set hive.exec.mode.local.auto=true;


2.编写SQL实现 course 字段(当前id选修的课程)和 courses 字段(所有课程)
create table exercise_tmp as 
select t1.id as id,t1.course as course,t2.course courses 
from 
(select id as id,collect_set(course) as course from exercise_table group by id ) t1 
join 
(select collect_set(course) as course from exercise_table) t2;

结果:exercise_tmp
select id, course, courses from exercise_tmp;
id		course					courses
1       ["a","b","c","e"]       ["a","b","c","e","d","f"]
2       ["a","c","d","f"]       ["a","b","c","e","d","f"]
3       ["a","b","c","e"]       ["a","b","c","e","d","f"]


3.得出最终结果:
具体做法是:拿出course字段中的每一个元素在exercise_tmp中进行判断,看是否存在。
select id,
case when array_contains(course, courses[0]) then 1 else 0 end as a,
case when array_contains(course, courses[1]) then 1 else 0 end as b,
case when array_contains(course, courses[2]) then 1 else 0 end as c,
case when array_contains(course, courses[3]) then 1 else 0 end as d,
case when array_contains(course, courses[4]) then 1 else 0 end as e,
case when array_contains(course, courses[5]) then 1 else 0 end as f 
from exercise_tmp;

结果:
1       1       1       1       1       0       0
2       1       0       1       0       1       1
3       1       1       1       1       0       0


4.优化一下,得到最终的完整的SQL:
select abc.id,
case when array_contains(abc.course, abc.courses[0]) then 1 else 0 end as a,
case when array_contains(abc.course, abc.courses[1]) then 1 else 0 end as b,
case when array_contains(abc.course, abc.courses[2]) then 1 else 0 end as c,
case when array_contains(abc.course, abc.courses[3]) then 1 else 0 end as d,
case when array_contains(abc.course, abc.courses[4]) then 1 else 0 end as e,
case when array_contains(abc.course, abc.courses[5]) then 1 else 0 end as f 
from (
select t1.id as id,t1.course as course,t2.course courses 
from 
(select id as id,collect_set(course) as course from exercise_table group by id ) t1 
join 
(select collect_set(course) as course from exercise_table) t2
) abc;

5.使用with语法改写优化:
with 
t1 as (select id as id,collect_set(course) as course from exercise_table group by id),
t2 as (select collect_set(course) as course from exercise_table), 
abc as (select t1.id as id,t1.course as course,t2.course courses from t1 join t2) 
select abc.id,
case when array_contains(abc.course, abc.courses[0]) then 1 else 0 end as a,
case when array_contains(abc.course, abc.courses[1]) then 1 else 0 end as b,
case when array_contains(abc.course, abc.courses[2]) then 1 else 0 end as c,
case when array_contains(abc.course, abc.courses[3]) then 1 else 0 end as d,
case when array_contains(abc.course, abc.courses[4]) then 1 else 0 end as e,
case when array_contains(abc.course, abc.courses[5]) then 1 else 0 end as f 
from abc;

最终结果:
1       1       1       1       1       0       0
2       1       0       1       0       1       1
3       1       1       1       1       0       0

 

 

 

6. explode(array / map)  top-n问题

1. explode(array)

问题:求出每种爱好中,年龄最大的两个人(爱好,年龄,姓名)

注意:每一条记录中的爱好有多个值,以"-"分隔

输入 

id,姓名,年龄,爱好
id,name,age,favors
1,huangxiaoming,45,a-c-d-f
2,huangzitao,36,b-c-d-e
3,huanglei,41,c-d-e
4,liushishi,22,a-d-e
5,liudehua,39,e-f-d
6,liuyifei,35,a-d-e

输出

1       huangxiaoming   45      a
6       liuyifei        35      a
2       huangzitao      36      b
1       huangxiaoming   45      c
3       huanglei        41      c
1       huangxiaoming   45      d
3       huanglei        41      d
3       huanglei        41      e
5       liudehua        39      e
1       huangxiaoming   45      f
5       liudehua        39      f

select * from table where rank <=2;

思路分析

1、列转行 lateral view + explode()  示例: LATERAL VIEW explode(split(a.favors, "-")) favor_view as favor;
2、求TopN + row_number()

a. explode

select a.id,a.name,a.age, fv.favor as favor from exercise_expode_array_table a lateral view explode(split(favors, "-")) fv as favor;

结果

1,huangxiaoming,45,a
1,huangxiaoming,45,c
1,huangxiaoming,45,d
1,huangxiaoming,45,f

b.取最大年龄  -- 使用普通的分组聚合技巧就可以求得每种爱好中年龄最大的一个人

select aa.favor, max(aa.age) as maxage 
from 
(
select a.id as id, a.name as name, a.age as age,  favor_view.favor 
from exercise_expode_array_table a
LATERAL VIEW explode(split(a.favors, "-")) favor_view as favor
) aa 
group by aa.favor;

a       45
b       36
c       45
d       45
e       41
f       45

c.取最大年龄的2个人

select c.id, c.name, c.age, c.favor 
from 
(
select b.id, b.name, b.age, b.favor, 
row_number() over (partition by b.favor order by b.age desc) as rank 
from 
(
select a.id as id, a.name as name, a.age as age,  favor_view.favor 
from exercise_expode_array_table a
LATERAL VIEW explode(split(a.favors, "-")) favor_view as favor 
) b 
) c 
where c.rank <= 2;

4. 使用with语法改写

with 
b as (select a.id as id, a.name as name, a.age as age,  favor_view.favor 
from exercise_expode_array_table a LATERAL VIEW explode(split(a.favors, "-")) favor_view as favor),
c as (select b.id, b.name, b.age, b.favor, 
row_number() over (partition by b.favor order by b.age desc) as rank 
from b) 
select c.id, c.name, c.age, c.favor from c where c.rank <= 2;

2. explode(map)

题目 每个学生成绩最好的课程及分数、最差的课程及分数、平均分数

输入

huangbo Chinese:80,Math:89,English:95
xuzheng Chinese:70,Math:65,English:81
wangbaoqiang Chinese:75,Math:100,English:77

输出

huangbo			88.0    Chinese   80      English   95
wangbaoqiang    84.0    Chinese   75      Math  	100
xuzheng			72.0    Math 	  65      English   81

1.建表

create database if not exists exercise;
use exercise;
drop table if exists student_score;
create table if not exists student_score(name string, score map<string,int>) 
row format delimited fields terminated by ' ' 
collection items terminated by ',' 
map keys terminated by ':';
load data local inpath '/home/hadoop/student_score.txt' into table student_score;
select name, score from student_score;

2. 辅助SQL

辅助SQL
第一个:
select name, score["Chinese"] as Chinese, score["Math"] as Math, score["English"] as English from student_score;
结果:
huangbo			80      89      95
xuzheng			70      65      81
wangbaoqiang    75      100     77

第二个:
select explode(score) as (course, score) from student_score;
结果:
Chinese   80
Math  	  89
English   95
Chinese   70
Math  	  65
English   81
Chinese   75
Math  	  100
English   77

第三个:
select a.name, tt.course, tt.score from student_score a lateral view explode(a.score) tt as course, score;
结果:
huangbo			Chinese  80
huangbo			Math 	 89
huangbo			English  95
xuzheng			Chinese  70
xuzheng			Math  	 65
xuzheng			English  81
wangbaoqiang    Chinese  75
wangbaoqiang    Math  	 100
wangbaoqiang    English  77

第四个:
select a.name, max(tt.score) as maxscore, 
min(tt.score) as minscore,
avg(tt.score) as avgscore 
from student_score a 
lateral view explode(a.score) tt as course, score group by a.name;
结果:
huangbo			95      80      88.0
wangbaoqiang    100     75      84.0
xuzheng			81      65      72.0

 

 

 

 

 

 

相关推荐
©️2020 CSDN 皮肤主题: Age of Ai 设计师:meimeiellie 返回首页