HiveSQL高级进阶技巧案例(中)


1)经纬度计算距离

select t1.uid,t2.uid,
t1.latitude,t1.longitude,t2.latitude,t2.longitude,
-- 距离
6378137*2*ASIN(SQRT(POWER(SIN((t1.longitude-t2.longitude)*ACOS(-1)/360),2)
+ COS(t1.longitude*ACOS(-1)/180)*COS(t2.longitude*ACOS(-1)/180)*POWER(SIN((t1.longitude-t2.longitude)*ACOS(-1)/360),2))) as distance_

from ( select uid,latitude,longitude
from t1
where ) t1

left join (
select uid,latitude,longitude
from t2
where
) t2
on t1.uid = t2.uid

2)带emoji表情/纯emoji表情

-- 含表情

size(extract_emoji(regexp_replace(text_comment_content, '@.*?\)', ''))) > 0

-- 纯表情

size(extract_emoji(regexp_replace(text_comment_content, '@.*?\)', ''))) = length(text_comment_content)

3)13bigint转时间戳

-- 13位bigint 转 yyyy-MM-dd HH:dd:ss.sss

concat(from_unixtime(cast(substr(1234567890123,0,10) as bigint),'yyyy-MM-dd HH:dd:ss'),'.',substr(1234567890123,11,13))

-- 取13位bigint当前系统时间戳

select bigint(double(current_timestamp())*1000)

4)连续登录天数-升级版

-- 用户最近一次连续观看博主天数(注:连续天数>=2;若博主未开播,用户观看连续状态不中断)

select uid,
bid,
continue_play_days -- 连续观看天数(最近一次)
from (
select uid,
bid,
dt,
continue_play_days
from
(
select uid,bid,dt,rn_diff,
sum(1) over(partition by uid,bid,rn_diff) as continue_play_days,
row_number()over(partition by uid,bid order by rn_diff) as rn
from
(
select
coalesce(t1.uid,-1) as uid, t2.bid, t2.dt ,
row_number()over(partition by t1.uid,t2.bid order by t2.dt desc ) as rn_play ,rn_author,
rn_author - row_number()over(partition by t1.uid, t2.bid order by t2.dt desc ) as rn_diff
from
( select uid, -- 用户ID
bid, -- 博主ID
dt
from t1 -- 用户观看博主表
where dt..
group by uid,
bid,
dt) t1
full join (

select bid,dt,
row_number() over(partition by bid order by dt desc ) as rn_author
from (
select bid, -- 博主ID
dt
from t2 -- 博主开播表
where dt..
group by bid,
dt) u
) t2
on t1.bid = t2.bid and t1.dt=t2.dt
) s
where uid <> -1
) ss
where rn = 1
) t
where continue_play_days >= 2

5)lateral view outer 坑、丢数

当explode函数里传入的数据是否为null,lateral view explode(null) tmp as id 时,结果不显示任何数据**(注意:是指其他字段的数据也不返回,原表数据丢失**很容易被忽视);

lateral view outer explode(null) tmp as id 时,结果显示其他字段是有数据的,但id显示为null,原表数据不会丢失

SELECT *
FROM t1
lateral view outer explode(get_json_object(`data`, '$.key')) t2 AS notice
WHERE dt = 

6)多维分析 grouping=1

在cube多维处理时,对维度组合中的维度用coalease或nvl处理存在隐患,无法区分维度本身的取值为null,建议用grouping()处理,处理逻辑如下:

if(grouping(xxx)=1,'ALL',xxx) as xxx

附官方截图:

HiveSQL高级进阶技巧案例(中)

7)引擎不一致,执行结果不同

to_json(named_struct()) 拼接json字符串时MR、Spark引擎返回结果不一致

to_json(named_struct('userID',userid)) 拼接json字符串中key值存在大写字母时,在spark执行时是保持原值的,但是在mr执行时都会转成小写字母。两个引擎返回结果不一致。


大家有在工作中碰到吗,欢迎一起讨论学习

发表评论
留言与评论(共有 0 条评论) “”
   
验证码:

相关文章

推荐文章