有日志如下,请写出代码求得所有用户和活跃用户的总数及平均年龄。(活跃用户指连续两天都有访问记录的用户)
数据准备
最后需完成的结果表
步骤1,所有用户的总数及平均年龄
(1). 将数据去重
with t1 as (
select distinct
user_id,
age
from test5
)
(2). 对用户总数以及平均年龄进行计算
t2 as (
select count(user_id) as cnt,
avg(age) as avg_age
from t1
)
步骤2 活跃用户的总数及平均年龄, 活跃用户指连续两天都有访问记录的用户【连续N天登录问题】
(1). 对表格进行全局的数据清洗
t3 as (
select distinct dt,
user_id,
age
from test5
)
(2). 求连续活跃用户,先对每个用户进行标记。
标记方法:先将同一个用户分到一组,然后对dt
进行排序,用dt
减去序号。
注:一般求连续XXX的问题,基本都是用到row_number,rank等窗口函数增加标签列(伪列),然后用某列减去它们,如果是连续的,则减完后的结果是一致的,其他变相连续问题,也是该思考方式,主要是找到规律
t4 as (
select dt,
user_id,
age,
--同一个客户,按照不同日期排序,得到序号
row_number() over (partition by user_id order by dt) as rn
from t3
),
select * from t4
用dt
减去rn
得到天数
t5 as (
select *,
--用日期减去序号得到临时日期
date_sub(dt,rn) as date2
from t4
),
select * from t5
(3). 同一个用户data2
的结果是相同的,那就可以表明是连续登录的。且连续的次数大于等于2
即为连续活跃用户。
t6 as (--统计date2临时日期出现几次。如果2次则表示连续登陆2次
select user_id,
date2,
count(1) as cnt
from t5
group by user_id,date2
having count(1)>=2
),
select * from t6
(4). 一个用户可能有多次连续登录的情况,所以对上述结果去重
t7 as (
select distinct user_id,age
from t6
),
select * from t7
(5). 求活跃用户总数
t8 as (
select '活跃用户' as type,
count(user_id) as cnt,
avg(age) as avg_age
from t7
)
select * from t8;
完善t2
代码,并将它t2
与t8
合并,使用union all
t2 as (
select '所有用户' as type,
count(user_id) as cnt,
avg(age) as avg_age
from t1
)
select * from t2
union all
select * from t8;
完整代码
CREATE TABLE test5
(
dt string,
user_id string,
age int
) ROW format delimited fields terminated BY ',';
INSERT INTO TABLE test_sql.test5
VALUES ('2019-02-11', 'test_1', 23);
INSERT INTO TABLE test_sql.test5
VALUES ('2019-02-11', 'test_2', 19);
INSERT INTO TABLE test_sql.test5
VALUES ('2019-02-11', 'test_3', 39);
INSERT INTO TABLE test_sql.test5
VALUES ('2019-02-11', 'test_1', 23);
INSERT INTO TABLE test_sql.test5
VALUES ('2019-02-11', 'test_3', 39);
INSERT INTO TABLE test_sql.test5
VALUES ('2019-02-11', 'test_1', 23);
INSERT INTO TABLE test_sql.test5
VALUES ('2019-02-12', 'test_2', 19);
INSERT INTO TABLE test_sql.test5
VALUES ('2019-02-13', 'test_1', 23);
INSERT INTO TABLE test_sql.test5
VALUES ('2019-02-15', 'test_2', 19);
INSERT INTO TABLE test_sql.test5
VALUES ('2019-02-16', 'test_2', 19);
--有日志如下,请写出代码求得所有用户和活跃用户的总数及平均年龄。(活跃用户指连续两天都有 访问记录的用户)
--步骤 1 所有用户的总数及平均年龄
with t1 as (
select distinct
user_id,
age
from test5
),
t2 as (
select '所有用户' as type,
count(user_id) as cnt,
avg(age) as avg_age
from t1
),
--步骤 2 活跃用户的总数及平均年龄,活跃用户指连续两天都有 访问记录的用户)
t3 as (
select distinct dt,
user_id,
age
from test5
),
t4 as (
select dt,
user_id,
age,
--同一个客户,按照不同日期排序,得到序号
row_number() over (partition by user_id order by dt) as rn
from t3
),
t5 as (
select *,
--用日期减去序号得到临时日期
date_sub(dt,rn) as date2
from t4
),
t6 as (--统计date2临时日期出现几次。如果2次则表示连续登陆2次
select user_id,
date2,
max(age) age,
count(1) as cnt
from t5
group by user_id,date2
having count(1)>=2
),
t7 as (
select distinct user_id,age
from t6
),
t8 as (
select '活跃用户' as type,
count(user_id) as cnt,
avg(age) as avg_age
from t7
)
select * from t2 union all
select * from t8;