Hive函数02
窗口函数
窗口函数(Window Functions)也叫开窗函数、OLAP函数,其最大特点是:输入值是从SELECT语句结果集中的一行或多行组成的“窗口”中获取的。
窗口函数可以简单理解为与聚合函数类似的计算函数。区别在于:通过GROUP BY子句的常规聚合会隐藏参与聚合的各行,每组最终只输出一行;而窗口函数在聚合之后仍然可以访问窗口内的各行,并能把这些行的某些属性一并输出到结果集中。
select * from stu_mark;
-- 常规分组查询 求分数和
select sname,sum(score) from stu_mark group by sname;
-- 窗口函数分组
select sname,subject,score,sum(score) over (partition by sname) as total_score from stu_mark;
1 窗口函数语法规则
Function(arg1, ..., argn) OVER([PARTITION BY <...>] [ORDER BY <...>] [<window_expression>])
-- 其中Function(arg1, ..., argn)可以是下面分类中的任意一个
-- 聚合函数:比如sum、max、avg等
-- 排序函数:比如rank、row_number等
-- 分析函数:比如lead、lag、first_value等
-- OVER [PARTITION BY <...>] 类似于group by,用于指定分组,每个分组可以把它叫做窗口
-- 如果没有PARTITION BY,那么整张表的所有行就是一组
-- [ORDER BY <...>] 用于指定每个分组内的数据排序规则,支持ASC、DESC
-- [<window_expression>] 用于指定每个窗口中操作的数据范围,默认是窗口中所有行
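把上面三个部分组合起来的一个完整示例(基于前面的 stu_mark 表,仅作示意):
-- 每个学生按分数升序,在组内做累积求和:三个子句同时出现
select sname,
       subject,
       score,
       sum(score) over (partition by sname order by score rows between unbounded preceding and current row) as running_score
from stu_mark;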
窗口函数使用
建表准备数据
name cdate money
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94
create table t_orders
(
name string,
cdate string,
money double
) row format delimited fields terminated by ',';
load data local inpath '/root/orders.txt' into table t_orders;
select * from t_orders;
-- 查询每个用户总订单金额
select name ,sum(money) from t_orders group by name;
-- 查询每个月的订单总数 count + group by 常规聚合操作
select substr(cdate,0,7) month, count(1) from t_orders group by substr(cdate,0,7);
-- sum+窗口函数 总共有四种用法
-- sum(...) over()对表所有行求和
-- sum(...) over(order by ...) 连续累积求和
-- sum(...) over(partition by ...) 同组内所有行求和
-- sum(...) over(partition by...order by...) 在每个分组内连续累积求和
-- 查询所有用户的总订单金额
-- sum(...) over()对表所有行求和
select * ,sum(money) over() as total_money from t_orders;
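上面四种用法中的第二种 sum(...) over(order by ...) 这里补一个示意:不加 PARTITION BY,整张表按日期排序后连续累积求和。
-- sum(...) over(order by ...) 全表按日期连续累积求和
select *, sum(money) over (order by cdate) as acc_money from t_orders;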
-- 查询每个用户的订单总金额
-- sum(...) over(partition by ...) 同组内所有行求和
select * ,sum(money) over(partition by name) as user_money from t_orders;
-- 查询每个用户的订单总金额 按天数排序 累加
-- sum(...) over(partition by...order by...) 在每个分组内连续累积求和
select name,money,sum(money) over(partition by name order by cdate) as user_money from t_orders;
+-------+--------+-------------+
| name | money | user_money |
+-------+--------+-------------+
| jack | 10.0 | 10.0 |
| jack | 46.0 | 56.0 |
| jack | 55.0 | 111.0 |
| jack | 23.0 | 134.0 |
| jack | 42.0 | 176.0 |
| mart | 62.0 | 62.0 |
| mart | 68.0 | 130.0 |
| mart | 75.0 | 205.0 |
| mart | 94.0 | 299.0 |
| neil | 12.0 | 12.0 |
| neil | 80.0 | 92.0 |
| tony | 15.0 | 15.0 |
| tony | 29.0 | 44.0 |
| tony | 50.0 | 94.0 |
+-------+--------+-------------+
-- 查询每个月的订单总金额 按照天数累加
select concat(substr(cdate,6,2),'月') as month, cdate, sum(money) over (partition by substr(cdate,0,7) order by cdate) from t_orders;
+--------+-------------+---------------+
| month | cdate | sum_window_0 |
+--------+-------------+---------------+
| 01月 | 2017-01-01 | 10.0 |
| 01月 | 2017-01-02 | 25.0 |
| 01月 | 2017-01-04 | 54.0 |
| 01月 | 2017-01-05 | 100.0 |
| 01月 | 2017-01-07 | 150.0 |
| 01月 | 2017-01-08 | 205.0 |
| 02月 | 2017-02-03 | 23.0 |
| 04月 | 2017-04-06 | 42.0 |
| 04月 | 2017-04-08 | 104.0 |
| 04月 | 2017-04-09 | 172.0 |
| 04月 | 2017-04-11 | 247.0 |
| 04月 | 2017-04-13 | 341.0 |
| 05月 | 2017-05-10 | 12.0 |
| 06月 | 2017-06-12 | 80.0 |
+--------+-------------+---------------+
2 窗口表达式
- 在sum(...) over(partition by ... order by ...)语法完整的情况下,进行的是累积聚合操作,默认的累积聚合行为是:从第一行聚合到当前行。
- Window expression(窗口表达式)给我们提供了一种控制行范围的能力,比如向前2行、向后3行。
关键字是rows between,包括下面这几个选项:
- preceding:往前
- following:往后
- current row:当前行
- unbounded:无边界(起点或终点)
- unbounded preceding:表示从前面的起点开始
- unbounded following:表示到后面的终点结束
代码演示
-- 前一行到当前行
select name,
money,
sum(money) over (partition by name order by cdate rows between 1 preceding and current row ) as user_money
from t_orders;
-- 当前行到后一行
select name,
money,
sum(money) over (partition by name order by cdate rows between current row and 1 following ) as user_money
from t_orders;
-- 前一行到后一行
select name,
money,
sum(money) over (partition by name order by cdate rows between 1 preceding and 1 following ) as user_money
from t_orders;
-- 当前行到最后一行
select name,
money,
sum(money)
over (partition by name order by cdate rows between current row and unbounded following ) as user_money
from t_orders;
-- 第一行到最后一行 组内所有行
select name,
money,
sum(money)
over (partition by name order by cdate rows between unbounded preceding and unbounded following ) as user_money
from t_orders;
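再补充一种:从第一行到当前行,这与不写窗口表达式时的默认累积行为等价(示意)。
-- 第一行到当前行(等价于默认行为)
select name,
       money,
       sum(money)
       over (partition by name order by cdate rows between unbounded preceding and current row ) as user_money
from t_orders;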
3 编号函数
row_number:在每个分组中,为每行分配一个从1开始的唯一序列号,依次递增,不考虑值重复;
rank:在每个分组中,为每行分配一个从1开始的序列号,值相同的行序号相同,并挤占后续位置(如1、2、2、4);
dense_rank:在每个分组中,为每行分配一个从1开始的序列号,值相同的行序号相同,不挤占后续位置(如1、2、2、3);
select name,
money,
row_number() over (partition by name order by money) as r_num,
rank() over (partition by name order by money) as rank,
dense_rank() over (partition by name order by money) as ds_rank
from t_orders;
(图示:row_number、rank、dense_rank 的查询结果,原图 img/28_num.png)
-- 查询每个人消费最高的订单的日期和金额
select name,
cdate,
money,
row_number() over (partition by name order by money desc) as r_num
from t_orders;
with t1 as (select name,
cdate,
money,
row_number() over (partition by name order by money desc) as r_num
from t_orders)
select name, cdate, money
from t1
where r_num = 1;
+-------+-------------+--------+
| name | cdate | money |
+-------+-------------+--------+
| jack | 2017-01-08 | 55.0 |
| mart | 2017-04-13 | 94.0 |
| neil | 2017-06-12 | 80.0 |
| tony | 2017-01-07 | 50.0 |
+-------+-------------+--------+
ntile函数
将每个分组内的数据分到指定数量的桶中(即分为若干个部分),并为每一行返回所在桶的编号。
如果不能平均分配,则优先分配给编号较小的桶,各个桶中的行数最多相差1。
有时会有这样的需求:数据排序后分为三部分,业务人员只关心其中的某一部分,如何将中间三分之一的数据拿出来?
这时就可以使用ntile函数。
select name,
money,
ntile(3) over (partition by name order by money) as r_num
from t_orders;
(图示:ntile(3) 的查询结果,原图 img/29_ntile.png)
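针对上面“取中间三分之一”的需求,可以先对全表按金额排序并用 ntile(3) 编桶,再取编号为2的桶(示意):
-- 取按金额排序后中间三分之一的订单
with t1 as (select *, ntile(3) over (order by money) as nt from t_orders)
select name, cdate, money
from t1
where nt = 2;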
-- 查询前20%时间的订单信息 ntile(5)
select *, ntile(5) over (order by cdate) as n
from t_orders;
with t1 as (select *,
ntile(5) over (order by cdate) as n
from t_orders)
select name, cdate, money
from t1
where n = 1;
+-------+-------------+--------+
| name | cdate | money |
+-------+-------------+--------+
| jack | 2017-01-01 | 10.0 |
| tony | 2017-01-02 | 15.0 |
| tony | 2017-01-04 | 29.0 |
+-------+-------------+--------+
4 窗口分析函数
- lag(col,n,default):用于统计窗口内往上第n行的值。第一个参数为列名,第二个参数为往上第n行(可选,默认为1),第三个参数为默认值(当往上第n行为NULL时取默认值,如不指定则为NULL);
- lead(col,n,default):用于统计窗口内往下第n行的值。第一个参数为列名,第二个参数为往下第n行(可选,默认为1),第三个参数为默认值(当往下第n行为NULL时取默认值,如不指定则为NULL);
- first_value:取分组内排序后,截止到当前行的第一个值;
- last_value:取分组内排序后,截止到当前行的最后一个值。
-- lag 上n行
select name,
money,
lag(money, 1) over (partition by name order by money) as num1,
lag(money, 1, 0) over (partition by name order by money) as num2
from t_orders;
--lead 下n行
select name,
money,
lead(money, 1) over (partition by name order by money) as num1,
lead(money, 1, 0) over (partition by name order by money) as num2
from t_orders;
--first_value 第一行
select name,
money,
first_value(money) over (partition by name order by money) as num
from t_orders;
--last_value 取截止到当前行的最后一个值,默认窗口下当前行就是最后一行
select name,
money,
last_value(money) over (partition by name order by money ) as num
from t_orders;
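如果想取整个分组内真正的最后一个值,需要显式把窗口范围扩展到整个分组(示意):
-- last_value 配合窗口表达式取分组内最后一行
select name,
       money,
       last_value(money)
       over (partition by name order by money rows between unbounded preceding and unbounded following ) as num
from t_orders;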
-- 查询顾客上次的购买时间
select name,
cdate,
lag(cdate, 1) over (partition by name order by cdate) as last_date
from t_orders;
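类似地,用 lead 可以查询顾客下次的购买时间(示意):
-- 查询顾客下次的购买时间
select name,
       cdate,
       lead(cdate, 1) over (partition by name order by cdate) as next_date
from t_orders;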
2 查询练习
1 连续登陆
uid login_date
001,2017-02-05 12:00:00
001,2017-02-05 14:00:00
001,2017-02-06 13:00:00
001,2017-02-07 12:00:00
001,2017-02-08 12:00:00
001,2017-02-10 14:00:00
002,2017-02-05 13:00:00
002,2017-02-06 12:00:00
002,2017-02-06 14:00:00
002,2017-02-08 12:00:00
002,2017-02-09 16:00:00
002,2017-02-10 12:00:00
003,2017-01-31 13:00:00
003,2017-01-31 12:00:00
003,2017-02-01 12:00:00
004,2017-02-02 12:00:00
004,2017-02-03 12:00:00
004,2017-02-10 12:00:00
004,2017-03-01 12:00:00
create table t_login_user(
uid string,
login_date string
)row format delimited fields terminated by ",";
load data local inpath "/root/login_user.txt" overwrite into table t_login_user;
select * from t_login_user;
计算连续登陆2天的用户
第一种方式
查询连续登陆n天的用户,我们可以基于用户的登陆信息找到如下规律:
- 连续两天登陆:用户下次登陆时间 = 本次登陆日期的第二天
- 连续三天登陆:用户下下次登陆时间 = 本次登陆日期的第三天……
我们可以对用户ID进行分区,按照登陆时间排序,通过lead函数计算出用户下次的登陆时间,再通过日期函数计算出本次登陆后第二天的日期,两者相等即为连续两天登陆。
-- 去掉用户重复登陆的记录
select distinct uid,date_format(login_date,'yyyy-MM-dd') from t_login_user;
-- 在去掉用户重复登录的基础上,对用户分组,对登陆日期排序 计算如果连续登陆 那么下一次登陆的日期
with t1 as ( select distinct uid,date_format(login_date,'yyyy-MM-dd') as login_date from t_login_user )
select * ,
date_add(login_date,1) as next_date,
lead(login_date,1,0) over (partition by uid order by login_date) as next_login
from t1;
-- 查询下一次登陆日期 和 下一行记录相等的用户 就是连续登陆2天的用户
with t1 as ( select distinct uid,date_format(login_date,'yyyy-MM-dd') as login_date from t_login_user ),
t2 as (select *,
date_add(login_date,1) as next_date,
lead(login_date,1,0) over (partition by uid order by login_date) as next_login
from t1 )
select distinct uid from t2 where t2.next_date == t2.next_login;
-- 查询连续3天登陆的用户
with t1 as ( select distinct uid,date_format(login_date,'yyyy-MM-dd') as login_date from t_login_user ),
t2 as (select *,
date_add(login_date,2) as next_date,
lead(login_date,2,0) over (partition by uid order by login_date) as next_login
from t1 )
select distinct uid from t2 where t2.next_date == t2.next_login;
-- 查询连续N天登陆的用户(通用模板,其中“登陆日期”“用户”“N”为占位符)
select *,
       -- 本次登陆日期之后的第N天
       date_add(登陆日期, N-1) as next_date,
       -- 按照用户id分区,按照登陆日期排序,取往下第N-1行的数据
       lead(登陆日期, N-1, 0) over (partition by 用户 order by 登陆日期) as next_login
from t;
-- 查询连续登陆4天及以上的用户(用“登陆日期减行号”的思路,下面的“第二种方式”会详细展开)
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
t2 as (
    select uid,
           login_date,
           row_number() over (partition by uid order by login_date) as rn
    from t1
),
t3 as (
    select *, date_sub(login_date, rn) as interval_date
    from t2
)
select uid, count(1)
from t3
group by uid, interval_date
having count(1) >= 4;
第二种方式
-- 去掉当天重复登陆信息
select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date
from t_login_user;
-- 窗口函数用户分组,登陆日期排序,行号
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user)
select *,
row_number() over (partition by uid order by login_date) as rn
from t1;
-- 登陆日期-编号 = 间隙日期
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
t2 as (select *,
row_number() over (partition by uid order by login_date) as rn
from t1)
select *,
date_sub(login_date, rn) as interval_date
from t2;
-- 用户 间隙日期分组 计数 >=2 为连续两天登陆 >=n 为连续n天记录
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
t3 as (select *, date_sub(login_date, rn) as interval_date from t2)
select uid, count(1) as login_count
from t3
group by uid, interval_date
having count(1) >= 2;
分组topN
查询每个用户最高连续登陆天数
-- 查询每个用户连续登陆的天数
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
t3 as (select *, date_sub(login_date, rn) as interval_date from t2)
select uid, count(1) as login_count
from t3
group by uid, interval_date;
-- 分组 设置编号
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
t3 as (select *, date_sub(login_date, rn) as interval_date from t2),
t4 as (select uid, count(1) as login_count from t3 group by uid, interval_date)
select *, row_number() over (partition by uid order by login_count desc) as rn
from t4;
-- topn
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
t3 as (select *, date_sub(login_date, rn) as interval_date from t2),
t4 as (select uid, count(1) as login_count from t3 group by uid, interval_date),
t5 as (select *, row_number() over (partition by uid order by login_count desc) as rn from t4)
select *
from t5
where rn <= 1;
2 打地鼠游戏
uid,hit,m
1,1,0
1,2,1
1,3,1
1,4,1
1,5,0
1,6,0
1,7,1
2,1,1
2,2,1
2,3,1
2,4,1
2,5,1
3,1,1
3,2,1
3,3,1
3,4,0
3,5,0
3,6,1
3,7,0
3,8,1
create table tb_ds(
uid int , -- 用户名
hit int , -- 第几次打地鼠
m int -- 是否命中 1命中 0 未命中
)
row format delimited fields terminated by ',' ;
load data local inpath '/root/ds.txt' into table tb_ds ;
select * from tb_ds;
查询用户最大连续命中次数
-- 查询命中的记录
select *
from tb_ds
where m = 1;
-- 用户分组 记录编号
select uid, hit, row_number() over (partition by uid order by hit) as rn
from tb_ds
where m = 1;
-- hit-rn
with t1 as (select uid, hit, row_number() over (partition by uid order by hit) as rn from tb_ds where m = 1)
select *, (hit - rn) as sub
from t1;
-- 分组得到每个用户的连续击中
with t1 as (select uid, hit, row_number() over (partition by uid order by hit) as rn from tb_ds where m = 1),
t2 as (select *, (hit - rn) as sub from t1)
select uid, count(*) as hit_count
from t2
group by uid, sub;
-- 用户分组得到每个用户最大次数
with t1 as (select uid, hit, row_number() over (partition by uid order by hit) as rn from tb_ds where m = 1),
t2 as (select *, (hit-rn) as sub from t1 ),
t3 as (select uid, count(*) as hit_count from t2 group by uid, sub )
select uid, max(hit_count) as hit_count
from t3
group by uid;
3 WordCount
Why Studying History Matters
Studying a subject that you feel pointless is never a fun or easy task.
If you're study history, asking yourself the question "why is history important" is a very good first step.
History is an essential part of human civilization.
You will find something here that will arouse your interest, or get you thinking about the significance of history.
求每个单词出现的次数
(图示:WordCount 统计结果,原图 img/34_wc.png)
create table t_wc(
line string
)row format delimited lines terminated by "\n";
load data local inpath '/root/word.txt' overwrite into table t_wc;
select * from t_wc;
-- 使用正则表达式去掉特殊符号:将除单词字符、' 和空白字符以外的所有字符替换成空串
select regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', "") as word_line
from t_wc;
-- 使用split进行切割 将一行数据切割为一个数组
select split(regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', ""), "\\s+")
from t_wc;
-- 使用explode函数进行炸裂
select explode(split(regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', ""), "\\s+")) as word
from t_wc;
-- 分组得到每个单词的次数
with t1 as (select explode(split(regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', ""), "\\s+")) as word from t_wc)
select word, count(1) as word_count
from t1
group by word;
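在此基础上,如果想按出现次数从高到低排序,可以再加一个 order by(示意):
-- 统计结果按出现次数降序排列
with t1 as (select explode(split(regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', ""), "\\s+")) as word from t_wc)
select word, count(1) as word_count
from t1
group by word
order by word_count desc;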
3 json数据处理
JSON是数据存储和数据处理中最常见的结构化数据格式之一,很多场景下公司会将数据以JSON格式存储在HDFS中。构建数据仓库时,需要对JSON格式的数据进行处理和分析,这就需要在Hive中对JSON格式的数据进行解析读取。
{"movie":"1240","rate":"5","timeStamp":"978294260","uid":"4"}
{"movie":"2987","rate":"4","timeStamp":"978243170","uid":"5"}
{"movie":"2333","rate":"4","timeStamp":"978242607","uid":"5"}
{"movie":"1175","rate":"5","timeStamp":"978244759","uid":"5"}
{"movie":"39","rate":"3","timeStamp":"978245037","uid":"5"}
{"movie":"288","rate":"2","timeStamp":"978246585","uid":"5"}
{"movie":"2337","rate":"5","timeStamp":"978243121","uid":"5"}
{"movie":"1535","rate":"4","timeStamp":"978245513","uid":"5"}
{"movie":"1392","rate":"4","timeStamp":"978245645","uid":"5"}
1 函数处理json数据
Hive中提供了两个专门用于解析JSON字符串的函数:get_json_object,json_tuple,这两个函数都可以实现将JSON数据中的每个字段独立解析出来,构建成表。
建表
将一条JSON数据作为一个字符串整体加载处理
create table test_json(
str string
) ;
load data local inpath '/root/movie.txt' into table test_json ;
select * from test_json;
使用get_json_object 解析
select get_json_object(str, "$.movie") movie,
get_json_object(str, "$.rate") rate,
get_json_object(str, "$.timeStamp") ts,
get_json_object(str, "$.uid") uid
from test_json;
使用json_tuple函数解析
select json_tuple(str, "movie", "rate", "timeStamp", "uid") as (movie, rate, ts, uid)
from test_json;
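json_tuple 是一个UDTF,更常见的写法是配合 lateral view 使用(示意):
-- lateral view + json_tuple 解析JSON字段
select t.movie, t.rate, t.ts, t.uid
from test_json lateral view json_tuple(str, 'movie', 'rate', 'timeStamp', 'uid') t as movie, rate, ts, uid;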
2 JSONSerde处理
使用函数解析JSON时,是先把数据作为一个JSON字符串整体加载到表中,再通过JSON解析函数对字符串进行解析,灵活性比较高;但如果整个文件的每一行都是一条JSON,用起来就相对麻烦。为了简化对JSON文件的处理,Hive内置了一种专门用于解析JSON文件的SerDe解析器,建表时只要指定使用JsonSerDe,就会自动将JSON文件中的每个字段解析为对应的列。
create table test_json2
(
movie string,
rate string,
`timeStamp` string,
uid string
)
-- 指定json解析器解析
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;
load data local inpath '/root/movie.txt' into table test_json2 ;
select * from test_json2;
desc formatted test_json2;
3 多字节分隔符处理
Hive默认的序列化类是LazySimpleSerDe,它只支持使用单字节分隔符(char)来加载文本数据,例如逗号、制表符、空格等,默认分隔符为"\001"。针对不同文件的不同分隔符,我们可以在创建表时通过 row format delimited 指定文件中的分隔符,确保表中的每一列与文件中的每一列一一对应。
但是工作中有可能遇到特殊的数据
情况一:每一行数据的分隔符是多字节分隔符,例如:"||"、"--"等
01||周杰伦||中国||台湾||男||七里香
02||刘德华||中国||香港||男||笨小孩
情况二:数据的字段中包含了分隔符
192.168.88.134 [08/Nov/2020:10:44:32 +0800] "GET / HTTP/1.1" 404 951
192.168.88.100 [08/Nov/2020:10:44:33 +0800] "GET /hpsk_sdk/index.html HTTP/1.1" 200 328
如果遇到上面两种情况,LazySimpleSerDe是没有办法正确处理的,解析后的数据会有问题。解决方式其实有多种,这里我们选择使用RegexSerDe通过正则表达式加载。
- 除了使用最多的LazySimpleSerDe,Hive中内置了很多SerDe类;
- 官网地址:https://cwiki.apache.org/confluence/display/Hive/SerDe
- 多种SerDe用于解析和加载不同类型的数据文件,常用的有ORCSerDe、RegexSerDe、JsonSerDe等
情况一
01||周杰伦||中国||台湾||男||七里香
02||刘德华||中国||香港||男||笨小孩
03||汪 峰||中国||北京||男||光明
04||朴 树||中国||北京||男||那些花儿
05||许 巍||中国||陕西||男||故乡
create table singer
(
id string,
name string,
country string,
province string,
gender string,
works string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
("input.regex" = "([0-9]*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)");
load data local inpath '/root/singer.txt' into table singer;
select * from singer;
情况二
192.168.88.134 [08/Nov/2020:10:44:32 +0800] "GET / HTTP/1.1" 404 951
192.168.88.100 [08/Nov/2020:10:44:33 +0800] "GET /hpsk_sdk/index.html HTTP/1.1" 200 328
192.168.88.134 [08/Nov/2020:20:19:06 +0800] "GET / HTTP/1.1" 404 951
192.168.88.100 [08/Nov/2020:20:19:13 +0800] "GET /hpsk_sdk/demo4.html HTTP/1.1" 200 982
192.168.88.100 [08/Nov/2020:20:19:13 +0800] "GET /hpsk_sdk/js/analytics.js HTTP/1.1" 200 11095
192.168.88.100 [08/Nov/2020:20:19:23 +0800] "GET /hpsk_sdk/demo3.html HTTP/1.1" 200 1024
192.168.88.100 [08/Nov/2020:20:19:26 +0800] "GET /hpsk_sdk/demo2.html HTTP/1.1" 200 854
192.168.88.100 [08/Nov/2020:20:19:27 +0800] "GET /hpsk_sdk/demo.html HTTP/1.1" 200 485
192.168.88.134 [08/Nov/2020:20:26:51 +0800] "GET / HTTP/1.1" 404 951
192.168.88.134 [08/Nov/2020:20:29:08 +0800] "GET / HTTP/1.1" 404 951
create table t_log
(
ip string, --IP地址
stime string, --时间
method string, --请求方式
url string, --请求地址
policy string, --请求协议
stat string, --请求状态
body string --字节大小
)
--指定使用RegexSerde加载数据
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
--指定正则表达式
WITH SERDEPROPERTIES (
"input.regex" = "([^ ]*) ([^}]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9]*) ([^ ]*)"
);
load data local inpath '/root/a.log' into table t_log;
select * from t_log;
-- 时间转换
select
from_unixtime(unix_timestamp(stime, '[dd/MMM/yyyy:HH:mm:ss +0800]'),'yyyy-MM-dd HH:mm:ss')
from t_log;