Hive Functions 02


Window Functions

Window functions, also called windowing or OLAP functions, have one defining characteristic: their input values are taken from a "window" of one or more rows of the SELECT statement's result set.
A window function can be loosely described as a calculation function similar to an aggregate function. A regular aggregate combined through a GROUP BY clause, however, hides the individual rows being aggregated and ultimately outputs a single row per group; a window function, after aggregating, can still access the individual rows and attach some of their attributes to the result set.

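The examples below query a stu_mark table whose DDL the original never shows; a minimal sketch of a plausible schema, inferred from the columns used in the queries (sname, subject, score — types assumed):

-- Hypothetical DDL for the demo table used below
create table stu_mark
(
    sname   string, -- student name
    subject string, -- subject
    score   int     -- score in that subject
) row format delimited fields terminated by ',';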

select * from stu_mark;
-- Regular GROUP BY aggregation: total score per student
select sname, sum(score) from stu_mark group by sname;


-- Window-function version: keep every row and attach the per-student total
select sname,subject,score,sum(score) over (partition by sname) as total_score  from stu_mark;

1 Window Function Syntax

Function(arg1, ..., argn) OVER([PARTITION BY <...>] [ORDER BY <...>] [<window_expression>])


-- Function(arg1, ..., argn) can be any of the following:
    -- aggregate functions, e.g. sum, max, avg
    -- ranking functions, e.g. rank, row_number
    -- analytic functions, e.g. lead, lag, first_value
-- OVER [PARTITION BY <...>] works like GROUP BY: it splits the rows into groups, and each group can be thought of as a window
	-- if there is no PARTITION BY, all rows of the table form one group
-- [ORDER BY <...>] specifies how rows are sorted within each group; ASC and DESC are supported
-- [<window_expression>] specifies the range of rows to operate on within each window; the default is all rows of the window (a full example combining all three clauses follows below)
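Putting the three optional clauses together on the stu_mark table above — a minimal sketch (the frame clause here spells out the default explicitly):

-- Per-student running total of score, ordered by subject
select sname,
       subject,
       score,
       sum(score) over (partition by sname
                        order by subject
                        rows between unbounded preceding and current row) as running_score
from stu_mark;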

Using Window Functions

Create a table and load the sample data

name  cdate  money
jack,2017-01-01,10
tony,2017-01-02,15
jack,2017-02-03,23
tony,2017-01-04,29
jack,2017-01-05,46
jack,2017-04-06,42
tony,2017-01-07,50
jack,2017-01-08,55
mart,2017-04-08,62
mart,2017-04-09,68
neil,2017-05-10,12
mart,2017-04-11,75
neil,2017-06-12,80
mart,2017-04-13,94

create table t_orders
(
    name  string,
    cdate string,
    money double
) row format delimited fields terminated by ',';

load data local inpath '/root/orders.txt' into table t_orders;

select * from t_orders;
-- Total order amount per user
select name ,sum(money) from t_orders group by name;
-- Order count per month: count + GROUP BY, a regular aggregation
select substr(cdate,0,7) as month, count(1) from t_orders group by substr(cdate,0,7);

-- sum + window function: four usage patterns
-- sum(...) over()                                sum over every row of the table
-- sum(...) over(order by ...)                    running (cumulative) sum over the whole table
-- sum(...) over(partition by ...)                sum over all rows of the same group
-- sum(...) over(partition by ... order by ...)   running sum within each group

-- Total order amount across all users
-- sum(...) over(): sum over every row of the table
select * ,sum(money) over() as total_money  from t_orders;
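
The second pattern — sum(...) over(order by ...) — is not demonstrated in the original; a minimal sketch of a running total across the whole table:

-- Running total of all orders in date order (no partitioning)
select *, sum(money) over (order by cdate) as running_total from t_orders;
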
-- Total order amount per user, attached to each row
-- sum(...) over(partition by ...): sum over all rows of the same group
select * ,sum(money) over(partition by name) as user_money  from t_orders;

-- Per-user order totals, accumulating in date order
-- sum(...) over(partition by ... order by ...): running sum within each group
select name,money,sum(money) over(partition by name order by cdate) as user_money  from t_orders;
+-------+--------+-------------+
| name  | money  | user_money  |
+-------+--------+-------------+
| jack  | 10.0   | 10.0        |
| jack  | 46.0   | 56.0        |
| jack  | 55.0   | 111.0       |
| jack  | 23.0   | 134.0       |
| jack  | 42.0   | 176.0       |
| mart  | 62.0   | 62.0        |
| mart  | 68.0   | 130.0       |
| mart  | 75.0   | 205.0       |
| mart  | 94.0   | 299.0       |
| neil  | 12.0   | 12.0        |
| neil  | 80.0   | 92.0        |
| tony  | 15.0   | 15.0        |
| tony  | 29.0   | 44.0        |
| tony  | 50.0   | 94.0        |
+-------+--------+-------------+
-- Monthly order totals, accumulated day by day within each month
select concat(substr(cdate,6,2),'月') as month, cdate, sum(money) over (partition by substr(cdate,0,7) order by cdate) from t_orders;
+--------+-------------+---------------+
| month  |    cdate    | sum_window_0  |
+--------+-------------+---------------+
| 01月    | 2017-01-01  | 10.0          |
| 01月    | 2017-01-02  | 25.0          |
| 01月    | 2017-01-04  | 54.0          |
| 01月    | 2017-01-05  | 100.0         |
| 01月    | 2017-01-07  | 150.0         |
| 01月    | 2017-01-08  | 205.0         |
| 02月    | 2017-02-03  | 23.0          |
| 04月    | 2017-04-06  | 42.0          |
| 04月    | 2017-04-08  | 104.0         |
| 04月    | 2017-04-09  | 172.0         |
| 04月    | 2017-04-11  | 247.0         |
| 04月    | 2017-04-13  | 341.0         |
| 05月    | 2017-05-10  | 12.0          |
| 06月    | 2017-06-12  | 80.0          |
+--------+-------------+---------------+

2 Window Expressions

  • With the full sum(...) over (partition by ... order by ...) syntax, the aggregation is cumulative; the default behavior is to accumulate from the first row of the partition to the current row (an explicit-frame equivalent is sketched right after this list).
  • A window expression gives us control over that row range — for example, 2 rows before the current row or 3 rows after it.
 The keyword is rows between, with the following options:
- preceding: before the current row
- following: after the current row
- current row: the current row itself
- unbounded: the partition boundary
- unbounded preceding: from the first row of the partition
- unbounded following: to the last row of the partition
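
One detail worth spelling out (Hive's documented default, not stated in the original): when ORDER BY is present and no frame is given, the frame defaults to range between unbounded preceding and current row, so the two queries below return the same result:

-- Implicit default frame
select name, money, sum(money) over (partition by name order by cdate) as s1 from t_orders;
-- Equivalent explicit frame
select name, money,
       sum(money) over (partition by name order by cdate
                        range between unbounded preceding and current row) as s1
from t_orders;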

Code examples

-- From the previous row to the current row
select name,
       money,
       sum(money) over (partition by name order by cdate rows between 1 preceding and current row ) as user_money
from t_orders;

-- From the current row to the next row
select name,
       money,
       sum(money) over (partition by name order by cdate rows between current row and 1 following ) as user_money
from t_orders;

-- From the previous row to the next row
select name,
       money,
       sum(money) over (partition by name order by cdate rows between 1 preceding and 1 following ) as user_money
from t_orders;
-- From the current row to the last row
select name,
       money,
       sum(money)
           over (partition by name order by cdate rows between current row and unbounded following ) as user_money
from t_orders;

-- From the first row to the last row: all rows in the group
select name,
       money,
       sum(money)
           over (partition by name order by cdate rows between unbounded preceding and unbounded following ) as user_money
from t_orders;
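
The prose above mentions ranges such as "2 rows before"; as an illustrative extension (not in the original), a moving average over the current row and the two preceding rows:

-- Per-user moving average over a 3-row window
select name,
       cdate,
       money,
       avg(money) over (partition by name order by cdate
                        rows between 2 preceding and current row) as moving_avg
from t_orders;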

3 Numbering Functions

row_number: within each partition, assigns every row a unique sequence number starting from 1, strictly increasing, ignoring ties;
rank: within each partition, assigns sequence numbers starting from 1; tied rows share a number and occupy the following positions (ranks are skipped after ties);
dense_rank: within each partition, assigns sequence numbers starting from 1; tied rows share a number and no positions are skipped;
select name,
       money,
       row_number() over (partition by name order by money) as r_num,
       rank() over (partition by name order by money)       as rank,
       dense_rank() over (partition by name order by money) as ds_rank
from t_orders;

(Figure: row_number / rank / dense_rank output)

-- For each user: the date and amount of their highest-value order
select name,
       cdate,
       money,
       row_number() over (partition by name order by money desc) as r_num
from t_orders;

with t1 as (select name,
                   cdate,
                   money,
                   row_number() over (partition by name order by money desc) as r_num
            from t_orders)
select name, cdate, money
from t1
where r_num = 1;

+-------+-------------+--------+
| name  |    cdate    | money  |
+-------+-------------+--------+
| jack  | 2017-01-08  | 55.0   |
| mart  | 2017-04-13  | 94.0   |
| neil  | 2017-06-12  | 80.0   |
| tony  | 2017-01-07  | 50.0   |
+-------+-------------+--------+
ntile function
ntile distributes the rows of each group into the specified number of buckets and assigns every row the number of its bucket.
If the rows cannot be divided evenly, lower-numbered buckets are filled first, and no two buckets differ in size by more than 1.
A requirement like this sometimes comes up: the sorted data is split into three parts, and the business only cares about one of them — how do you pull out the middle third?
That is exactly what ntile is for (a sketch follows the demo below).
select name,
       money,
       ntile(3) over (partition by name order by money) as r_num
from t_orders;
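
Answering the "middle third" question raised above — a minimal sketch (not in the original) that keeps only bucket 2 of 3:

-- Split each user's orders into 3 buckets by amount and keep the middle one
with t1 as (select name, money,
                   ntile(3) over (partition by name order by money) as bucket
            from t_orders)
select name, money
from t1
where bucket = 2;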

(Figure: ntile(3) output)

-- Orders in the earliest 20% of dates: ntile(5), then keep bucket 1
select *, ntile(5) over (order by cdate) as n
from t_orders;

with t1 as (select *,
                   ntile(5) over (order by cdate) as n
            from t_orders)
select name, cdate, money
from t1
where n = 1;

+-------+-------------+--------+
| name  |    cdate    | money  |
+-------+-------------+--------+
| jack  | 2017-01-01  | 10.0   |
| tony  | 2017-01-02  | 15.0   |
| tony  | 2017-01-04  | 29.0   |
+-------+-------------+--------+

4 Analytic Window Functions

  • lag(col, n, default) returns the value from the row n positions before the current row in the window.

    The first argument is the column name; the second is the offset n (optional, default 1); the third is the default value (returned when the row n positions back does not exist or is NULL; if omitted, NULL is returned);

  • lead(col, n, default) returns the value from the row n positions after the current row in the window.

    The first argument is the column name; the second is the offset n (optional, default 1); the third is the default value (returned when the row n positions ahead does not exist or is NULL; if omitted, NULL is returned);

  • first_value returns the first value in the partition, after sorting, up to the current row

  • last_value returns the last value in the partition, after sorting, up to the current row

-- lag: n rows back
select name,
       money,
       lag(money, 1) over (partition by name order by money)    as num,
       lag(money, 1, 0) over (partition by name order by money) as num
from t_orders;

-- lead: n rows ahead
select name,
       money,
       lead(money, 1) over (partition by name order by money)    as num,
       lead(money, 1, 0) over (partition by name order by money) as num
from t_orders;

-- first_value: first value up to the current row
select name,
       money,
       first_value(money) over (partition by name order by money) as num
from t_orders;
-- last_value: last value up to the current row — with the default frame, that is the current row itself
select name,
       money,
       last_value(money) over (partition by name order by money ) as num
from t_orders;
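
Because the default frame stops at the current row, last_value as written simply echoes the current row. To get the true last value of each group, widen the frame to the whole partition — a sketch, not in the original:

-- last_value over the full partition: every row sees the group's final value
select name,
       money,
       last_value(money) over (partition by name order by money
                               rows between unbounded preceding and unbounded following) as num
from t_orders;
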
-- Query each customer's previous purchase date
select name,
       cdate,
       lag(cdate, 1) over (partition by name order by cdate) as last_date
from t_orders;
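
As a natural follow-up (not part of the original), the number of days since the previous purchase can be derived from the same lag with datediff:

-- Days elapsed since each customer's previous purchase (NULL for the first one)
with t1 as (select name,
                   cdate,
                   lag(cdate, 1) over (partition by name order by cdate) as last_date
            from t_orders)
select name, cdate, datediff(cdate, last_date) as days_since_last
from t1;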

2 Query Exercises

1 Consecutive Logins

uid login_date
001,2017-02-05 12:00:00
001,2017-02-05 14:00:00
001,2017-02-06 13:00:00
001,2017-02-07 12:00:00
001,2017-02-08 12:00:00
001,2017-02-10 14:00:00
002,2017-02-05 13:00:00
002,2017-02-06 12:00:00
002,2017-02-06 14:00:00
002,2017-02-08 12:00:00
002,2017-02-09 16:00:00
002,2017-02-10 12:00:00
003,2017-01-31 13:00:00
003,2017-01-31 12:00:00
003,2017-02-01 12:00:00
004,2017-02-02 12:00:00
004,2017-02-03 12:00:00
004,2017-02-10 12:00:00
004,2017-03-01 12:00:00


create table  t_login_user(
                              uid string,
                              login_date string
)row format delimited fields terminated by ",";

load data local inpath "/root/login_user.txt" overwrite into table t_login_user;

select * from t_login_user;

Find users who logged in on 2 consecutive days

Method 1

-- Find users who logged in on n consecutive days. The login data yields the following rule:
-- 2 consecutive days: the user's next login date = the day after this login
-- 3 consecutive days: the user's login after next = two days after this login, and so on
-- So: partition by user id, order by login date, use lead to fetch the next login date, and use a date function to compute the day after the current login; if the two are equal, the user logged in on two consecutive days

-- Deduplicate repeated logins on the same day
select distinct uid,date_format(login_date,'yyyy-MM-dd') from t_login_user;
-- On the deduplicated data, partition by user, order by login date, and compute the date the next login would fall on if the logins were consecutive
with t1 as ( select distinct uid,date_format(login_date,'yyyy-MM-dd') as login_date from t_login_user )
    select * ,
           date_add(login_date,1) as next_date,
           lead(login_date,1,0) over (partition by uid order by login_date) as next_login
from t1;
-- Users whose computed next_date equals the actual next login are the ones who logged in on 2 consecutive days
with t1 as ( select distinct uid,date_format(login_date,'yyyy-MM-dd') as login_date from t_login_user ),
     t2 as (select *,
    date_add(login_date,1) as next_date,
    lead(login_date,1,0) over (partition by uid order by login_date) as next_login
    from t1 )
select distinct uid from t2 where t2.next_date == t2.next_login;

-- Find users who logged in on 3 consecutive days
with t1 as ( select distinct uid,date_format(login_date,'yyyy-MM-dd') as login_date from t_login_user ),
     t2 as (select *,
    date_add(login_date,2) as next_date,
    lead(login_date,2,0) over (partition by uid order by login_date) as next_login
    from t1 )
select distinct uid from t2 where t2.next_date == t2.next_login;

-- Find users who logged in on N consecutive days (generic pattern; login_date, uid, and t are placeholders)
select *,
    -- the date N-1 days after this login
    date_add(login_date, N-1) as next_date,
    -- partition by user id, order by login date, take the login date N-1 rows ahead
    lead(login_date, N-1, 0) over (partition by uid order by login_date) as next_login
    from t;
-- Find users who logged in on 4 or more consecutive days (this uses the row-number gap trick explained in Method 2 below)
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (
         select uid,
                date_format(login_date, 'yyyy-MM-dd')                     as interval_date,
                row_number() over (partition by uid order by login_date ) as rn
         from t1
     ),
     t3 as (
         select *, date_sub(interval_date, rn) as login_date
         from t2
     )
select uid, count(1)
from t3
group by uid, login_date
having count(1) >= 4;

Method 2

-- Deduplicate repeated logins on the same day
select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date
from t_login_user;
-- Window function: partition by user, order by login date, assign row numbers
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user)
select *,
       row_number() over (partition by uid order by login_date) as rn
from t1;

-- login date minus row number = an anchor date that stays constant within a consecutive run
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (select *,
                   row_number() over (partition by uid order by login_date) as rn
            from t1)
select *,
       date_sub(login_date, rn) as interval_date
from t2;

-- Group by user and anchor date and count: count >= 2 means 2 consecutive days; count >= n means n consecutive days
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
     t3 as (select *, date_sub(login_date, rn) as interval_date from t2)
select uid, count(1) as login_count
from t3
group by uid, interval_date
having count(1) >= 2;

Grouped Top-N

Find each user's longest consecutive-login streak

-- Length of each consecutive-login run per user
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
     t3 as (select *, date_sub(login_date, rn) as interval_date from t2)
select uid, count(1) as login_count
from t3
group by uid, interval_date

-- Number each user's runs, longest first
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
     t3 as (select *, date_sub(login_date, rn) as interval_date from t2),
     t4 as (select uid, count(1) as login_count from t3 group by uid, interval_date)
select *, row_number() over (partition by uid order by login_count desc) as rn
from t4;

-- Top-N: keep each user's longest run (a rank() variant for ties is sketched after this block)
with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
     t3 as (select *, date_sub(login_date, rn) as interval_date from t2),
     t4 as (select uid, count(1) as login_count from t3 group by uid, interval_date),
     t5 as (select *, row_number() over (partition by uid order by login_count desc) as rn from t4)
select *
from t5
where rn <= 1;
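
row_number keeps exactly one row per user even when two runs tie for the longest; if tied runs should all be kept, swap in rank() — a sketch, not in the original:

with t1 as (select distinct uid, date_format(login_date, 'yyyy-MM-dd') as login_date from t_login_user),
     t2 as (select *, row_number() over (partition by uid order by login_date) as rn from t1),
     t3 as (select *, date_sub(login_date, rn) as interval_date from t2),
     t4 as (select uid, count(1) as login_count from t3 group by uid, interval_date),
     -- rank() lets tied longest runs share rank 1, so both are returned
     t5 as (select *, rank() over (partition by uid order by login_count desc) as rk from t4)
select *
from t5
where rk <= 1;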

2 Whac-A-Mole Game

uid,hit,m
1,1,0
1,2,1
1,3,1
1,4,1
1,5,0
1,6,0
1,7,1
2,1,1
2,2,1
2,3,1
2,4,1
2,5,1
3,1,1
3,2,1
3,3,1
3,4,0
3,5,0
3,6,1
3,7,0
3,8,1

create table tb_ds(
      uid int ,  -- player id
      hit int ,  -- attempt number (the n-th whack)
      m int      -- whether it hit: 1 = hit, 0 = miss
)
row format delimited fields terminated by ','  ;
load data local inpath '/root/ds.txt' into table tb_ds ;

select  * from tb_ds;

Find each player's longest run of consecutive hits

-- Keep only the records that hit
select *
from tb_ds
where m = 1;

-- Partition by player and number the hit records
select uid, hit, row_number() over (partition by uid order by hit) as rn
from tb_ds
where m = 1;

-- hit - rn is constant within a run of consecutive hits
with t1 as (select uid, hit, row_number() over (partition by uid order by hit) as rn from tb_ds where m = 1)
select *, (hit - rn) as sub
from t1;

-- Group to get the length of each player's consecutive-hit runs
with t1 as (select uid, hit, row_number() over (partition by uid order by hit) as rn from tb_ds where m = 1),
     t2 as (select *, (hit - rn) as sub from t1)
select uid, count(*) as hit_count
from t2
group by uid, sub;

-- Group by player to get each player's maximum run length
with t1 as (select uid, hit, row_number() over (partition by uid order by hit) as rn from tb_ds where m = 1),
    t2 as (select *, (hit-rn) as sub from t1 ),
    t3 as (select uid, count(*) as hit_count from t2 group by uid, sub )
select uid, max(hit_count) as hit_count
from t3
group by uid;

3 WordCount

Why Studying History Matters
Studying a subject that you feel pointless is never a fun or easy task.
If you're study history, asking yourself the question "why is history important" is a very good first step.
History is an essential part of human civilization.
You will find something here that will arouse your interest, or get you thinking about the significance of history.

Count the occurrences of each word

(Figure: word-count result)

create table t_wc(
    line string
)row format delimited lines terminated by "\n";

load data local inpath '/root/word.txt' overwrite into table t_wc;
select * from t_wc;
-- Use a regex to strip special symbols: replace every character other than word characters, apostrophes, and whitespace with the empty string
select regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', "") as word_line
from t_wc;
-- Use split to cut each line into an array of words (splitting on runs of whitespace)
select split(regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', ""), "\\s+")
from t_wc;
-- Use the explode function to flatten the array into one row per word
select explode(split(regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', ""), "\\s+")) as word
from t_wc;
-- Group by word to count each word's occurrences
with t1 as (select explode(split(regexp_replace(line, '[^a-zA-Z_0-9\'\\s]', ""), "\\s+")) as word from t_wc)
select word, count(1) as word_count
from t1
group by word;

3 JSON Data Processing

JSON is one of the most common structured formats for data storage and processing. Companies often store data on HDFS as JSON, and when building a data warehouse that JSON has to be processed and analyzed — which means Hive needs a way to parse and read JSON-formatted data.

{"movie":"1240","rate":"5","timeStamp":"978294260","uid":"4"}
{"movie":"2987","rate":"4","timeStamp":"978243170","uid":"5"}
{"movie":"2333","rate":"4","timeStamp":"978242607","uid":"5"}
{"movie":"1175","rate":"5","timeStamp":"978244759","uid":"5"}
{"movie":"39","rate":"3","timeStamp":"978245037","uid":"5"}
{"movie":"288","rate":"2","timeStamp":"978246585","uid":"5"}
{"movie":"2337","rate":"5","timeStamp":"978243121","uid":"5"}
{"movie":"1535","rate":"4","timeStamp":"978245513","uid":"5"}
{"movie":"1392","rate":"4","timeStamp":"978245645","uid":"5"}

1 Parsing JSON with Functions

Hive provides two functions dedicated to parsing JSON strings: get_json_object and json_tuple. Both can extract each field of the JSON data independently, turning the records into a table.

Create the table

Each JSON record is treated as a single string:
create table test_json(
	str string
) ;
load  data  local inpath '/root/movie.txt' into table test_json ;

select * from test_json;

Parse with get_json_object

select get_json_object(str, "$.movie")     movie,
       get_json_object(str, "$.rate")      rate,
       get_json_object(str, "$.timeStamp") ts,
       get_json_object(str, "$.uid")       uid
from test_json;

Parse with the json_tuple function

select json_tuple(str, "movie", "rate", "timeStamp", "uid") as (movie, rate, ts, uid)
from test_json;
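
json_tuple is a UDTF, so it parses each string only once for all requested keys; combined with lateral view it can also sit alongside other columns — a sketch, not in the original:

-- Same parse via lateral view
select t.movie, t.rate, t.ts, t.uid
from test_json
         lateral view json_tuple(str, 'movie', 'rate', 'timeStamp', 'uid') t as movie, rate, ts, uid;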

2 Parsing with JsonSerDe

When JSON is parsed with functions, the data is loaded into the table as a plain JSON string and then picked apart by the parsing functions. That is quite flexible, but when the entire file is JSON it becomes cumbersome to use. To simplify handling JSON files, Hive has a built-in SerDe dedicated to parsing them: declare the JsonSerDe when creating the table, and every column is parsed out of the JSON file automatically.

create table test_json2
(
    movie       string,
    rate        string,
    `timeStamp` string,
    uid         string
)
-- specify the JSON SerDe to parse each record
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;

load  data  local inpath '/root/movie.txt' into table test_json2 ;

select * from test_json2;
desc formatted  test_json2;
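
One environment caveat (an assumption, not stated in the original): org.apache.hive.hcatalog.data.JsonSerDe ships in the hive-hcatalog-core jar, which some Hive installations do not put on the classpath by default; in that case it must be added before the table is used, e.g.:

-- Hypothetical path; adjust to where hive-hcatalog-core is installed
ADD JAR /opt/hive/hcatalog/share/hcatalog/hive-hcatalog-core.jar;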

3 Handling Multi-Byte Delimiters

Hive's default serialization class is LazySimpleSerDe, which only supports single-byte (char) delimiters for loading text data, such as comma, tab, or space; the default delimiter is '\001'. Depending on a file's delimiter, we can specify it with row format delimited when creating the table, ensuring every table column maps one-to-one onto a column in the file.

In practice, however, you may run into special data like the cases below.

Case 1: the row's field delimiter is a multi-byte sequence, e.g. "||" or "--"

01||周杰伦||中国||台湾||男||七里香
02||刘德华||中国||香港||男||笨小孩

Case 2: the fields themselves contain the delimiter

192.168.88.134 [08/Nov/2020:10:44:32 +0800] "GET / HTTP/1.1" 404 951
192.168.88.100 [08/Nov/2020:10:44:33 +0800] "GET /hpsk_sdk/index.html HTTP/1.1" 200 328

Neither case can be handled by LazySimpleSerDe — the loaded data comes out wrong. There are several ways to solve this; here we choose RegexSerDe, which loads rows by matching a regular expression.

  • Besides LazySimpleSerDe, the most widely used one, Hive has many built-in SerDe classes;
  • Official documentation: https://cwiki.apache.org/confluence/display/Hive/SerDe
  • Different SerDes parse and load different kinds of data files; commonly used ones include ORCSerDe, RegexSerDe, and JsonSerDe

Case 1

01||周杰伦||中国||台湾||男||七里香
02||刘德华||中国||香港||男||笨小孩
03||汪  峰||中国||北京||男||光明
04||朴  树||中国||北京||男||那些花儿
05||许  巍||中国||陕西||男||故乡

create table singer
(
    id       string,
    name     string,
    country  string,
    province string,
    gender   string,
    works    string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
        ("input.regex" = "([0-9]*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)\\|\\|(.*)");

load data local inpath '/root/singer.txt' into table singer;
select * from singer;

Case 2

192.168.88.134 [08/Nov/2020:10:44:32 +0800] "GET / HTTP/1.1" 404 951
192.168.88.100 [08/Nov/2020:10:44:33 +0800] "GET /hpsk_sdk/index.html HTTP/1.1" 200 328
192.168.88.134 [08/Nov/2020:20:19:06 +0800] "GET / HTTP/1.1" 404 951
192.168.88.100 [08/Nov/2020:20:19:13 +0800] "GET /hpsk_sdk/demo4.html HTTP/1.1" 200 982
192.168.88.100 [08/Nov/2020:20:19:13 +0800] "GET /hpsk_sdk/js/analytics.js HTTP/1.1" 200 11095
192.168.88.100 [08/Nov/2020:20:19:23 +0800] "GET /hpsk_sdk/demo3.html HTTP/1.1" 200 1024
192.168.88.100 [08/Nov/2020:20:19:26 +0800] "GET /hpsk_sdk/demo2.html HTTP/1.1" 200 854
192.168.88.100 [08/Nov/2020:20:19:27 +0800] "GET /hpsk_sdk/demo.html HTTP/1.1" 200 485
192.168.88.134 [08/Nov/2020:20:26:51 +0800] "GET / HTTP/1.1" 404 951
192.168.88.134 [08/Nov/2020:20:29:08 +0800] "GET / HTTP/1.1" 404 951
create table t_log
(
    ip     string, -- IP address
    stime  string, -- timestamp
    method string, -- request method
    url    string, -- request path
    policy string, -- protocol
    stat   string, -- status code
    body   string  -- response size in bytes
)
    -- load with RegexSerDe
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
        -- regular expression whose capture groups map to the columns in order
        WITH SERDEPROPERTIES (
        "input.regex" = "([^ ]*) ([^}]*) ([^ ]*) ([^ ]*) ([^ ]*) ([0-9]*) ([^ ]*)"
        );

load data local inpath '/root/a.log' into table t_log;

select * from t_log;
-- Convert the log timestamp into a standard datetime string
select
       from_unixtime(unix_timestamp(stime, '[dd/MMM/yyyy:HH:mm:ss +0800]'),'yyyy-MM-dd HH:mm:ss')
from t_log;
