Exploring Whether Hive Converts HQL to MapReduce Programs


Contents

Prerequisites

Data Preparation

Exploring Whether HQL Is Converted to a MapReduce Program

1. Set hive.fetch.task.conversion=none

2. Set hive.fetch.task.conversion=minimal

3. Set hive.fetch.task.conversion=more


Prerequisites

Hive installed in a Linux environment. The version used in this test is Hive 2.3.6; for installation and configuration, see a Hive installation guide.

Data Preparation

Create the Hive table:

hive> create table employee_3(
  name           STRING,
  salary         FLOAT,
  subordinates   ARRAY<STRING>,
  deductions     MAP<STRING,FLOAT>,
  address        STRUCT<street:STRING, city:STRING, state:STRING, zip:INT>)
row format delimited fields terminated by '\t'
collection items terminated by ','
map keys terminated by ':';

Prepare a local data file:

[hadoop@node1 ~]$ vim emp3.txt
Zhangsan	3000	li1,li2,li3	cd:30,zt:50,sw:100	huayanlu,Guiyang,China,550025
Lisi	4000	w1,w2,w3	cd:10,zt:40,sw:33	changlingjiedao,Guiyang,China,550081
Zhangsan	3000	li1,li2,li3	cd:30,zt:50,sw:100	huayanlu,Guiyang,China,550025
Lisi	4000	w1,w2,w3	cd:10,zt:40,sw:33	changlingjiedao,Guiyang,China,550081

Load the data:

hive> load data local inpath 'emp3.txt' into table employee_3;

View the data:

hive> select * from employee_3;
OK
Zhangsan	3000.0	["li1","li2","li3"]	{"cd":30.0,"zt":50.0,"sw":100.0}       {"street":"huayanlu","city":"Guiyang","state":"China","zip":550025}
Lisi	4000.0	["w1","w2","w3"]	{"cd":10.0,"zt":40.0,"sw":33.0}	{"street":"changlingjiedao","city":"Guiyang","state":"China","zip":550081}
Zhangsan	3000.0	["li1","li2","li3"]	{"cd":30.0,"zt":50.0,"sw":100.0}       {"street":"huayanlu","city":"Guiyang","state":"China","zip":550025}
Lisi	4000.0	["w1","w2","w3"]	{"cd":10.0,"zt":40.0,"sw":33.0}	{"street":"changlingjiedao","city":"Guiyang","state":"China","zip":550081}
Time taken: 0.194 seconds, Fetched: 4 row(s)

Notice that select * returns its results immediately; it is not converted to a MapReduce program.

So when does a query actually trigger a MapReduce job, and what determines this?

Exploring Whether HQL Is Converted to a MapReduce Program

Examine the hive-default.xml.template file:

[hadoop@node1 ~]$ cd $HIVE_HOME/conf 
[hadoop@node1 conf]$ ls
beeline-log4j2.properties.template    hive-site.xml
hive-default.xml.template             ivysettings.xml
hive-env.sh.template                  llap-cli-log4j2.properties.template
hive-exec-log4j2.properties.template  llap-daemon-log4j2.properties.template
hive-log4j2.properties.template       parquet-logging.properties
[hadoop@node1 conf]$ vim hive-default.xml.template 

In vim, type /task.conversion to search for the task.conversion-related setting:

<property>
    <name>hive.fetch.task.conversion</name>
    <value>more</value>
    <description>
      Expects one of [none, minimal, more].
      Some select queries can be converted to single FETCH task minimizing latency.
      Currently the query should be single sourced not having any subquery and should not have
      any aggregations or distincts (which incurs RS), lateral views and joins.
      0. none : disable hive.fetch.task.conversion
      1. minimal : SELECT STAR, FILTER on partition columns, LIMIT only
      2. more    : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)
    </description>
  </property>

As shown, the hive.fetch.task.conversion property defaults to more. Besides more, it can also be set to none or minimal.

"fetch" here means retrieving data directly: certain HQL statements do not need MapReduce at all. Hive reads the data files straight from the table's storage directory and returns the rows to the client through a Fetch task.

Launching a MapReduce job always incurs system overhead. To avoid it, starting with Hive 0.10.0, simple non-aggregating statements such as select <col> from <table> limit n no longer start a MapReduce job; the data is retrieved directly by a Fetch task.
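The set command used later in this article only changes the current session. To make the mode persist across sessions, the property could also be placed in hive-site.xml; a minimal sketch (the value shown is just the default, pick whichever mode you want):

```xml
<!-- hive-site.xml: persist the fetch-task conversion mode -->
<property>
  <name>hive.fetch.task.conversion</name>
  <value>more</value>
</property>
```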

For example, with select * from user_table; Hive can simply read the files under user_table's storage directory and print the query results to the console.
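The decision rules of the three modes can be summarized as a small sketch. This is NOT Hive source code, just an illustrative model of the behavior described in the property's documentation (function and parameter names are invented for the example):

```python
# Illustrative model of hive.fetch.task.conversion (not Hive internals).
def can_use_fetch(mode: str, *, is_select_star: bool = False,
                  filters_partition_cols_only: bool = True,
                  has_aggregation: bool = False) -> bool:
    """Return True if a simple single-table query can run as a Fetch task."""
    if mode == "none":
        return False          # all queries (except DDL like desc) go through MR
    if has_aggregation:
        return False          # aggregations/distincts always need MapReduce
    if mode == "minimal":
        # only SELECT *, filters on partition columns, and LIMIT qualify
        return is_select_star and filters_partition_cols_only
    if mode == "more":
        # any simple SELECT/FILTER/LIMIT on a single table qualifies
        return True
    raise ValueError(f"unknown mode: {mode}")
```

For instance, can_use_fetch("minimal", is_select_star=False) is False, matching the select salary ... query later in this article that launches a MapReduce job under minimal.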

1. Set hive.fetch.task.conversion=none

Official description:

none : disable hive.fetch.task.conversion

This disables the fetch optimization entirely: Hive never reads table data directly, and every query result must be produced by a MapReduce job. Only metadata operations such as desc still bypass MapReduce.

Set hive.fetch.task.conversion=none:

hive> set hive.fetch.task.conversion=none;

Test desc; it does not launch a MapReduce job:

hive> desc employee_3;
OK
name                	string              	                    
salary              	float               	                    
subordinates        	array<string>       	                    
deductions          	map<string,float>   	                    
address             	struct<street:string,city:string,state:string,zip:int>                     
Time taken: 0.187 seconds, Fetched: 5 row(s)

Test another operation, for example select *. The execution log shows it now runs as a MapReduce job (map phase only, no reduce phase):

hive> select * from employee_3;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20230416115907_93e4dc77-02cb-4caf-a16b-24749a747bde
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1681614461744_0002, Tracking URL = http://node1:8088/proxy/application_1681614461744_0002/
Kill Command = /home/hadoop/soft/hadoop/bin/hadoop job  -kill job_1681614461744_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2023-04-16 11:59:17,681 Stage-1 map = 0%,  reduce = 0%
2023-04-16 11:59:26,347 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.18 sec
MapReduce Total cumulative CPU time: 4 seconds 180 msec
Ended Job = job_1681614461744_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 4.18 sec   HDFS Read: 5579 HDFS Write: 459 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 180 msec
OK
Zhangsan	3000.0	["li1","li2","li3"]	{"cd":30.0,"zt":50.0,"sw":100.0}       {"street":"huayanlu","city":"Guiyang","state":"China","zip":550025}
Lisi	4000.0	["w1","w2","w3"]	{"cd":10.0,"zt":40.0,"sw":33.0}	{"street":"changlingjiedao","city":"Guiyang","state":"China","zip":550081}
Zhangsan	3000.0	["li1","li2","li3"]	{"cd":30.0,"zt":50.0,"sw":100.0}       {"street":"huayanlu","city":"Guiyang","state":"China","zip":550025}
Lisi	4000.0	["w1","w2","w3"]	{"cd":10.0,"zt":40.0,"sw":33.0}	{"street":"changlingjiedao","city":"Guiyang","state":"China","zip":550081}
Time taken: 20.959 seconds, Fetched: 4 row(s)

The job can also be viewed in a browser through the YARN web UI on port 8088 (screenshot omitted).

2. Set hive.fetch.task.conversion=minimal

Official description:

minimal : SELECT STAR, FILTER on partition columns, LIMIT only

With minimal, only select *, filters on partition columns, and limit queries can use the fetch path (desc, as a metadata operation, also skips MapReduce). Every other query runs as a MapReduce job.

hive> set hive.fetch.task.conversion=minimal;

Test desc and select *; both return results directly without MapReduce:

hive> desc employee_3;
OK
name                	string              	                    
salary              	float               	                    
subordinates        	array<string>       	                    
deductions          	map<string,float>   	                    
address             	struct<street:string,city:string,state:string,zip:int>                     
Time taken: 0.044 seconds, Fetched: 5 row(s)

hive> select * from employee_3;
OK
Zhangsan	3000.0	["li1","li2","li3"]	{"cd":30.0,"zt":50.0,"sw":100.0}       {"street":"huayanlu","city":"Guiyang","state":"China","zip":550025}
Lisi	4000.0	["w1","w2","w3"]	{"cd":10.0,"zt":40.0,"sw":33.0}	{"street":"changlingjiedao","city":"Guiyang","state":"China","zip":550081}
Zhangsan	3000.0	["li1","li2","li3"]	{"cd":30.0,"zt":50.0,"sw":100.0}       {"street":"huayanlu","city":"Guiyang","state":"China","zip":550025}
Lisi	4000.0	["w1","w2","w3"]	{"cd":10.0,"zt":40.0,"sw":33.0}	{"street":"changlingjiedao","city":"Guiyang","state":"China","zip":550081}
Time taken: 0.215 seconds, Fetched: 4 row(s)

hive> select * from employee_3 limit 1;
OK
Zhangsan	3000.0	["li1","li2","li3"]	{"cd":30.0,"zt":50.0,"sw":100.0}       {"street":"huayanlu","city":"Guiyang","state":"China","zip":550025}
Time taken: 0.168 seconds, Fetched: 1 row(s)

Test another query; this one runs as a MapReduce job:

hive> select salary from employee_3 where name in ("Lisi");
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20230416120829_de5cc03b-6736-45ce-98e4-aa2bc0446313
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1681614461744_0003, Tracking URL = http://node1:8088/proxy/application_1681614461744_0003/
Kill Command = /home/hadoop/soft/hadoop/bin/hadoop job  -kill job_1681614461744_0003
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2023-04-16 12:08:41,741 Stage-1 map = 0%,  reduce = 0%
2023-04-16 12:08:52,660 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 5.68 sec
MapReduce Total cumulative CPU time: 5 seconds 680 msec
Ended Job = job_1681614461744_0003
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1   Cumulative CPU: 5.68 sec   HDFS Read: 5404 HDFS Write: 125 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 680 msec
OK
4000.0
4000.0
Time taken: 23.937 seconds, Fetched: 2 row(s)

3. Set hive.fetch.task.conversion=more

Official description:

more    : SELECT, FILTER, LIMIT only (support TABLESAMPLE and virtual columns)

With more, any simple single-table SELECT, filter, or limit query can use the fetch path (TABLESAMPLE and virtual columns are also supported). So desc, select *, filters such as select * from user_table where column_n in ("a", "b"), and limit queries all return results without launching MapReduce.
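To check which path a query will take without actually running it, you can inspect its plan with EXPLAIN. A sketch (the exact plan text varies by Hive version):

```sql
-- With hive.fetch.task.conversion=more, a simple filter query's plan
-- should contain a Fetch stage rather than a MapReduce stage.
EXPLAIN select salary from employee_3 where name in ("Lisi");
-- In the STAGE PLANS section, look for "Fetch Operator" (fetch path)
-- versus "Map Reduce" (MapReduce path).
```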

Test operations that do not launch MapReduce:

hive> set hive.fetch.task.conversion=more;
hive> select salary from employee_3 where name in ("Lisi");
OK
4000.0
4000.0
Time taken: 0.425 seconds, Fetched: 2 row(s)
hive> desc employee_3;
OK
name                	string              	                    
salary              	float               	                    
subordinates        	array<string>       	                    
deductions          	map<string,float>   	                    
address             	struct<street:string,city:string,state:string,zip:int>                     
Time taken: 0.067 seconds, Fetched: 5 row(s)
hive> select * from employee_3;
OK
zhangsan	3000.0	["li1","li2","li3"]	{"cd":30.0,"zt":50.0,"sw":100.0}       {"street":"huayanlu","city":"Guiyang","state":"China","zip":550025}
Lisi	4000.0	["w1","w2","w3"]	{"cd":10.0,"zt":40.0,"sw":33.0}	{"street":"changlingjiedao","city":"Guiyang","state":"China","zip":550081}
zhangsan	3000.0	["li1","li2","li3"]	{"cd":30.0,"zt":50.0,"sw":100.0}       {"street":"huayanlu","city":"Guiyang","state":"China","zip":550025}
Lisi	4000.0	["w1","w2","w3"]	{"cd":10.0,"zt":40.0,"sw":33.0}	{"street":"changlingjiedao","city":"Guiyang","state":"China","zip":550081}
Time taken: 0.194 seconds, Fetched: 4 row(s)

hive> select * from employee_3 limit 1;
OK
Zhangsan	3000.0	["li1","li2","li3"]	{"cd":30.0,"zt":50.0,"sw":100.0}       {"street":"huayanlu","city":"Guiyang","state":"China","zip":550025}
Time taken: 0.168 seconds, Fetched: 1 row(s)

Test an operation that does require MapReduce, for example an aggregation. The output log shows a full MapReduce job is executed (both map and reduce phases):

hive> select count(1) from employee_3;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hadoop_20230416134802_ac41c52d-be35-4515-a678-70e43dec35fc
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1681614461744_0006, Tracking URL = http://node1:8088/proxy/application_1681614461744_0006/
Kill Command = /home/hadoop/soft/hadoop/bin/hadoop job  -kill job_1681614461744_0006
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2023-04-16 13:48:12,085 Stage-1 map = 0%,  reduce = 0%
2023-04-16 13:48:19,440 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 3.39 sec
2023-04-16 13:48:26,852 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 6.39 sec
MapReduce Total cumulative CPU time: 6 seconds 390 msec
Ended Job = job_1681614461744_0006
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 6.39 sec   HDFS Read: 9446 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 390 msec
OK
4
Time taken: 26.934 seconds, Fetched: 1 row(s)


Done. Enjoy it!
