Big Data Technology - Hadoop (Part 4): Introduction to YARN and How to Use It


Contents

I. YARN Basic Structure

1. Basic Structure of YARN

2. How YARN Works

II. Common YARN Commands

III. Schedulers

1. Capacity Scheduler

1.1 Features

1.2 Configuration

1.2.1 yarn-site.xml

1.2.2 capacity-scheduler.xml

1.3 Restart YARN, Refresh Queues, and Submit a Test Job to the hive Queue

1.4 Priority Configuration

2. Fair Scheduler

2.1 Features

2.2 Configuration

2.2.1 yarn-site.xml

2.2.2 fair-scheduler.xml

2.3 Testing

IV. YARN Tool Interface Example

1. Code

2. Test

3. Full Code


I. YARN Basic Structure

1. Basic Structure of YARN

YARN consists mainly of the ResourceManager, NodeManager, ApplicationMaster, and Container components.

ResourceManager (RM) responsibilities

  • Handle client requests
  • Monitor the NodeManagers
  • Start and monitor the ApplicationMasters
  • Allocate and schedule resources (CPU, memory, disk, network)

NodeManager (NM) responsibilities

  • Manage the resources of a single node (CPU, memory, disk, network)
  • Carry out commands from the ResourceManager
  • Carry out commands from the ApplicationMaster

ApplicationMaster (AM) responsibilities

  • Request resources for the application and assign them to its tasks
  • Monitor tasks and provide fault tolerance

Container

A Container is YARN's abstraction of a resource allocation: it bundles memory, CPU, disk, network, and other resources on a single node.
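To make the division of labor concrete, here is a minimal client sketch (not from the original article) that uses the public YarnClient API to ask the ResourceManager for the registered NodeManagers and the known applications; the package name is illustrative.

package com.xiaojie.hadoop.yarn.demo; // hypothetical package for this sketch

import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.NodeReport;
import org.apache.hadoop.yarn.api.records.NodeState;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClusterInfo {
    public static void main(String[] args) throws Exception {
        // Locates the ResourceManager through the yarn-site.xml on the classpath
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // NodeManagers registered with the RM and the resources each one offers
        for (NodeReport node : yarnClient.getNodeReports(NodeState.RUNNING)) {
            System.out.println(node.getNodeId() + " capability=" + node.getCapability());
        }

        // Applications known to the RM; each is driven by its own ApplicationMaster running in a Container
        for (ApplicationReport app : yarnClient.getApplications()) {
            System.out.println(app.getApplicationId() + " state=" + app.getYarnApplicationState());
        }

        yarnClient.stop();
    }
}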

2. How YARN Works

  1. The MR program is submitted from the node where the client runs.
  2. YarnRunner asks the ResourceManager for an Application.
  3. The RM returns the application's resource path to YarnRunner.
  4. The client uploads the resources the job needs to HDFS.
  5. Once the resources are uploaded, the client requests that the MRAppMaster be started.
  6. The RM turns the request into a Task.
  7. One of the NodeManagers picks up the Task.
  8. That NodeManager creates a Container and starts the MRAppMaster in it.
  9. The Container copies the job resources from HDFS to the local node.
  10. The MRAppMaster asks the RM for resources to run the MapTasks.
  11. The RM assigns the MapTasks to two other NodeManagers, each of which picks up a task and creates a container.
  12. The MRAppMaster sends startup scripts to the two NodeManagers that received the tasks; each starts a MapTask, which partitions and sorts its data.
  13. After all MapTasks have finished, the MRAppMaster asks the RM for containers to run the ReduceTasks.
  14. Each ReduceTask fetches its partition of the data from the MapTasks.
  15. When the job completes, the MRAppMaster asks the RM to unregister it.
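This lifecycle can be observed from the command line. A rough sketch, assuming the stock examples jar; the ids in angle brackets are placeholders copied from the previous command's output:

# Steps 1-5: the client submits the job (resources are uploaded to HDFS, an AM is requested)
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar pi 2 100 &

# Steps 6-8: the application moves NEW -> SUBMITTED -> ACCEPTED -> RUNNING
yarn application -list -appStates ALL

# Steps 9-14: the attempt and its containers (the AM container plus task containers)
yarn applicationattempt -list <application-id>
yarn container -list <application-attempt-id>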

II. Common YARN Commands

# List applications
yarn app -list        # equivalent: yarn application -list

# Filter by state: ALL,NEW,NEW_SAVING,SUBMITTED,ACCEPTED,RUNNING,FINISHED,FAILED,KILLED
yarn application -list -appStates ALL

# View logs
yarn logs -applicationId application_1735538046453_0007

# Kill an application
yarn app -kill application_1735538046453_0007

# List the attempts of an application
yarn applicationattempt -list application_1735538046453_0007

# Check the status of an ApplicationAttempt (pass the attempt id)
yarn applicationattempt -status appattempt_1735538046453_0007_000001

# List containers
yarn container -list appattempt_1735538046453_0007_000001

# Check container status
yarn container -status <container-id>

# Check node status
yarn node -list -all

# Refresh queues
yarn rmadmin -refreshQueues

# List queues
yarn queue -list all

# Check queue status (default is the queue name)
yarn queue -status default

More commands: https://hadoop.apache.org/docs/r3.4.0/hadoop-yarn/hadoop-yarn-site/YarnCommands.html

III. Schedulers

1. Capacity Scheduler

1.1 Features

  • Hierarchical queues - A queue hierarchy is supported so that resources are shared among an organization's sub-queues before other queues may use free resources, giving more control and predictability.
  • Capacity guarantees - Each queue is allocated a fraction of the cluster's capacity, and that capacity is at the queue's disposal. All applications submitted to a queue have access to the capacity allocated to it. Administrators can configure soft limits and optional hard limits on the capacity allocated to each queue.
  • Security - Each queue has strict ACLs that control which users may submit applications to it.
  • Elasticity - Free resources can be allocated to any queue beyond its configured capacity.
  • Multi-tenancy - Comprehensive limits prevent a single application, user, or queue from monopolizing the queue or the cluster as a whole, ensuring the cluster is not overwhelmed.
  • Operability
    • Runtime configuration - Administrators can change queue definitions and properties (such as capacity and ACLs) at runtime in a safe way, minimizing disruption to users.
    • Draining applications - Administrators can stop a queue at runtime so that no new applications are submitted while the existing ones run to completion.
  • Resource-based scheduling - Support for resource-intensive applications.
  • Queue mapping based on default or user-defined placement rules - Users' jobs can be mapped to specific queues according to placement rules.
  • Priority scheduling - Applications can be submitted and scheduled at different priorities; a higher integer value means a higher priority. Application priority is currently supported only together with the FIFO ordering policy.
  • Percentage resource configuration - Administrators can specify a queue's resources as percentages.
  • Absolute resource configuration - Administrators can specify absolute resources for a queue instead of percentage-based values, for finer control over how much is provisioned for a given queue.
  • Weight resource configuration - Administrators can specify weights for queues instead of percentage-based values, for finer control over queues in a dynamically changing queue hierarchy.
  • Universal capacity vector resource configuration - Administrators can mix absolute, weight, and percentage modes per resource type when specifying a queue's resources, the most flexible way to provision a given queue.
  • Dynamic auto-creation and management of leaf queues - Leaf queues can be created automatically through queue mappings (currently user- and group-based mappings are supported for application placement), and the scheduler manages the capacity of these queues according to a policy configured on the parent queue.

1.2 Configuration

1.2.1 yarn-site.xml
<!-- Scheduler class; the Capacity Scheduler is the default -->
<property>
	<description>The class to use as the resource scheduler.</description>
	<name>yarn.resourcemanager.scheduler.class</name>
	<value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler</value>
</property>

<!-- Number of threads the ResourceManager uses to handle scheduler requests; default 50. Increase it when more than 50 jobs are submitted concurrently, but on this cluster it cannot usefully exceed 3 nodes * 4 threads = 12 threads (leaving headroom for other processes, no more than 8 in practice) -->
<property>
	<description>Number of threads to handle scheduler interface.</description>
	<name>yarn.resourcemanager.scheduler.client.thread-count</name>
	<value>8</value>
</property>

<!-- Whether YARN should auto-detect the hardware for its resource configuration; default false. If the node runs many other applications, configure resources manually; if it runs nothing else, auto-detection works -->
<property>
	<description>Enable auto-detection of node capabilities such as
	memory and CPU.
	</description>
	<name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
	<value>false</value>
</property>

<!-- Whether to count logical processors (e.g. hyper-threads) as cores; default false, i.e. use the physical core count -->
<property>
	<description>Flag to determine if logical processors(such as
	hyperthreads) should be counted as cores. Only applicable on Linux
	when yarn.nodemanager.resource.cpu-vcores is set to -1 and
	yarn.nodemanager.resource.detect-hardware-capabilities is true.
	</description>
	<name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
	<value>false</value>
</property>

<!-- Multiplier for converting physical cores to vcores; default 1.0 -->
<property>
	<description>Multiplier to determine how to convert physical cores to
	vcores. This value is used if yarn.nodemanager.resource.cpu-vcores
	is set to -1(which implies auto-calculate vcores) and
	yarn.nodemanager.resource.detect-hardware-capabilities is set to true. The	number of vcores will be calculated as	number of CPUs * multiplier.
	</description>
	<name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>
	<value>1.0</value>
</property>

<!-- Memory the NodeManager may hand out to containers; default 8 GB, reduced to 4 GB here -->
<property>
	<description>Amount of physical memory, in MB, that can be allocated 
	for containers. If set to -1 and
	yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
	automatically calculated(in case of Windows and Linux).
	In other cases, the default is 8192MB.
	</description>
	<name>yarn.nodemanager.resource.memory-mb</name>
	<value>4096</value>
</property>

<!-- Number of NodeManager vcores; default 8 when not auto-detected from the hardware, reduced to 4 here -->
<property>
	<description>Number of vcores that can be allocated
	for containers. This is used by the RM scheduler when allocating
	resources for containers. This is not used to limit the number of
	CPUs used by YARN containers. If it is set to -1 and
	yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
	automatically determined from the hardware in case of Windows and Linux.
	In other cases, number of vcores is 8 by default.</description>
	<name>yarn.nodemanager.resource.cpu-vcores</name>
	<value>4</value>
</property>

<!-- Minimum container memory; default 1 GB -->
<property>
	<description>The minimum allocation for every container request at the RM	in MBs. Memory requests lower than this will be set to the value of this	property. Additionally, a node manager that is configured to have less memory	than this value will be shut down by the resource manager.
	</description>
	<name>yarn.scheduler.minimum-allocation-mb</name>
	<value>1024</value>
</property>

<!-- Maximum container memory; default 8 GB, reduced to 2 GB here -->
<property>
	<description>The maximum allocation for every container request at the RM	in MBs. Memory requests higher than this will throw an	InvalidResourceRequestException.
	</description>
	<name>yarn.scheduler.maximum-allocation-mb</name>
	<value>2048</value>
</property>

<!-- Minimum container vcores; default 1 -->
<property>
	<description>The minimum allocation for every container request at the RM	in terms of virtual CPU cores. Requests lower than this will be set to the	value of this property. Additionally, a node manager that is configured to	have fewer virtual cores than this value will be shut down by the resource	manager.
	</description>
	<name>yarn.scheduler.minimum-allocation-vcores</name>
	<value>1</value>
</property>

<!-- Maximum container vcores; default 4, reduced to 2 here -->
<property>
	<description>The maximum allocation for every container request at the RM	in terms of virtual CPU cores. Requests higher than this will throw an
	InvalidResourceRequestException.</description>
	<name>yarn.scheduler.maximum-allocation-vcores</name>
	<value>2</value>
</property>

<!-- Virtual-memory check; on by default, turned off here -->
<property>
	<description>Whether virtual memory limits will be enforced for
	containers.</description>
	<name>yarn.nodemanager.vmem-check-enabled</name>
	<value>false</value>
</property>

<!-- Ratio of virtual to physical memory; default 2.1 -->
<property>
	<description>Ratio between virtual memory to physical memory when	setting memory limits for containers. Container allocations are	expressed in terms of physical memory, and virtual memory usage	is allowed to exceed this allocation by this ratio.
	</description>
	<name>yarn.nodemanager.vmem-pmem-ratio</name>
	<value>2.1</value>
</property>
1.2.2 capacity-scheduler.xml
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
    <property>
        <name>yarn.scheduler.capacity.maximum-applications</name>
        <value>10000</value>
        <description>
      Maximum number of applications that can be pending and running.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
        <value>0.1</value>
        <description>
      Maximum percent of resources in the cluster which can be used to run 
      application masters i.e. controls number of concurrent running
      applications.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.resource-calculator</name>
        <value>org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator</value>
        <description>
      The ResourceCalculator implementation to be used to compare 
      Resources in the scheduler.
      The default i.e. DefaultResourceCalculator only uses Memory while
      DominantResourceCalculator uses dominant-resource to compare 
      multi-dimensional resources such as Memory, CPU etc.
    </description>
    </property>
    <!-- Add queues: default plus a new hive queue -->
    <property>
        <name>yarn.scheduler.capacity.root.queues</name>
        <value>default,hive</value>
        <description>
      The queues at this level (root is the root queue).
    </description>
    </property>
    <!-- Hierarchical queues: a commented-out example of child queues under hive
    <property>
        <name>yarn.scheduler.capacity.root.hive.queues</name>
        <value>hive1,hive2</value>
        <description>The queues at this level (root is the root queue).</description>
    </property>
    -->
    <!-- Capacity percentage of the default queue -->
    <property>
        <name>yarn.scheduler.capacity.root.default.capacity</name>
        <value>40</value>
        <description>Default queue target capacity.</description>
    </property>
    <!-- Rated capacity percentage of the hive queue -->
    <property>
        <name>yarn.scheduler.capacity.root.hive.capacity</name>
        <value>60</value>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
        <value>1</value>
        <description>
      Default queue user limit a percentage from 0.0 to 1.0.
    </description>
    </property>
    <!-- Factor limiting how much of the queue a single user may consume; 1 means one user may use up to the queue's full configured capacity -->
    <property>
        <name>yarn.scheduler.capacity.root.hive.user-limit-factor</name>
        <value>1</value>
        <description>
      Hive queue user limit a percentage from 0.0 to 1.0.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.maximum-capacity</name>
        <value>60</value>
        <description>
      The maximum capacity of the default queue. 
    </description>
    </property>
    <!-- Maximum capacity percentage of the hive queue -->
    <property>
        <name>yarn.scheduler.capacity.root.hive.maximum-capacity</name>
        <value>80</value>
        <description>
      The maximum capacity of the hive queue.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.state</name>
        <value>RUNNING</value>
        <description>
      The state of the default queue. State can be one of RUNNING or STOPPED.
    </description>
    </property>
    <!-- State of the hive queue -->
    <property>
        <name>yarn.scheduler.capacity.root.hive.state</name>
        <value>RUNNING</value>
        <description>
      The state of the hive queue. State can be one of RUNNING or STOPPED.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.acl_submit_applications</name>
        <value>*</value>
        <description>
      The ACL of who can submit jobs to the default queue.
    </description>
    </property>
    <!-- * means any user -->
    <property>
        <name>yarn.scheduler.capacity.root.hive.acl_submit_applications</name>
        <value>*</value>
        <description>
      The ACL of who can submit jobs to the hive queue.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.acl_administer_queue</name>
        <value>*</value>
        <description>
      The ACL of who can administer jobs on the default queue.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.hive.acl_administer_queue</name>
        <value>*</value>
        <description>
      The ACL of who can administer jobs on the hive queue.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.acl_application_max_priority</name>
        <value>*</value>
        <description>
      The ACL of who can submit applications with configured priority.
      For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}]
    </description>
    </property>
    <!-- Who may submit applications with a configured priority to the hive queue -->
    <property>
        <name>yarn.scheduler.capacity.root.hive.acl_application_max_priority</name>
        <value>*</value>
        <description>
      The ACL of who can submit applications with configured priority.
      For e.g, [user={name} group={name} max_priority={priority} default_priority={priority}]
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.maximum-application-lifetime
     </name>
        <value>-1</value>
        <description>
        Maximum lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        This will be a hard time limit for all applications in this
        queue. If positive value is configured then any application submitted
        to this queue will be killed after exceeds the configured lifetime.
        User can also specify lifetime per application basis in
        application submission context. But user lifetime will be
        overridden if it exceeds queue maximum lifetime. It is point-in-time
        configuration.
        Note : Configuring too low value will result in killing application
        sooner. This feature is applicable only for leaf queue.
     </description>
    </property>
    <!-- Maximum lifetime, in seconds, of an application submitted to the hive queue -->
    <property>
        <name>yarn.scheduler.capacity.root.hive.maximum-application-lifetime
     </name>
        <value>-1</value>
        <description>
        Maximum lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        This will be a hard time limit for all applications in this
        queue. If positive value is configured then any application submitted
        to this queue will be killed after exceeds the configured lifetime.
        User can also specify lifetime per application basis in
        application submission context. But user lifetime will be
        overridden if it exceeds queue maximum lifetime. It is point-in-time
        configuration.
        Note : Configuring too low value will result in killing application
        sooner. This feature is applicable only for leaf queue.
     </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.root.default.default-application-lifetime
     </name>
        <value>-1</value>
        <description>
        Default lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        If the user has not submitted application with lifetime value then this
        value will be taken. It is point-in-time configuration.
        Note : Default lifetime can't exceed maximum lifetime. This feature is
        applicable only for leaf queue.
     </description>
    </property>
    <!-- Default lifetime of an application submitted to the hive queue -->
    <property>
        <name>yarn.scheduler.capacity.root.hive.default-application-lifetime
     </name>
        <value>-1</value>
        <description>
        Default lifetime of an application which is submitted to a queue
        in seconds. Any value less than or equal to zero will be considered as
        disabled.
        If the user has not submitted application with lifetime value then this
        value will be taken. It is point-in-time configuration.
        Note : Default lifetime can't exceed maximum lifetime. This feature is
        applicable only for leaf queue.
     </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.node-locality-delay</name>
        <value>40</value>
        <description>
      Number of missed scheduling opportunities after which the CapacityScheduler 
      attempts to schedule rack-local containers.
      When setting this parameter, the size of the cluster should be taken into account.
      We use 40 as the default value, which is approximately the number of nodes in one rack.
      Note, if this value is -1, the locality constraint in the container request
      will be ignored, which disables the delay scheduling.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.rack-locality-additional-delay</name>
        <value>-1</value>
        <description>
      Number of additional missed scheduling opportunities over the node-locality-delay
      ones, after which the CapacityScheduler attempts to schedule off-switch containers,
      instead of rack-local ones.
      Example: with node-locality-delay=40 and rack-locality-delay=20, the scheduler will
      attempt rack-local assignments after 40 missed opportunities, and off-switch assignments
      after 40+20=60 missed opportunities.
      When setting this parameter, the size of the cluster should be taken into account.
      We use -1 as the default value, which disables this feature. In this case, the number
      of missed opportunities for assigning off-switch containers is calculated based on
      the number of containers and unique locations specified in the resource request,
      as well as the size of the cluster.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.queue-mappings</name>
        <value></value>
        <description>
      A list of mappings that will be used to assign jobs to queues
      The syntax for this list is [u|g]:[name]:[queue_name][,next mapping]*
      Typically this list will be used to map users to queues,
      for example, u:%user:%user maps all users to queues with the same name
      as the user.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.queue-mappings-override.enable</name>
        <value>false</value>
        <description>
      If a queue mapping is present, will it override the value specified
      by the user? This can be used by administrators to place jobs in queues
      that are different than the one specified by the user.
      The default is false.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.per-node-heartbeat.maximum-offswitch-assignments</name>
        <value>1</value>
        <description>
      Controls the number of OFF_SWITCH assignments allowed
      during a node's heartbeat. Increasing this value can improve
      scheduling rate for OFF_SWITCH containers. Lower values reduce
      "clumping" of applications on particular nodes. The default is 1.
      Legal values are 1-MAX_INT. This config is refreshable.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.application.fail-fast</name>
        <value>false</value>
        <description>
      Whether RM should fail during recovery if previous applications'
      queue is no longer valid.
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.workflow-priority-mappings</name>
        <value></value>
        <description>
      A list of mappings that will be used to override application priority.
      The syntax for this list is
      [workflowId]:[full_queue_name]:[priority][,next mapping]*
      where an application submitted (or mapped to) queue "full_queue_name"
      and workflowId "workflowId" (as specified in application submission
      context) will be given priority "priority".
    </description>
    </property>
    <property>
        <name>yarn.scheduler.capacity.workflow-priority-mappings-override.enable</name>
        <value>false</value>
        <description>
      If a priority mapping is present, will it override the value specified
      by the user? This can be used by administrators to give applications a
      priority that is different than the one specified by the user.
      The default is false.
    </description>
    </property>
</configuration>

1.3 Restart YARN, Refresh Queues, and Submit a Test Job to the hive Queue

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar wordcount -D mapreduce.job.queuename=hive /input  /output11
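To confirm that the job really landed in the hive queue, a quick check using commands from section II:

# Configured capacity and current usage of the hive queue
yarn queue -status hive

# The queue column of the application list should show hive
yarn application -list -appStates RUNNING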

1.4 Priority Configuration

yarn-site.xml

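<!-- Maximum application priority in the cluster; default 0, i.e. priorities disabled. Applications submitted with a higher priority are reset to this value -->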
<property>
    <name>yarn.cluster.max-application-priority</name>
    <value>5</value>
</property>

Restart YARN and test:

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar pi 5 2000000


hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar pi  -D mapreduce.job.priority=5 5 2000000
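The priority of an application that is already running can also be updated; a sketch with a placeholder application id:

yarn application -appId <application-id> -updatePriority 5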

2. Fair Scheduler

2.1 Features

  • Hierarchical queues - A queue hierarchy is supported so that resources are shared among an organization's sub-queues before other queues may use free resources, giving more control and predictability.
  • Capacity guarantees - Each queue is allocated a fraction of the cluster's capacity, and that capacity is at the queue's disposal. All applications submitted to a queue have access to the capacity allocated to it. Administrators can configure soft limits and optional hard limits on each queue's capacity.
  • Security - Each queue has strict ACLs that control which users may submit applications to it.
  • Elasticity - Free resources can be allocated to any queue beyond its configured capacity.
  • Multi-tenancy - Comprehensive limits prevent a single application, user, or queue from monopolizing the queue or the cluster, ensuring the cluster is not overwhelmed.

The differences from the Capacity Scheduler lie in the scheduling strategy and in the per-queue policies. The Capacity Scheduler preferentially assigns resources to the queue with the lowest resource utilization, while the Fair Scheduler preferentially assigns them to the queue with the largest resource deficit (the gap between its fair share and what it currently holds). The per-queue allocation policies also differ: the Capacity Scheduler offers FIFO and DRF (Dominant Resource Fairness), while the Fair Scheduler offers FifoPolicy, FairSharePolicy (the default), and DominantResourceFairnessPolicy.
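As an illustration, here is a minimal fair-scheduler.xml sketch selecting each of the three per-queue policies (the queue names are made up for the example; the real allocation file used in this article follows in 2.2.2):

<allocations>
    <queue name="q1">
        <schedulingPolicy>fifo</schedulingPolicy> <!-- FifoPolicy -->
    </queue>
    <queue name="q2">
        <schedulingPolicy>fair</schedulingPolicy> <!-- FairSharePolicy, the default -->
    </queue>
    <queue name="q3">
        <schedulingPolicy>drf</schedulingPolicy>  <!-- DominantResourceFairnessPolicy -->
    </queue>
</allocations>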

2.2 Configuration

2.2.1 yarn-site.xml
<?xml version="1.0"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->
<configuration>
    <!-- Site specific YARN configuration properties -->
    <!-- Run the MapReduce shuffle as an auxiliary service -->
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- ResourceManager host -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop2</value>
    </property>
    <!-- Environment variables containers inherit -->
    <property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,CLASSPATH_PREPEND_DISTCACHE,HADOOP_YARN_HOME,HADOOP_HOME,PATH,LANG,TZ,HADOOP_MAPRED_HOME</value>
    </property>
    <!-- Enable log aggregation -->
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <!-- Log server URL for aggregated logs -->
    <property>
        <name>yarn.log.server.url</name>
        <value>http://hadoop102:19888/jobhistory/logs</value>
    </property>
    <!-- Keep aggregated logs for 7 days -->
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <!-- Number of threads the ResourceManager uses to handle scheduler requests; default 50. Increase it when more than 50 jobs are submitted concurrently, but on this cluster it cannot usefully exceed 3 nodes * 4 threads = 12 threads (leaving headroom for other processes, no more than 8 in practice) -->
    <property>
        <description>Number of threads to handle scheduler interface.</description>
        <name>yarn.resourcemanager.scheduler.client.thread-count</name>
        <value>8</value>
    </property>
    <!-- Whether YARN should auto-detect the hardware for its resource configuration; default false. If the node runs many other applications, configure resources manually; if it runs nothing else, auto-detection works -->
    <property>
        <description>Enable auto-detection of node capabilities such as
	memory and CPU.
	</description>
        <name>yarn.nodemanager.resource.detect-hardware-capabilities</name>
        <value>false</value>
    </property>
    <!-- Whether to count logical processors (e.g. hyper-threads) as cores; default false, i.e. use the physical core count -->
    <property>
        <description>Flag to determine if logical processors(such as
	hyperthreads) should be counted as cores. Only applicable on Linux
	when yarn.nodemanager.resource.cpu-vcores is set to -1 and
	yarn.nodemanager.resource.detect-hardware-capabilities is true.
	</description>
        <name>yarn.nodemanager.resource.count-logical-processors-as-cores</name>
        <value>false</value>
    </property>
    <!-- Multiplier for converting physical cores to vcores; default 1.0 -->
    <property>
        <description>Multiplier to determine how to convert physical cores to
	vcores. This value is used if yarn.nodemanager.resource.cpu-vcores
	is set to -1(which implies auto-calculate vcores) and
	yarn.nodemanager.resource.detect-hardware-capabilities is set to true. The	number of vcores will be calculated as	number of CPUs * multiplier.
	</description>
        <name>yarn.nodemanager.resource.pcores-vcores-multiplier</name>
        <value>1.0</value>
    </property>
    <!-- Memory the NodeManager may hand out to containers; default 8 GB, reduced to 4 GB here -->
    <property>
        <description>Amount of physical memory, in MB, that can be allocated 
	for containers. If set to -1 and
	yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
	automatically calculated(in case of Windows and Linux).
	In other cases, the default is 8192MB.
	</description>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>4096</value>
    </property>
    <!-- Number of NodeManager vcores; default 8 when not auto-detected from the hardware, reduced to 4 here -->
    <property>
        <description>Number of vcores that can be allocated
	for containers. This is used by the RM scheduler when allocating
	resources for containers. This is not used to limit the number of
	CPUs used by YARN containers. If it is set to -1 and
	yarn.nodemanager.resource.detect-hardware-capabilities is true, it is
	automatically determined from the hardware in case of Windows and Linux.
	In other cases, number of vcores is 8 by default.</description>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>4</value>
    </property>
    <!-- Minimum container memory; default 1 GB -->
    <property>
        <description>The minimum allocation for every container request at the RM	in MBs. Memory requests lower than this will be set to the value of this	property. Additionally, a node manager that is configured to have less memory	than this value will be shut down by the resource manager.
	</description>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1024</value>
    </property>
    <!-- Maximum container memory; default 8 GB, reduced to 2 GB here -->
    <property>
        <description>The maximum allocation for every container request at the RM	in MBs. Memory requests higher than this will throw an	InvalidResourceRequestException.
	</description>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
    </property>
    <!-- Minimum container vcores; default 1 -->
    <property>
        <description>The minimum allocation for every container request at the RM	in terms of virtual CPU cores. Requests lower than this will be set to the	value of this property. Additionally, a node manager that is configured to	have fewer virtual cores than this value will be shut down by the resource	manager.
	</description>
        <name>yarn.scheduler.minimum-allocation-vcores</name>
        <value>1</value>
    </property>
    <!-- Maximum container vcores; default 4, reduced to 2 here -->
    <property>
        <description>The maximum allocation for every container request at the RM	in terms of virtual CPU cores. Requests higher than this will throw an
	InvalidResourceRequestException.</description>
        <name>yarn.scheduler.maximum-allocation-vcores</name>
        <value>2</value>
    </property>
    <!-- Virtual-memory check; on by default, turned off here -->
    <property>
        <description>Whether virtual memory limits will be enforced for
	containers.</description>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <!-- Ratio of virtual to physical memory; default 2.1 -->
    <property>
        <description>Ratio between virtual memory to physical memory when	setting memory limits for containers. Container allocations are	expressed in terms of physical memory, and virtual memory usage	is allowed to exceed this allocation by this ratio.
	</description>
        <name>yarn.nodemanager.vmem-pmem-ratio</name>
        <value>2.1</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.class</name>
        <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
        <description>Use the Fair Scheduler</description>
    </property>
    <property>
        <name>yarn.scheduler.fair.allocation.file</name>
        <value>/usr/local/hadoop-3.4.0/etc/hadoop/fair-scheduler.xml</value>
        <description>Path to the Fair Scheduler allocation file that defines the queues</description>
    </property>
    <property>
        <name>yarn.scheduler.fair.preemption</name>
        <value>false</value>
        <description>Disable resource preemption between queues</description>
    </property>
</configuration>

2.2.2 fair-scheduler.xml
<?xml version="1.0"?>
<allocations>
    <!-- Default maximum fraction (0-1) of a queue's resources that Application Masters may occupy; production clusters commonly use 0.1 -->
    <queueMaxAMShareDefault>0.5</queueMaxAMShareDefault>
    <!-- Default maximum resources for a single queue -->
    <queueMaxResourcesDefault>4096mb,4vcores</queueMaxResourcesDefault>
    <!-- Add a queue named test -->
    <queue name="test">
        <!-- Minimum queue resources -->
        <minResources>2048mb,2vcores</minResources>
        <!-- Maximum queue resources -->
        <maxResources>4096mb,4vcores</maxResources>
        <!-- Maximum number of applications running in the queue at once; default 50, tune to the available scheduler threads -->
        <maxRunningApps>4</maxRunningApps>
        <!-- Maximum fraction of the queue's resources that Application Masters may occupy -->
        <maxAMShare>0.5</maxAMShare>
        <!-- Weight of this queue; default 1.0 -->
        <weight>1.0</weight>
        <!-- Scheduling policy inside the queue -->
        <schedulingPolicy>fair</schedulingPolicy>
    </queue>
    <!-- Add a parent queue named demo -->
    <queue name="demo" type="parent">
        <!-- Minimum queue resources -->
        <minResources>2048mb,2vcores</minResources>
        <!-- Maximum queue resources -->
        <maxResources>4096mb,4vcores</maxResources>
        <!-- Maximum number of applications running in the queue at once; default 50, tune to the available scheduler threads -->
        <maxRunningApps>4</maxRunningApps>
        <!-- Weight of this queue; default 1.0 -->
        <weight>1.0</weight>
        <!-- Scheduling policy inside the queue -->
        <schedulingPolicy>fair</schedulingPolicy>
    </queue>
    <!-- Queue placement policy: rules can be layered and are matched from the first one until one succeeds -->
    <queuePlacementPolicy>
        <!-- Use the queue named when the job was submitted; if no queue was given, fall through to the next rule.
             create="false": a named queue that does not exist is not auto-created -->
        <rule name="specified" create="false"/>
        <!-- Place the job in root.<primary-group>.<user>; root.<primary-group> is not auto-created (create="false"),
             while the per-user child queue may be auto-created (create="true") -->
        <rule name="nestedUserQueue" create="true">
            <rule name="primaryGroup" create="false"/>
        </rule>
        <!-- The last rule must be reject or default: reject fails the submission, default sends the job to the default queue -->
        <rule name="reject" />
    </queuePlacementPolicy>
</allocations>

2.3 Testing

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar pi -Dmapreduce.job.queuename=root.test 1 1

# With no queue specified, a queue named after the submitting user is created (per the placement rules) and the job runs there
hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar pi 1 1

hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-3.4.0.jar wordcount -D mapreduce.job.queuename=root.test /input  /output11
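To see where the placement rules put each job, list the applications and look at the queue column; with the rules above it shows root.test when the queue was given explicitly, or a per-user queue created by the nestedUserQueue rule otherwise:

yarn application -list -appStates ALL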

IV. YARN Tool Interface Example

1. Code

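// WordCount.java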
package com.xiaojie.hadoop.mapreduce.yarntool.count;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;

import java.io.IOException;

public class WordCount implements Tool {

    private Configuration conf;

    @Override
    public int run(String[] args) throws Exception {

        Job job = Job.getInstance(conf);

        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        return job.waitForCompletion(true) ? 0 : 1;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public Configuration getConf() {
        return conf;
    }

    public static class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        private Text outK = new Text();
        private IntWritable outV = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, IntWritable>.Context context) throws IOException, InterruptedException {
            String line = value.toString();
            String[] words = line.split(" ");

            for (String word : words) {
                outK.set(word);

                context.write(outK, outV);
            }
        }
    }

    public static class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable outV = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Reducer<Text, IntWritable, Text, IntWritable>.Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            outV.set(sum);
            context.write(key, outV);
        }
    }
}
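
// WordCountDriver.java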
package com.xiaojie.hadoop.mapreduce.yarntool.count;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

import java.util.Arrays;

public class WordCountDriver {

    private static Tool tool;

    public static void main(String[] args) throws Exception {
        // 1. Create the configuration
        Configuration conf = new Configuration();

        // 2. Select the Tool implementation based on the first argument
        switch (args[0]) {
            case "wordcount":
                tool = new WordCount();
                break;
            default:
                throw new RuntimeException("No such tool: " + args[0]);
        }
        // 3. Run the selected tool via ToolRunner
        // Arrays.copyOfRange drops the tool name and passes the remaining args to the tool
        int run = ToolRunner.run(conf, tool, Arrays.copyOfRange(args, 1, args.length));

        System.exit(run);
    }
}

2. Test

yarn jar yarn-demo.jar com.xiaojie.hadoop.mapreduce.yarntool.count.WordCountDriver wordcount -Dmapreduce.job.queuename=root.test /input /output222

3. Full Code

spring-boot: Spring Boot integration with Redis, message middleware, and related code - Gitee.com
