Spark Basics - Glossary and Cluster Mode Overview

2025/4/4 15:21:49

Original post: Spark Basics - Glossary and Cluster Mode Overview

This post is based on the official Spark documentation: Cluster Mode Overview

I. Glossary

Term	Meaning	Comment
Application	User program built on Spark. Consists of a driver program and executors on the cluster.	A user program built on Spark. It consists of one driver and multiple executors.
Application jar	A jar containing the user’s Spark application. In some cases users will want to create an “uber jar” containing their application along with its dependencies. The user’s jar should never include Hadoop or Spark libraries, however, these will be added at runtime.	The jar that contains the user's Spark application. It should not bundle the Hadoop or Spark dependencies, because those are added at runtime.
Driver program	The process running the main() function of the application and creating the SparkContext	The process that runs the application's main() function and creates the SparkContext. Put simply, this is where the user's code runs.
Cluster manager	An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN, Kubernetes)	An external service used to acquire cluster resources. The most common examples are YARN and Kubernetes (k8s).
Deploy mode	Distinguishes where the driver process runs. In “cluster” mode, the framework launches the driver inside of the cluster. In “client” mode, the submitter launches the driver outside of the cluster.	Determines where the driver program runs. There are two modes, cluster and client. cluster: the driver is launched inside the cluster. client: the driver is launched outside the cluster.
Worker node	Any node that can run application code in the cluster	A node in the cluster that runs application code.
Executor	A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.	A process launched on a worker node. It runs tasks and keeps their data in memory or on disk. Each application has its own executors.
Task	A unit of work that will be sent to one executor	A logical unit of work executed by a single executor.
Job	A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. `save`, `collect`); you’ll see this term used in the driver’s logs.	Jobs are delimited by actions: each action typically triggers one job.
Stage	Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you’ll see this term used in the driver’s logs.	A job is split into one or more stages, and each stage consists of a set of tasks that run in parallel.
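
The relationship between Job, Stage and Task in the glossary can be made concrete with a small toy model. The sketch below is plain Python, not Spark's API: the class names and the `plan_job` helper are hypothetical, and it assumes every stage has exactly one task per partition, which is a simplification of what a real scheduler does.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Task:
    """One unit of work, sent to a single executor (here: one partition)."""
    stage_id: int
    partition: int


@dataclass
class Stage:
    """A set of tasks that can run in parallel, one per partition."""
    stage_id: int
    tasks: List[Task] = field(default_factory=list)


@dataclass
class Job:
    """Everything spawned in response to one action, split into stages."""
    action: str
    stages: List[Stage] = field(default_factory=list)


def plan_job(action: str, num_stages: int, num_partitions: int) -> Job:
    """Hypothetical planner: build num_stages stages of num_partitions tasks each."""
    stages = [
        Stage(sid, [Task(sid, p) for p in range(num_partitions)])
        for sid in range(num_stages)
    ]
    return Job(action, stages)


# One action (e.g. collect) -> one job -> 2 stages -> 4 tasks per stage.
job = plan_job("collect", num_stages=2, num_partitions=4)
total_tasks = sum(len(stage.tasks) for stage in job.stages)
print(f"action={job.action} stages={len(job.stages)} tasks={total_tasks}")
```

With these inputs the model produces one job with 2 stages and 8 tasks in total, which mirrors the "action, then stages, then tasks" nesting described in the table.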

II. Cluster Mode Overview

This section gives a brief overview of how the components involved in a Spark application run on a cluster.

1. Components

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext object in the main program (the driver program).

To run on a cluster, the SparkContext can connect to several types of cluster manager (Spark's own standalone cluster manager, Mesos, YARN, or Kubernetes) to request the resources the application needs.

Once connected to the cluster manager, Spark acquires executors on nodes in the cluster. Executors are processes that run the application's computations and store its data.

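Next, Spark sends the application code (the JAR or Python files passed to SparkContext) to the executors. Note that it is Spark that ships the code, not the cluster manager.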

Finally, the SparkContext sends tasks to the executors to run.

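In summary: the SparkContext requests resources through the cluster manager, on whatever type of cluster that manager controls, and launches multiple executors. Once the resources are granted, the SparkContext sends the Spark tasks to the executors to run.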
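
The flow summarized above (acquire executors, deliver the application code, send tasks) can be sketched as a toy simulation. This is plain Python and only an illustration: `Executor`, `ClusterManager` and `Driver` are hypothetical classes, everything runs in one process, and the driver hands the code to the executors directly, which glosses over how a real cluster distributes it.

```python
from typing import Callable, List, Optional


class Executor:
    """Toy executor: holds the application code and runs tasks with it."""

    def __init__(self, executor_id: int) -> None:
        self.executor_id = executor_id
        self.app_code: Optional[Callable[[int], int]] = None

    def load_code(self, app_code: Callable[[int], int]) -> None:
        self.app_code = app_code

    def run_task(self, partition: List[int]) -> List[int]:
        if self.app_code is None:
            raise RuntimeError("application code has not been delivered yet")
        return [self.app_code(x) for x in partition]


class ClusterManager:
    """Toy cluster manager: its only job here is to hand out executors."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity

    def allocate(self, requested: int) -> List[Executor]:
        # Grant at most what is still free; a request may be only partly met.
        granted = min(requested, self.capacity)
        self.capacity -= granted
        return [Executor(i) for i in range(granted)]


class Driver:
    """Toy driver: plays the role of the SparkContext in the steps above."""

    def __init__(self, manager: ClusterManager, num_executors: int) -> None:
        # Connect to the cluster manager and acquire executors.
        self.executors = manager.allocate(num_executors)

    def run(self, app_code: Callable[[int], int],
            partitions: List[List[int]]) -> List[List[int]]:
        # Deliver the application code to every executor.
        for executor in self.executors:
            executor.load_code(app_code)
        # Send one task per partition, round-robin over the executors.
        results = []
        for i, partition in enumerate(partitions):
            executor = self.executors[i % len(self.executors)]
            results.append(executor.run_task(partition))
        return results


# Ask for 3 executors on a cluster that only has room for 2.
driver = Driver(ClusterManager(capacity=2), num_executors=3)
output = driver.run(lambda x: x * x, [[1, 2], [3, 4], [5]])
print(len(driver.executors), output)
```

Here the driver ends up with 2 executors and the three partitions are squared to `[[1, 4], [9, 16], [25]]`. The point of the sketch is the ordering of the steps, not the scheduling policy; round-robin assignment is an arbitrary choice made for the example.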
[Figure: Spark architecture]
There are several things to note about this architecture:

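Each application gets its own executor processes, which stay up for the duration of the whole application and run tasks in multiple threads. This isolates applications from each other on both the scheduling side (each driver schedules its own tasks) and the executor side (tasks from different applications run in different JVMs). However, it also means that data cannot be shared across different Spark applications (instances of SparkContext) without writing it to an external storage system.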
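Spark is agnostic to the underlying cluster manager. As long as it can acquire executor processes, and those processes can communicate with each other, it is relatively easy to run Spark on other kinds of cluster manager (e.g. Mesos/YARN/Kubernetes).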
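The driver program must listen for and accept incoming connections from its executors throughout its lifetime (for example, see spark.driver.port in the network config section). As a result, the driver program must be network addressable from the worker nodes.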
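Because the driver schedules tasks on the cluster, it should run close to the worker nodes, preferably on the same local area network. If you need to send requests to a remote cluster, it is better to open an RPC to the driver and have it submit operations from nearby than to run the driver far away from the worker nodes.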
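
One common way to act on the last two notes is to choose the deploy mode and the driver port at submission time. The helper below is a minimal sketch that only assembles a `spark-submit` argument list; `build_submit_command` is a hypothetical function, and the jar name `my-app.jar` and port `40000` are placeholder values. It relies on the `--master`, `--deploy-mode` and `--conf` options of `spark-submit` and on the `spark.driver.port` property mentioned above.

```python
from typing import List, Optional


def build_submit_command(
    app_jar: str,
    master: str,
    deploy_mode: str = "client",
    driver_port: Optional[int] = None,
) -> List[str]:
    """Assemble (but do not run) a spark-submit command line."""
    if deploy_mode not in ("client", "cluster"):
        raise ValueError(f"unknown deploy mode: {deploy_mode!r}")
    cmd = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    if driver_port is not None:
        # Pin the port the driver listens on, e.g. to match a firewall rule.
        cmd += ["--conf", f"spark.driver.port={driver_port}"]
    cmd.append(app_jar)
    return cmd


# "cluster" mode launches the driver inside the cluster, near the workers.
cmd = build_submit_command(
    "my-app.jar", master="yarn", deploy_mode="cluster", driver_port=40000
)
print(" ".join(cmd))
```

Submitting in cluster mode is one way to keep the driver close to the worker nodes when the submitting machine is far away; whether pinning the driver port is needed depends on the network setup.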