Spark Compatibility Verification


Preface

Apache Spark is a fast, general-purpose computing engine designed for large-scale data processing. Spark keeps the advantages of Hadoop MapReduce, but unlike MapReduce it can hold intermediate job results in memory instead of writing them to and reading them back from HDFS, which makes it better suited to iterative MapReduce-style workloads such as data mining and machine learning.

Spark is an open-source cluster computing environment similar to Hadoop, but the two differ in several ways: Spark works on in-memory distributed datasets, so besides interactive queries it can also optimize iterative workloads.

Spark features:
1. Speed: with in-memory computation, Spark can run up to 100x faster than Hadoop MapReduce.
2. Ease of use: Spark applications can be developed in Java, Scala, Python, R, and SQL, and Spark provides more than 80 high-level operators (see the short sketch after this list).
3. Generality: Spark ships with a rich set of libraries, including Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX, which can be combined seamlessly within a single application.
4. Multiple runtime environments: Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in cloud environments.
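
As a small illustration of the high-level operators mentioned above, here is a minimal PySpark word-count sketch. It assumes pyspark is installed locally; the input file words.txt and the application name are placeholders, not part of the original setup.

from pyspark.sql import SparkSession

# Run locally with all available cores; no cluster is required for this sketch.
spark = SparkSession.builder.master("local[*]").appName("wordcount-sketch").getOrCreate()

# words.txt is a placeholder text file with one or more words per line.
counts = (spark.sparkContext.textFile("words.txt")
          .flatMap(lambda line: line.split())   # split each line into words
          .map(lambda word: (word, 1))          # pair each word with a count of 1
          .reduceByKey(lambda a, b: a + b))     # sum the counts per word

print(counts.take(10))
spark.stop()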

Reference: https://blog.csdn.net/cuiyaonan2000/article/details/116048663

Typical Spark use cases:
1. Spark performs iterative computation in memory, so it fits workloads that operate on the same dataset repeatedly.
2. Scenarios where the data volume is not extremely large but real-time statistics and analysis are required.
3. It is not suitable for applications that make asynchronous, fine-grained updates to state, such as storage for web applications, incremental web crawlers, and indexing.

Spark deployment modes:
1. Run on a single machine: Local mode.
2. Use Spark's built-in resource scheduler: Standalone mode.
3. Use YARN, Mesos, or Kubernetes as the underlying resource scheduler: Spark on YARN, Spark on Mesos, Spark on K8s (the sketch after this list shows the corresponding master URLs).
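
As a rough sketch of how a client selects these modes, the master URL passed to SparkSession.builder decides where the job runs; the host names and ports below are placeholders, and YARN/Kubernetes additionally need cluster-side configuration that is not shown here.

from pyspark.sql import SparkSession

# Pick one master URL depending on the deployment mode (values are placeholders):
#   Local mode:      "local[*]"                        use all cores of this machine
#   Standalone mode: "spark://bogon:7077"              master started by sbin/start-master.sh
#   Spark on YARN:   "yarn"                            needs HADOOP_CONF_DIR pointing to the cluster config
#   Spark on K8s:    "k8s://https://<apiserver>:6443"  needs a container image and service-account setup
master_url = "local[*]"

spark = SparkSession.builder.master(master_url).appName("mode-demo").getOrCreate()
print(spark.sparkContext.master)   # confirm which master the session actually uses
spark.stop()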

Reference: https://blog.csdn.net/jiayi_yao/article/details/125545826#t8

1. Installation and Startup

Install Spark and its dependencies:
yum install java-1.8.0-openjdk curl tar python3
mkdir -p /usr/local/spark
cd /usr/local/spark
wget https://mirrors.aliyun.com/apache/spark/spark-3.3.2/spark-3.3.2-bin-hadoop3.tgz
tar -xvf spark-3.3.2-bin-hadoop3.tgz

  • Start the Spark master
    cd /usr/local/spark/spark-3.3.2-bin-hadoop3/
    [root@bogon spark-3.3.2-bin-hadoop3]# ./sbin/start-master.sh
    Output similar to the following appears:
    starting org.apache.spark.deploy.master.Master, logging to /usr/local/spark/spark-3.3.2-bin-hadoop3/logs/spark-root-org.apache.spark.deploy.master.Master-1-bogon.out
    Use tail to inspect the log:
[root@bogon spark-3.3.2-bin-hadoop3]# tail logs/spark-root-org.apache.spark.deploy.master.Master-1-bogon.out 
23/03/06 14:35:44 INFO SecurityManager: Changing modify acls to: root
23/03/06 14:35:44 INFO SecurityManager: Changing view acls groups to: 
23/03/06 14:35:44 INFO SecurityManager: Changing modify acls groups to: 
23/03/06 14:35:44 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
23/03/06 14:35:45 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
23/03/06 14:35:45 INFO Master: Starting Spark master at spark://bogon:7077
23/03/06 14:35:45 INFO Master: Running Spark version 3.3.2
23/03/06 14:35:45 INFO Utils: Successfully started service 'MasterUI' on port 8080.
23/03/06 14:35:45 INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://bogon:8080
23/03/06 14:35:46 INFO Master: I have been elected leader! New state: ALIVE
  • Start the Spark worker
    cd /usr/local/spark/spark-3.3.2-bin-hadoop3/
    [root@bogon spark-3.3.2-bin-hadoop3]# ./sbin/start-worker.sh spark://bogon:7077
    Output similar to the following appears:
    starting org.apache.spark.deploy.worker.Worker, logging to /usr/local/spark/spark-3.3.2-bin-hadoop3/logs/spark-root-org.apache.spark.deploy.worker.Worker-1-bogon.out
    Use tail to inspect the log:
[root@bogon spark-3.3.2-bin-hadoop3]# tail logs/spark-root-org.apache.spark.deploy.worker.Worker-1-bogon.out 
23/03/06 14:52:17 INFO Worker: Spark home: /usr/local/spark/spark-3.3.2-bin-hadoop3
23/03/06 14:52:17 INFO ResourceUtils: ==============================================================
23/03/06 14:52:17 INFO ResourceUtils: No custom resources configured for spark.worker.
23/03/06 14:52:17 INFO ResourceUtils: ==============================================================
23/03/06 14:52:17 WARN Utils: Service 'WorkerUI' could not bind on port 8081. Attempting port 8082.
23/03/06 14:52:17 INFO Utils: Successfully started service 'WorkerUI' on port 8082.
23/03/06 14:52:17 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://bogon:8082
23/03/06 14:52:17 INFO Worker: Connecting to master bogon:7077...
23/03/06 14:52:17 INFO TransportClientFactory: Successfully created connection to bogon/10.130.0.73:7077 after 85 ms (0 ms spent in bootstraps)
23/03/06 14:52:18 INFO Worker: Successfully registered with master spark://bogon:7077
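
With the master and worker both running, a quick end-to-end check can be done from pyspark before moving on to the tests below. This is a minimal sketch: it assumes the master URL spark://bogon:7077 taken from the logs above and that pyspark is importable on the machine submitting the job.

from pyspark.sql import SparkSession

# Connect to the standalone master started above (URL taken from the master log).
spark = (SparkSession.builder
         .master("spark://bogon:7077")
         .appName("standalone-smoke-test")
         .getOrCreate())

# Run a trivial distributed job: the sum of 0..999 should be 499500.
total = spark.sparkContext.parallelize(range(1000)).sum()
print("sum =", total)

spark.stop()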

2. Testing

  • Visit http://$IP:8080 to open the master page, which shows summary information such as resource consumption.
    If the deployment succeeded, a page similar to the following is shown:
    [Screenshot: Spark master web UI]
  • The worker's status can likewise be checked in a browser on port 8082 of the corresponding server (the default is 8081; the actual port is printed in the worker log).
    [Screenshot: Spark worker web UI]
  • Submit a test job that computes pi, using the following command (a minimal sketch of the estimation logic follows the output log):
    ./bin/spark-submit --master spark://bogon:7077 examples/src/main/python/pi.py 1000
    Output similar to the following appears:
 23/03/06 16:32:55 INFO TaskSetManager: Starting task 999.0 in stage 0.0 (TID 999) (10.130.0.73, executor 0, partition 999, PROCESS_LOCAL, 4437 bytes) taskResourceAssignments Map()
23/03/06 16:32:55 INFO TaskSetManager: Finished task 995.0 in stage 0.0 (TID 995) in 217 ms on 10.130.0.73 (executor 0) (996/1000)
23/03/06 16:32:55 INFO TaskSetManager: Finished task 996.0 in stage 0.0 (TID 996) in 220 ms on 10.130.0.73 (executor 0) (997/1000)
23/03/06 16:32:55 INFO TaskSetManager: Finished task 997.0 in stage 0.0 (TID 997) in 198 ms on 10.130.0.73 (executor 0) (998/1000)
23/03/06 16:32:55 INFO TaskSetManager: Finished task 998.0 in stage 0.0 (TID 998) in 189 ms on 10.130.0.73 (executor 0) (999/1000)
23/03/06 16:32:55 INFO TaskSetManager: Finished task 999.0 in stage 0.0 (TID 999) in 238 ms on 10.130.0.73 (executor 0) (1000/1000)
23/03/06 16:32:55 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool 
23/03/06 16:32:55 INFO DAGScheduler: ResultStage 0 (reduce at /usr/local/spark/spark-3.3.2-bin-hadoop3/examples/src/main/python/pi.py:42) finished in 64.352 s
23/03/06 16:32:55 INFO DAGScheduler: Job 0 is finished. Cancelling potential speculative or zombie tasks for this job
23/03/06 16:32:55 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage finished
23/03/06 16:32:55 INFO DAGScheduler: Job 0 finished: reduce at /usr/local/spark/spark-3.3.2-bin-hadoop3/examples/src/main/python/pi.py:42, took 64.938378 s
Pi is roughly 3.133640
23/03/06 16:32:55 INFO SparkUI: Stopped Spark web UI at http://bogon:4040
23/03/06 16:32:55 INFO StandaloneSchedulerBackend: Shutting down all executors
23/03/06 16:32:55 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
23/03/06 16:32:55 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
23/03/06 16:32:55 INFO MemoryStore: MemoryStore cleared
23/03/06 16:32:55 INFO BlockManager: BlockManager stopped
23/03/06 16:32:55 INFO BlockManagerMaster: BlockManagerMaster stopped
23/03/06 16:32:55 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
23/03/06 16:32:55 INFO SparkContext: Successfully stopped SparkContext
23/03/06 16:32:56 INFO ShutdownHookManager: Shutdown hook called
23/03/06 16:32:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-087c8a21-8641-4b20-8b65-be47b77f26c5
23/03/06 16:32:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-24d1bc1a-841a-435e-8263-ad891e2aaa97/pyspark-f67ad2cb-ec86-41eb-ae7c-8fcb46e66827
23/03/06 16:32:56 INFO ShutdownHookManager: Deleting directory /tmp/spark-24d1bc1a-841a-435e-8263-ad891e2aaa97
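
For reference, the bundled example estimates pi with a Monte Carlo method. The following is a minimal sketch of the same idea, not the exact contents of examples/src/main/python/pi.py; the partition count mirrors the 1000 passed on the command line above, and the master URL is supplied by spark-submit via --master rather than hard-coded in the script.

import random
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pi-sketch").getOrCreate()

partitions = 1000            # mirrors the "1000" argument of the submit command
n = 100000 * partitions      # total number of random samples

def inside(_):
    # Sample a point in the square [-1, 1] x [-1, 1] and test whether it lies in the unit circle.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = spark.sparkContext.parallelize(range(n), partitions).map(inside).reduce(add)
print("Pi is roughly", 4.0 * count / n)

spark.stop()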

The run result can also be seen in the web UI.

The worker's execution log can also be inspected:

[root@bogon spark-3.3.2-bin-hadoop3]# tail logs/spark-root-org.apache.spark.deploy.worker.Worker-1-bogon.out 
23/03/06 16:31:48 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
23/03/06 16:31:48 INFO ExecutorRunner: Launch command: "/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.312.b07-8.1.10.lns8.loongarch64/jre/bin/java" "-cp" "/usr/local/spark/spark-3.3.2-bin-hadoop3/conf/:/usr/local/spark/spark-3.3.2-bin-hadoop3/jars/*" "-Xmx1024M" "-Dspark.driver.port=46821" "-XX:+IgnoreUnrecognizedVMOptions" "--add-opens=java.base/java.lang=ALL-UNNAMED" "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED" "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED" "--add-opens=java.base/java.io=ALL-UNNAMED" "--add-opens=java.base/java.net=ALL-UNNAMED" "--add-opens=java.base/java.nio=ALL-UNNAMED" "--add-opens=java.base/java.util=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED" "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED" "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED" "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED" "--add-opens=java.base/sun.security.action=ALL-UNNAMED" "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED" "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@bogon:46821" "--executor-id" "0" "--hostname" "10.130.0.73" "--cores" "4" "--app-id" "app-20230306163148-0000" "--worker-url" "spark://Worker@10.130.0.73:35787"
23/03/06 16:32:55 INFO Worker: Asked to kill executor app-20230306163148-0000/0
23/03/06 16:32:55 INFO ExecutorRunner: Runner thread for executor app-20230306163148-0000/0 interrupted
23/03/06 16:32:55 INFO ExecutorRunner: Killing process!
23/03/06 16:32:55 INFO Worker: Executor app-20230306163148-0000/0 finished with state KILLED exitStatus 143
23/03/06 16:32:55 INFO ExternalShuffleBlockResolver: Clean up non-shuffle and non-RDD files associated with the finished executor 0
23/03/06 16:32:55 INFO ExternalShuffleBlockResolver: Executor is not registered (appId=app-20230306163148-0000, execId=0)
23/03/06 16:32:55 INFO Worker: Cleaning up local directories for application app-20230306163148-0000
23/03/06 16:32:55 INFO ExternalShuffleBlockResolver: Application app-20230306163148-0000 removed, cleanupLocalDirs = true

Clean up the environment

./sbin/stop-worker.sh
./sbin/stop-master.sh
rm -rf /usr/local/spark
