Spark3.3.0的DataFrame及Spark SQL编程的性能对比【单机模式下】
前言
Spark3.3.0较老早的2.4.0有极大的性能优化,尤其是对SQL做了大量的优化【数据倾斜等】,恰好近期遇到一些性能问题,特意写个Demo测试下DataFrame和Spark SQL在获取到相同结果时的性能表现。
机器配置
参照:https://lizhiyong.blog.csdn.net/article/details/127827522
主板:x99f8d
CPU:e5 2696v3 *2 【36核72线程】
内存条:DDR4 ECC 32G *8 【256G】
显卡:RTX A4000 【16G显存】
散热器:ta 120ex *2 【单风扇】
SSD:MX500 2T *2 【4T】
HDD:NAS拆的酷狼 4T
电源:GX1000 【1000W】
机箱:614PC 【标配2风扇+套装3风扇】
显示器:27寸 【4K】
键鼠:笔记本淘汰的一套 【USB口】
网卡:主板自带的网卡
比对原理
模拟SQL Boy们最喜欢的join操作,构造2个1e数据量的数据集。简单起见,主键id就是0->1e,字段内容就是0->1w,做inner join【其实主键相同的时候,各种join,比如 left join 、right join 、full join 都没啥关系】获取到新数据集【这么做是故意有shuffle操作】,再对宽表做遍历,就可以涵盖数仓开发中最常见的计算。
通过记录的时间戳,即可判断出性能优劣【毫无疑问,耗时短的性能更好】。
性能比对
Scala代码
package day20221215
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}
import org.slf4j.Logger
object SparkMock100wScala {
case class Data1(id: String, comment1: String)
def main(args: Array[String]): Unit = {
val spark: SparkSession = SparkSession.builder()
.appName("SparkMock100wScala")
.master("local[*]")
.enableHiveSupport()
.getOrCreate()
import org.apache.spark.sql._
import spark.implicits._
val logger: Logger = org.slf4j.LoggerFactory.getLogger(this.getClass)
logger.info("Spark构建完毕")
val num: Long = 10000 * 10000
val ds1: Dataset[Data1] = spark
.range(num)
.map(row => { //生成id 0->10w的DataSet
Data1(row.toString(), String.valueOf(Math.random() * 10000).split("\\.")(0)) //构造随机数,转case class
}
)
val ds2: Dataset[Data1] = spark
.range(num)
.map(row => { //生成id 0->10w的DataSet
Data1(row.toString(), String.valueOf(Math.random() * 10000).split("\\.")(0)) //构造随机数,转case class
}
)
// val structType: StructType = StructType(List(
// StructField("name", DataTypes.StringType),
// StructField("age", DataTypes.StringType)
// ))
//spark.createDataFrame(ds1.toDF().rdd,structType)
val df1: DataFrame = ds1.toDF()
var df2: DataFrame = ds2.toDF()
df2 = df2.withColumn("comment2", df2.col("comment1")).drop("comment1")
df1.cache()
df2.cache()
//df1.show(15)
//df2.show(15)
val df_startTime: Long = System.currentTimeMillis()
val df3: DataFrame = df1.join(df2, Seq("id"), "inner") //根据100000000个id主键去inner join
//df3.toDF().show(15)
val count1: Long = df3.filter("comment1>comment2").count()
val count2: Long = df3.filter("comment1<comment2").count()
val count3: Long = df3.filter("comment1=comment2").count()
val df_stopTime: Long = System.currentTimeMillis()
val count4: Long = count1 + count2 + count3
val df_time: Long = df_stopTime - df_startTime
println("comment1>comment2的个数:" + count1)
println("comment1<comment2的个数:" + count2)
println("comment1=comment2的个数:" + count3)
println("总数:" + count4)
println("DataFrame耗时:" + df_time)
logger.info("DataFrame运算完毕,耗时:" + df_time)
logger.error("暂停10s")
df1.createTempView("tmp1")
df2.createTempView("tmp2")
Thread.sleep(10 * 1000)
val sql1_startTime: Long = System.currentTimeMillis()
val sql1: String =
"""
select
col1,
col2,
col3,
col1+col2+col3 as col4
from(
select
sum(case when comment1>comment2 then 1 else 0 end) as col1,
sum(case when comment1<comment2 then 1 else 0 end) as col2,
sum(case when comment1=comment2 then 1 else 0 end) as col3
from(
select
t1.id,
t1.comment1,
t2.comment2
from
tmp1 t1
inner join
tmp2 t2
on t1.id=t2.id
) t3
)t4
;
"""
spark.sql(sql1).show()
val sql1_stopTime: Long = System.currentTimeMillis()
val sql1_time: Long = sql1_stopTime - sql1_startTime
println("Spark SQL1耗时:" + sql1_time)
Thread.sleep(10 * 1000)
val sql2_startTime: Long = System.currentTimeMillis()
val sql2: String =
"""
select
col1,
col2,
col3,
col1+col2+col3 as col4
from(
select
sum(col1) as col1,
sum(col2) as col2,
sum(col3) as col3
from(
select
count(1) as col1,
0 as col2,
0 as col3
from
tmp1 t1
inner join
tmp2 t2
on
t1.id=t2.id
where
t1.comment1>t2.comment2
union all
select
0 as col1,
count(1) as col2,
0 as col3
from
tmp1 t1
inner join
tmp2 t2
on
t1.id=t2.id
where
t1.comment1<t2.comment2
union all
select
0 as col1,
0 as col2,
count(1) as col3
from
tmp1 t1
inner join
tmp2 t2
on
t1.id=t2.id
where
t1.comment1=t2.comment2
) t3
) t4
;
"""
spark.sql(sql2).show()
val sql2_stopTime: Long = System.currentTimeMillis()
val sql2_time: Long = sql2_stopTime - sql2_startTime
println("Spark SQL2耗时:" + sql2_time)
logger.error("暂停1000s")
Thread.sleep(1000 * 1000)
spark.close()
}
}
整体还是相当简单的。而2句SQL也模拟了资深SQL Boy和肤浅的SQL Boy们常用的写法。
数据样例
//df1.show(15)
//df2.show(15)
/** 每次都会变,但是2个DataFrame不同
* +---+--------+
* | id|comment1|
* +---+--------+
* | 0| 5452|
* | 1| 6119|
* | 2| 11|
* | 3| 115|
* | 4| 9458|
* | 5| 9101|
* | 6| 1448|
* | 7| 9956|
* | 8| 3214|
* | 9| 3464|
* | 10| 9558|
* | 11| 1278|
* | 12| 1922|
* | 13| 6252|
* | 14| 1360|
* +---+--------+
* only showing top 15 rows
* */
很容易理解,这2个每次都会变化并且值一定不同的DataFrame就是长这样。
//df3.toDF().show(15)
// +---+--------+--------+
// | id|comment1|comment2|
// +---+--------+--------+
// | 0| 995| 9804|
// | 1| 2763| 2753|
// | 2| 1117| 1076|
// | 3| 2024| 4121|
// | 4| 975| 4454|
// | 5| 4556| 593|
// | 6| 7424| 1260|
// | 7| 346| 8986|
// | 8| 9750| 7065|
// | 9| 1323| 215|
// | 10| 5932| 1002|
// | 11| 7919| 8089|
// | 12| 1397| 16|
// | 13| 5119| 8322|
// | 14| 9728| 7019|
// +---+--------+--------+
// only showing top 15 rows
做完Join拉宽后长这样,也没啥好奇怪的。
运算过程
本来打算玩玩100w,发现数据量还是太少。故选择了1e体量。当数据量比较夸张且不能全部加载到内存时,就一定会有shuffle的情况。可以发现SSD跑Shuffle性能比HDD好很多。
Web UI
http://127.0.0.1:4040/
任务相当简单,其实也没太多好看的内容。
调整CPU核数
调整可调用的core时,CPU占用率会有些许改变:
也会影响到Shuffle。
Task情况
由于是串行,Catalyst优化器会做一些优化,可以看到部分Task被优化了。。。
Shuffle情况
可以看出Shuffle的数据量很大。
GC情况
可以看到默认情况下,Win10环境,local模式的Spark总共只分配了16G内存,其中3.7G用于存储【由于cache算子的action操作,其实都是堆内存】,峰值11.5G内存用于运算。Core也是双路e5 2696v3的36核,没啥子毛病。RDD块儿72个,正好是线程数,也没啥毛病。
峰值情况
在最后一个运算【也就是sql2】时,硬盘IO暴增。
比对结果
第一轮比对
第一轮比对,采用local[4],测试结果如下:
comment1>comment2的个数:49999895
comment1<comment2的个数:49990073
comment1=comment2的个数:10032
总数:100000000
DataFrame耗时:121412
22/12/15 23:48:49 INFO SparkMock100wScala$: DataFrame运算完毕,耗时:121412
22/12/15 23:48:49 ERROR SparkMock100wScala$: 暂停10s
+--------+--------+-----+---------+
| col1| col2| col3| col4|
+--------+--------+-----+---------+
|49999895|49990073|10032|100000000|
+--------+--------+-----+---------+
Spark SQL1耗时:20726
+--------+--------+-----+---------+
| col1| col2| col3| col4|
+--------+--------+-----+---------+
|49999895|49990073|10032|100000000|
+--------+--------+-----+---------+
Spark SQL2耗时:53207
22/12/15 23:50:23 ERROR SparkMock100wScala$: 暂停1000s
由于cache算子的action操作,保证了Scala的DataFrame算子操作和2个Spark SQL运算的DataFrame一致。可以看到获取到的结果也是一致的。DataFrame运行了121s,sql1运行了20s,sql2运行了53s。
第二轮比对
第二轮比对,采用local[*],测试结果如下:
comment1>comment2的个数:49991501
comment1<comment2的个数:49998452
comment1=comment2的个数:10047
总数:100000000
DataFrame耗时:132845
22/12/16 00:03:16 INFO SparkMock100wScala$: DataFrame运算完毕,耗时:132845
22/12/16 00:03:16 ERROR SparkMock100wScala$: 暂停10s
+--------+--------+-----+---------+
| col1| col2| col3| col4|
+--------+--------+-----+---------+
|49991501|49998452|10047|100000000|
+--------+--------+-----+---------+
Spark SQL1耗时:20541
+--------+--------+-----+---------+
| col1| col2| col3| col4|
+--------+--------+-----+---------+
|49991501|49998452|10047|100000000|
+--------+--------+-----+---------+
Spark SQL2耗时:52992
22/12/16 00:04:50 ERROR SparkMock100wScala$: 暂停1000s
可以看到,虽然分配的CPU算力更多了,但是可用RAM还是默认的16G,瓶颈就在硬盘IO上,故结果与第一轮一致。
第三轮比对
从Web UI看到存在Task有skip的情况【被优化掉了】。所以上述情况并不是一定十分准确。当屏蔽第一部分的DataFrame运算后,sql1就被打回原形了:
+--------+--------+-----+---------+
| col1| col2| col3| col4|
+--------+--------+-----+---------+
|49998005|49991893|10102|100000000|
+--------+--------+-----+---------+
Spark SQL1耗时:96027
+--------+--------+-----+---------+
| col1| col2| col3| col4|
+--------+--------+-----+---------+
|49998005|49991893|10102|100000000|
+--------+--------+-----+---------+
Spark SQL2耗时:54983
22/12/16 00:19:46 ERROR SparkMock100wScala$: 暂停1000s
可以看出sql1其实需要96s才能跑完,从结果来看比DataFrame运算有优势。但是DSL方式其实提交了3个运算任务,而sql1只有1个。
从另一方面,可以看出资深SQL Boy写的Spark SQL任务在Spark3.3.0Catalyst优化器加持下,性能有可能超越肤浅的程序猿。Spark3.3.0性能较之前强太多了。
sql2还是需要54s才能跑完,但是这个不一定准确。
第四轮对比
+--------+--------+----+---------+
| col1| col2|col3| col4|
+--------+--------+----+---------+
|49995581|49994453|9966|100000000|
+--------+--------+----+---------+
Spark SQL2耗时:114075
22/12/16 00:22:39 ERROR SparkMock100wScala$: 暂停1000s
单独运算sql2,可以看到耗时需要114s,和DataFrame运算相比都半斤八两。比起sql1的96s还是要慢一些。可以看出同样是SQL Boy,资深有资深的道理。
Log
22/12/16 00:01:44 INFO TaskSetManager: Starting task 31.0 in stage 1.0 (TID 67) (DESKTOP-VRV0NDO, executor driver, partition 31, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map()
22/12/16 00:01:44 INFO TaskSetManager: Finished task 20.0 in stage 0.0 (TID 20) in 38681 ms on DESKTOP-VRV0NDO (executor driver) (32/36)
22/12/16 00:01:44 INFO Executor: Running task 31.0 in stage 1.0 (TID 67)
22/12/16 00:01:44 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 2242 bytes result sent to driver
22/12/16 00:01:44 INFO TaskSetManager: Starting task 32.0 in stage 1.0 (TID 68) (DESKTOP-VRV0NDO, executor driver, partition 32, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map()
22/12/16 00:01:44 INFO Executor: Running task 32.0 in stage 1.0 (TID 68)
22/12/16 00:01:44 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 38763 ms on DESKTOP-VRV0NDO (executor driver) (33/36)
22/12/16 00:01:44 INFO Executor: Finished task 4.0 in stage 0.0 (TID 4). 2242 bytes result sent to driver
22/12/16 00:01:44 INFO TaskSetManager: Starting task 33.0 in stage 1.0 (TID 69) (DESKTOP-VRV0NDO, executor driver, partition 33, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map()
22/12/16 00:01:44 INFO Executor: Running task 33.0 in stage 1.0 (TID 69)
22/12/16 00:01:44 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 39418 ms on DESKTOP-VRV0NDO (executor driver) (34/36)
22/12/16 00:01:44 INFO Executor: Finished task 22.0 in stage 0.0 (TID 22). 2242 bytes result sent to driver
22/12/16 00:01:44 INFO TaskSetManager: Starting task 34.0 in stage 1.0 (TID 70) (DESKTOP-VRV0NDO, executor driver, partition 34, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map()
22/12/16 00:01:44 INFO Executor: Running task 34.0 in stage 1.0 (TID 70)
22/12/16 00:01:44 INFO TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22) in 39511 ms on DESKTOP-VRV0NDO (executor driver) (35/36)
22/12/16 00:01:45 INFO Executor: Finished task 3.0 in stage 0.0 (TID 3). 2242 bytes result sent to driver
22/12/16 00:01:45 INFO TaskSetManager: Starting task 35.0 in stage 1.0 (TID 71) (DESKTOP-VRV0NDO, executor driver, partition 35, PROCESS_LOCAL, 4567 bytes) taskResourceAssignments Map()
22/12/16 00:01:45 INFO Executor: Running task 35.0 in stage 1.0 (TID 71)
22/12/16 00:01:45 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 39536 ms on DESKTOP-VRV0NDO (executor driver) (36/36)
22/12/16 00:01:45 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/12/16 00:01:45 INFO DAGScheduler: ShuffleMapStage 0 (count at SparkMock100wScala.scala:109) finished in 39.893 s
22/12/16 00:01:45 INFO DAGScheduler: looking for newly runnable stages
22/12/16 00:01:45 INFO DAGScheduler: running: Set(ShuffleMapStage 1)
22/12/16 00:01:45 INFO DAGScheduler: waiting: Set()
22/12/16 00:01:45 INFO DAGScheduler: failed: Set()
22/12/16 00:01:58 INFO MemoryStore: Block rdd_14_0 stored as values in memory (estimated size 49.1 MiB, free 13.9 GiB)
22/12/16 00:01:58 INFO BlockManagerInfo: Added rdd_14_0 in memory on DESKTOP-VRV0NDO:60638 (size: 49.1 MiB, free: 13.9 GiB)
22/12/16 00:01:59 INFO MemoryStore: Block rdd_14_2 stored as values in memory (estimated size 50.2 MiB, free 13.9 GiB)
22/12/16 00:01:59 INFO BlockManagerInfo: Added rdd_14_2 in memory on DESKTOP-VRV0NDO:60638 (size: 50.2 MiB, free: 13.9 GiB)
22/12/16 00:02:00 INFO MemoryStore: Block rdd_14_5 stored as values in memory (estimated size 52.8 MiB, free 13.8 GiB)
22/12/16 00:02:00 INFO BlockManagerInfo: Added rdd_14_5 in memory on DESKTOP-VRV0NDO:60638 (size: 52.8 MiB, free: 13.8 GiB)
22/12/16 00:02:00 INFO MemoryStore: Block rdd_14_4 stored as values in memory (estimated size 52.8 MiB, free 13.8 GiB)
22/12/16 00:02:00 INFO BlockManagerInfo: Added rdd_14_4 in memory on DESKTOP-VRV0NDO:60638 (size: 52.8 MiB, free: 13.8 GiB)
22/12/16 00:02:00 INFO MemoryStore: Block rdd_14_6 stored as values in memory (estimated size 52.8 MiB, free 13.7 GiB)
22/12/16 00:02:00 INFO BlockManagerInfo: Added rdd_14_6 in memory on DESKTOP-VRV0NDO:60638 (size: 52.8 MiB, free: 13.7 GiB)
22/12/16 00:02:00 INFO MemoryStore: Block rdd_14_1 stored as values in memory (estimated size 50.2 MiB, free 13.7 GiB)
22/12/16 00:02:00 INFO BlockManagerInfo: Added rdd_14_1 in memory on DESKTOP-VRV0NDO:60638 (size: 50.2 MiB, free: 13.7 GiB)
22/12/16 00:02:01 INFO MemoryStore: Block rdd_14_3 stored as values in memory (estimated size 51.2 MiB, free 13.6 GiB)
22/12/16 00:02:01 INFO BlockManagerInfo: Added rdd_14_3 in memory on DESKTOP-VRV0NDO:60638 (size: 51.2 MiB, free: 13.6 GiB)
22/12/16 00:02:01 INFO MemoryStore: Block rdd_14_8 stored as values in memory (estimated size 52.8 MiB, free 13.6 GiB)
22/12/16 00:02:01 INFO BlockManagerInfo: Added rdd_14_8 in memory on DESKTOP-VRV0NDO:60638 (size: 52.8 MiB, free: 13.6 GiB)
22/12/16 00:02:01 INFO MemoryStore: Block rdd_14_7 stored as values in memory (estimated size 52.8 MiB, free 13.5 GiB)
22/12/16 00:02:01 INFO BlockManagerInfo: Added rdd_14_7 in memory on DESKTOP-VRV0NDO:60638 (size: 52.8 MiB, free: 13.5 GiB)
22/12/16 00:02:01 INFO Executor: Finished task 0.0 in stage 1.0 (TID 36). 2242 bytes result sent to driver
22/12/16 00:02:02 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 36) in 24227 ms on DESKTOP-VRV0NDO (executor driver) (1/36)
22/12/16 00:02:02 INFO MemoryStore: Block rdd_14_9 stored as values in memory (estimated size 52.8 MiB, free 13.5 GiB)
22/12/16 00:02:02 INFO BlockManagerInfo: Added rdd_14_9 in memory on DESKTOP-VRV0NDO:60638 (size: 52.8 MiB, free: 13.5 GiB)
22/12/16 00:02:02 INFO MemoryStore: Block rdd_14_11 stored as values in memory (estimated size 52.8 MiB, free 13.4 GiB)
22/12/16 00:02:02 INFO BlockManagerInfo: Added rdd_14_11 in memory on DESKTOP-VRV0NDO:60638 (size: 52.8 MiB, free: 13.4 GiB)
22/12/16 00:02:03 INFO Executor: Finished task 2.0 in stage 1.0 (TID 38). 2242 bytes result sent to driver
22/12/16 00:02:03 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 38) in 25452 ms on DESKTOP-VRV0NDO (executor driver) (2/36)
22/12/16 00:02:03 INFO MemoryStore: Block rdd_14_10 stored as values in memory (estimated size 52.8 MiB, free 13.4 GiB)
22/12/16 00:02:03 INFO BlockManagerInfo: Added rdd_14_10 in memory on DESKTOP-VRV0NDO:60638 (size: 52.8 MiB, free: 13.4 GiB)
22/12/16 00:02:03 INFO Executor: Finished task 5.0 in stage 1.0 (TID 41). 2242 bytes result sent to driver
22/12/16 00:02:03 INFO TaskSetManager: Finished task 5.0 in stage 1.0 (TID 41) in 25521 ms on DESKTOP-VRV0NDO (executor driver) (3/36)
22/12/16 00:02:03 INFO Executor: Finished task 4.0 in stage 1.0 (TID 40). 2199 bytes result sent to driver
22/12/16 00:02:03 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 40) in 26015 ms on DESKTOP-VRV0NDO (executor driver) (4/36)
22/12/16 00:02:03 INFO Executor: Finished task 6.0 in stage 1.0 (TID 42). 2199 bytes result sent to driver
日志大多没啥子用。看Web UI更舒服一点。也就是报错时需要看看Error的Log来定位问题。
结论
通过简单的比对,可以发现:
1,Spark3.3.0中Spark SQL和DataFrame性能已经没有明显差距,甚至SQL写得好时性能还可能超越DataFrame【Spark3.0之前不一定适用】
2,资深SQL Boy写的SQL性能比肤浅的SQL Boy写的SQL性能好,这句废话意在强调平时就需要根据Map和Reduce原理分析SQL该怎么写性能才更好
3,当内存不足时,增加CPU算力并不能显著提升性能,尤其是这类SQL Boy们最喜欢的有Join【或者Group by】的情况,瓶颈是Shuffle,Shuffle的性能瓶颈又是硬盘IO【完全分布式集群可能还存在跨VPC、跨地域、跨机架等情况,性能瓶颈也可能是交换机带宽】
所以,任务跑得慢,单纯优化HQL或者Spark SQL肯定是不够的,该+机器还得+,但是+机器也要先分清楚任务类型【计算密集型、内存密集型】,不能盲目+,否则效果不明显。
不得不说,RSS【Remote Shuffle Service】是个好东西。例如:
Alluxio:https://www.alluxio.com.cn/
案例:https://www.alluxio.com.cn/oppo-use-case-shuttle-and-alluxio/
还有个阿里爸爸最近随着Flink1.16.0一起开源的Celeborn也是个好东西。
Celeborn:http://celeborn.incubator.apache.org/
Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines. RSS provides an elastic and high efficient management service for shuffle data.
The current stable version is 0.1.4
转载请注明出处:https://lizhiyong.blog.csdn.net/article/details/128337608