RDD算子使用----transformation转换操作

RDD算子使用

transformation转换操作

map算子

rdd.map(p: A => B):RDD，对rdd集合中的每一个元素，都作用一次该func函数，之后返回值为生成元素构成的一个新的RDD。

val sc = new SparkContext(conf)

//map 原集合*7

val list = 1 to 7

//构建一个rdd

val listRDD:RDD[Int] = sc.parallelize(list)

//listRDD.map((num:Int) => num * 7)

//listRDD.map(num => num * 7)

val ret = listRDD.map(_ * 7)

ret.foreach(println)

flatMap算子

rdd.flatMap(p: A => 集合):RDD ==>rdd集合中的每一个元素，都要作用func函数，返回0到多个新的元素，这些新的元素共同构成一个新的RDD。

def  flatMapOps(sc:SparkContext): Unit = {

val list = List(

"jia jing kan kan kan",

"gao di di  di di",

"zhan yuan qi qi"

)

val listRDD = sc.parallelize(list)

listRDD.flatMap(line => line.split("\\s+"))

.foreach(println)

}

mapPartitions算子

mapPartitions(p: Iterator[A] =>Iterator[B])，上面的map操作，一次处理一条记录；而mapPartitions一次性处理一个partition分区中的数据。

注意：虽说mapPartitions的执行性能要高于map，但是其一次性将一个分区的数据加载到执行内存空间，如果该分区数据集比较大，存在OOM的风险。

//创建RDD并指定分区数

val array = 1 to 10

val listRDD:RDD[Int] = sc.parallelize(array,2)
//通过-将分区之间的数据连接
val result: RDD[String] = rdd.mapPartitions(x=>Iterator(x.mkString("-")))
//打印输出
println(result.collect().toBuffer)

mapPartitionsWithIndex算子

mapPartitionsWithIndex((index,p:Iterator[A]=>Iterator[B]))，该操作比mapPartitions多了一个index，代表就是后面p所对应的分区编号:rdd的分区编号，命名规范，如果有N个分区，分区编号就从0，...，N-1。

val rdd: RDD[Int] = sc.parallelize(1 to 16,4)
//查看每个分区当中都保存了哪些数据
val result: RDD[String] = rdd.mapPartitionsWithIndex((index,item)=>Iterator(index+":"+item.mkString(",")))
//打印输出
result.foreach(println)

sample算子

1)sample(withReplacement, fraction, seed):随机抽样算子，sample主要工作就是为了来研究数据本身，去代替全量研究会出现类似数据倾斜(dataSkew)等问题，无法进行全量研究，只能用样本去评估整体。

2)withReplacement：Boolean：有放回的抽样和无放回的抽样。

3)fraction：Double：样本空间占整体数据量的比例，大小在[0,1]。

4)seed：Long：是一个随机数的种子，有默认值，通常不需要传参。

需要说明一点的是，这个抽样是一个不准确的抽样，抽取的结果数可能在准确的结果上下浮动。

def sampleOps(sc: SparkContext): Unit = {

val list = sc.parallelize(1 to 100000)

val sampled1 = list.sample(true, 0.01)

println("sampled1 count: " + sampled1.count())

val sampled2 = list.sample(false, 0.01)

println("sampled2 count: " + sampled2.count())

}

union算子

rdd1.union(rdd2)相当于sql中的union all，进行两个rdd数据间的联合，需要说明一点是，该union是一个窄依赖操作，rdd1如果有N个分区，rdd2有M个分区，那么union之后的分区个数就为N+M。

val rdd1 = sc.parallelize(1 to 5,3)
//查看每个分区当中都保存了哪些数据
rdd1.mapPartitionsWithIndex((index,item)=>Iterator(index+":"+item.mkString(","))).foreach(println)
val rdd2: RDD[Int] = sc.parallelize(5 to 7,2)
//查看每个分区当中都保存了哪些数据
rdd2.mapPartitionsWithIndex((index,item)=>Iterator(index+":"+item.mkString(","))).foreach(println)
//union整合
val result: RDD[Int] = rdd1.union(rdd2)
//获取数据
println(result.collect().toBuffer)
//查看每个分区当中都保存了哪些数据
result.mapPartitionsWithIndex((index,item)=>Iterator(index+":"+item.mkString(","))).foreach(println)
//获取分区数
println(result.getNumPartitions)

join算子

1)rdd1.join(rdd2) 相当于sql中的join连接操作

A(id) a, B(aid) b

select * from A a join B b on a.id = b.aid

2)交叉连接: across join

select * from A a across join B ====>这回产生笛卡尔积。

3)内连接： inner join，提取左右两张表中的交集

select * from A a inner join B on a.id = b.aid 或者

select * from A a, B b where a.id = b.aid

4)外连接：outer join

5)左外连接：left outer join 返回左表所有，右表匹配返回，匹配不上返回null

select * from A a left outer join B on a.id = b.aid

6)右外连接：right outer join 刚好是左外连接的相反

select * from A a left outer join B on a.id = b.aid

7)全连接：full join

8)全外连接：

full outer join = left outer join + right outer join

前提：要先进行join，rdd的类型必须是K-V。

对join操作可以归纳为如图-22所示。

图-22 sql 各种join连接图

def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("demo").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val rdd1: RDD[(Int, String)] = sc.parallelize(Array((1,"a"),(2,"b"),(3,"c")))
val rdd2: RDD[(Int, Int)] = sc.parallelize(Array((1,4),(2,5),(5,6)))
//join操作
val result: RDD[(Int, (String, Int))] = rdd1.join(rdd2)
//打印输出
println(result.collect().toBuffer)
//leftOutJoin操作
val result1: RDD[(Int, (String, Option[Int]))] = rdd1.leftOuterJoin(rdd2)
//打印输出
println(result1.collect().toBuffer)
//rightOuterJoin
val result2: RDD[(Int, (Option[String], Int))] = rdd1.rightOuterJoin(rdd2)
//打印输出
println(result2.collect().toBuffer)
//fullOuterJoin
val result3: RDD[(Int, (Option[String], Option[Int]))] = rdd1.fullOuterJoin(rdd2)
//打印输出
println(result3.collect().toBuffer)
}

coalesce算子

1)coalesce(numPartition, shuffle=false):分区合并的意思。

2)numPartition:分区后的分区个数。

3)shuffle：此次重分区是否开启shuffle，决定当前的操作是宽(true)依赖还是窄(false)依赖。

def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("wc").setMaster("local[*]")
val sc = new SparkContext(conf)
sc.setLogLevel("WARN")
val rdd = sc.parallelize(1 to 16,4)
//获取分区数
println(rdd.getNumPartitions)
//查看每个分区当中保存了哪些数据
rdd.mapPartitionsWithIndex((index,item)=>Iterator(index+":"+item.mkString(","))).foreach(println)
//缩减分区
val rdd1: RDD[Int] = rdd.coalesce(3)
//获取分区数
println(rdd1.getNumPartitions)
//查看每个分区当中保存的数据变化
rdd1.mapPartitionsWithIndex((index,item)=>Iterator(index+":"+item.mkString(","))).foreach(println)
//释放资源

sc.stop()
}

repartition(numPartitions)算子

根据分区数，从新通过网络随机洗牌所有数据。

def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("wc").setMaster("local[*]")
val sc = new SparkContext(conf)

//设置控制台日志级别
sc.setLogLevel("WARN")
val rdd = sc.parallelize(1 to 16,4)
//获取分区数
println(rdd.getNumPartitions)
//查看每个分区当中保存了哪些数据
rdd.mapPartitionsWithIndex((index,item)=>Iterator(index+":"+item.mkString(","))).foreach(println)
//缩减分区
val rdd1: RDD[Int] = rdd.repartition(3)
//获取分区数
println(rdd1.getNumPartitions)
//查看每个分区当中保存的数据变化

rdd1.mapPartitionsWithIndex((index,item)=>Iterator(index+":"+item.mkString(","))).foreach(println)
//释放资源
sc.stop()
}

sortBy(func,[ascending], [numTasks])算子

用func先对数据进行处理，按照处理后的数据比较结果排序。

def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("demo").setMaster("local[*]")
val sc = new SparkContext(conf)

//设置控制台日志输出级别
sc.setLogLevel("WARN")

//加载数据
val rdd= sc.parallelize(List(("a",4),("c",2),("b",1)))

//默认是升序排序，指定false，转为倒叙输出
val rdd1: RDD[(String, Int)] = rdd.sortBy(_._2,false)

//收集结果，返回数组输出
println(rdd1.collect().toBuffer)
}

sortByKey([ascending],[numTasks])算子

在一个(K,V)的RDD上调用，K必须实现Ordered接口，返回一个按照key进行排序的(K,V)的RDD。

def main(args: Array[String]): Unit = {
val conf: SparkConf = new SparkConf().setAppName("demo").setMaster("local[*]")
val sc = new SparkContext(conf)

//设置控制台日志输出级别
sc.setLogLevel("WARN")

//加载数据
val rdd = sc.parallelize(Array((3,"aa"),(6,"cc"),(2,"bb"),(1,"dd")))

//默认按照key进行升序输出，加false，转为倒叙输出
val result: RDD[(Int, String)] = rdd.sortByKey(false)

//收集结果，返回数组输出
println(result.collect().toBuffer)
}

groupByKey算子

groupByKey(numPartition):[K, Iterable[V]]

按照key来进行分组，numPartition指的是分组之后的分区个数。

这是一个宽依赖操作，但是需要注意一点的是，groupByKey相比较reduceByKey而言，没有本地预聚合操作，显然其效率并没有reduceByKey效率高，在使用的时候如果可以，尽量使用reduceByKey等去代替groupByKey。

def groupByOps(sc: SparkContext): Unit = {

case class Student(id: Int, name:String, province: String)

val stuRDD = sc.parallelize(List(

Student(1, "张三", "安徽"),

Student(2, "李梦", "山东"),

Student(3, "王五", "甘肃"),

Student(4, "周七", "甘肃"),

Student(5, "Lucy", "黑吉辽"),

Student(10086, "魏八", "黑吉辽")

))

//val province2Infos: RDD[(String, Iterable[Student]] = stuRDD.groupBy(stu => stu.province)

val province2Infos: RDD[(String, Iterable[Student])] = stuRDD.map(stu => (stu.province, stu)).groupByKey()



province2Infos.foreach{case (province, stus) => {

println(s"省份：${province}, 学生信息：${stus.mkString(", ")}, 人数：${stus.size}")

}}

}

}

reduceByKey算子

reduceByKey((A1, A2) => A3)

前提不是对全量的数据集进行reduce操作，而是对每一个key所对应的所有的value进行reduce操作。

def reduceByKeyOps(sc: SparkContext): Unit = {

case class Student(id: Int, name:String, province: String)

val stuRDD = sc.parallelize(List(

Student(1, "张三", "安徽"),

Student(2, "李梦", "山东"),

Student(3, "王五", "甘肃"),

Student(4, "周七", "甘肃"),

Student(5, "Lucy", "黑吉辽"),

Student(10086, "魏八", "黑吉辽")

))

val ret = stuRDD.map(stu => (stu.province, 1)).reduceByKey((v1, v2) => v1 + v2)

ret.foreach{case (province, count) => {

println(s"province: ${province}, count: ${count}")

}}

}

foldByKey算子

foldByKey(zeroValue)((A1, A2) => A3)，其作用和reduceByKey一样，唯一的区别就是zeroValue初始化值不一样，相当于在scala集合操作中的reduce和fold的区别。

def foldByKeyOps(sc: SparkContext): Unit = {

case class Student(id: Int, name:String, province: String)

val stuRDD = sc.parallelize(List(

Student(1, "张三", "安徽"),

Student(3, "王五", "甘肃"),

Student(5, "Lucy", "黑吉辽"),

Student(2, "李梦", "山东"),

Student(4, "周七", "甘肃"),

Student(10086, "魏八", "黑吉辽")

), 2).mapPartitionsWithIndex((index, partition) => {

val list = partition.toList

println(s"-->stuRDD的分区编号为<${index}>中的数据为：${list.mkString("[", ", ", "]")}")

list.toIterator

})

val ret = stuRDD.map(stu => (stu.province, 1)).foldByKey(0)((v1, v2) => v1 + v2)

ret.foreach{case (province, count) => {

println(s"province: ${province}, count: ${count}")

}}

}

aggregateByKey算子

combineByKey和aggregateByKey的区别就相当于reduceByKey和foldByKey。

def abk2rbk(sc: SparkContext): Unit = {

val array = sc.parallelize(Array(

"hello you",

"hello me",

"hello you",

"hello you",

"hello me",

"hello you"

), 2)

val pairs = array.flatMap(_.split("\\s+")).map((_, 1))

val ret = pairs.aggregateByKey(0)(_+_, _+_)

ret.foreach{case (key, count) => {

println(s"key: ${key}, count: ${count}")

}}

}