欢迎访问 生活随笔!

生活随笔

当前位置: 首页 > 编程资源 > 编程问答 >内容正文

编程问答

Spark学习之Spark RDD算子

发布时间:2025/6/17 编程问答 45 豆豆
生活随笔 收集整理的这篇文章主要介绍了 Spark学习之Spark RDD算子 小编觉得挺不错的,现在分享给大家,帮大家做个参考.

个人主页zicesun.com

这里,从源码的角度总结一下Spark RDD算子的用法。

单值型Transformation算子

map

/*** Return a new RDD by applying a function to all elements of this RDD.*/def map[U: ClassTag](f: T => U): RDD[U] = withScope {val cleanF = sc.clean(f)new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.map(cleanF))} 复制代码

源码中有一个 sc.clean() 函数,它的所用是去除闭包中不能序列话的外部引用变量。Scala支持闭包,闭包会把它对外的引用(闭包里面引用了闭包外面的对像)保存到自己内部,这个闭包就可以被单独使用了,而不用担心它脱离了当前的作用域;但是在spark这种分布式环境里,这种作法会带来问题,如果对外部的引用是不可serializable的,它就不能正确被发送到worker节点上去了;还有一些引用,可能根本没有用到,这些没有使用到的引用是不需要被发到worker上的; 实际上sc.clean函数调用的是ClosureCleaner.clean();ClosureCleaner.clean()通过递归遍历闭包里面的引用,检查不能serializable的, 去除unused的引用;

map函数是一个粗粒度的操作,对于一个RDD来说,会使用迭代器对分区进行遍历,然后针对一个分区使用你想要执行的操作f, 然后返回一个新的RDD。其实可以理解为rdd的每一个元素都会执行同样的操作。

scala> val array = Array(1,2,3,4,5,6) array: Array[Int] = Array(1, 2, 3, 4, 5, 6) scala> val rdd = sc.app appName applicationAttemptId applicationId scala> val rdd = sc.parallelize(array, 2) rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:26 scala> val mapRdd = rdd.map(x => x * 2) mapRdd: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[1] at map at <console>:25 scala> mapRdd.collect().foreach(println) 2 4 6 8 10 12 复制代码

flatMap

flatMap方法与map方法类似,但是允许一次map方法中输出多个对象,而不是map中的一个对象经过函数转换成另一个对象。

/*** Return a new RDD by first applying a function to all elements of this* RDD, and then flattening the results.*/def flatMap[U: ClassTag](f: T => TraversableOnce[U]): RDD[U] = withScope {val cleanF = sc.clean(f)new MapPartitionsRDD[U, T](this, (context, pid, iter) => iter.flatMap(cleanF))} 复制代码scala> val a = sc.parallelize(1 to 10, 5) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24 scala> a.flatMap(num => 1 to num).collect res1: Array[Int] = Array(1, 1, 2, 1, 2, 3, 1, 2, 3, 4, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10)复制代码

mapPartitions

mapPartitions是map的另一个实现,map的输入函数应用与RDD的每个元素,但是mapPartitions的输入函数作用于每个分区,也就是每个分区的内容作为整体。

/*** Return a new RDD by applying a function to each partition of this RDD.** `preservesPartitioning` indicates whether the input function preserves the partitioner, which* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.*/def mapPartitions[U: ClassTag](f: Iterator[T] => Iterator[U],preservesPartitioning: Boolean = false): RDD[U] = withScope {val cleanedF = sc.clean(f)new MapPartitionsRDD(this,(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(iter),preservesPartitioning)} 复制代码scala> def myfunc[T](iter: Iterator[T]):Iterator[(T,T)]={ | var res = List[(T,T)]()| var pre = iter.next| while(iter.hasNext){| var cur = iter.next| res .::= (pre, cur)| pre = cur| }| res.iterator| } myfunc: [T](iter: Iterator[T])Iterator[(T, T)] scala> val a = sc.parallelize(1 to 9,3) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:24 scala> a.mapPartitions mapPartitions mapPartitionsWithIndex scala> a.mapPartitions(myfunc).collect res0: Array[(Int, Int)] = Array((2,3), (1,2), (5,6), (4,5), (8,9), (7,8)) 复制代码

mapPartitionWithIndex

mapPartitionWithIndex方法与mapPartitions方法类似,不同的是mapPartitionWithIndex会对原始分区的索引进行追踪,这样就可以知道分区所对应的元素,方法的参数为一个函数,函数的输入为整型索引和迭代器。

/*** Return a new RDD by applying a function to each partition of this RDD, while tracking the index* of the original partition.** `preservesPartitioning` indicates whether the input function preserves the partitioner, which* should be `false` unless this is a pair RDD and the input function doesn't modify the keys.*/def mapPartitionsWithIndex[U: ClassTag](f: (Int, Iterator[T]) => Iterator[U],preservesPartitioning: Boolean = false): RDD[U] = withScope {val cleanedF = sc.clean(f)new MapPartitionsRDD(this,(context: TaskContext, index: Int, iter: Iterator[T]) => cleanedF(index, iter),preservesPartitioning)} 复制代码scala> val x = sc.parallelize(1 to 10, 3) x: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24 scala> def myFunc(index:Int, iter:Iterator[Int]):Iterator[String]={| iter.toList.map(x => index + "," + x).iterator| } myFunc: (index: Int, iter: Iterator[Int])Iterator[String] scala> x.mapPartitions mapPartitions mapPartitionsWithIndex scala> x.mapPartitionsWithIndex(myFunc).collect res1: Array[String] = Array(0,1, 0,2, 0,3, 1,4, 1,5, 1,6, 2,7, 2,8, 2,9, 2,10) 复制代码

foreach

foreach主要对每一个输入的数据对象执行循环操作,可以用来执行对RDD元素的输出操作。

/*** Applies a function f to all elements of this RDD.*/def foreach(f: T => Unit): Unit = withScope {val cleanF = sc.clean(f)sc.runJob(this, (iter: Iterator[T]) => iter.foreach(cleanF))} 复制代码scala> var x = sc.parallelize(List(1 to 9), 3) x: org.apache.spark.rdd.RDD[scala.collection.immutable.Range.Inclusive] = ParallelCollectionRDD[5] at parallelize at <console>:24 scala> x.foreach(print) Range(1, 2, 3, 4, 5, 6, 7, 8, 9) 复制代码

foreachPartition

foreachPartition方法和mapPartition的作用一样,通过迭代器参数对RDD中每一个分区的数据对象应用函数,区别在于使用的参数是否有返回值。

/*** Applies a function f to each partition of this RDD.*/def foreachPartition(f: Iterator[T] => Unit): Unit = withScope {val cleanF = sc.clean(f)sc.runJob(this, (iter: Iterator[T]) => cleanF(iter))} 复制代码scala> val b = sc.parallelize(List(1,2,3,4,5,6), 3) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:24 scala> b.foreachPartition(x => println(x.reduce((a,b) => a +b))) 7 3 11 复制代码

glom

glom的作用与collec类似,collect是将RDD直接转化为数组的形式,而glom则是将RDD分区数据组装到数组类型的RDD中,每一个返回的数组包含一个分区的所有元素,按分区转化为数组,有几个分区就返回几个数组类型的RDD。

/*** Return an RDD created by coalescing all elements within each partition into an array.*/def glom(): RDD[Array[T]] = withScope {new MapPartitionsRDD[Array[T], T](this, (context, pid, iter) => Iterator(iter.toArray))} 复制代码

下面的例子中,RDD a有三个分区,glom将a转化为由三个数组构成的RDD。

scala> val a = sc.parallelize(1 to 9, 3) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:24 scala> a.glom.collect res5: Array[Array[Int]] = Array(Array(1, 2, 3), Array(4, 5, 6), Array(7, 8, 9)) scala> a.glom res6: org.apache.spark.rdd.RDD[Array[Int]] = MapPartitionsRDD[4] at glom at <console>:26 复制代码

union

union方法与++方法是等价的,将两个RDD去并集,取并集的过程中不会去重。

/*** Return the union of this RDD and another one. Any identical elements will appear multiple* times (use `.distinct()` to eliminate them).*/def union(other: RDD[T]): RDD[T] = withScope {sc.union(this, other)}/*** Return the union of this RDD and another one. Any identical elements will appear multiple* times (use `.distinct()` to eliminate them).*/def ++(other: RDD[T]): RDD[T] = withScope {this.union(other)} 复制代码scala> val a = sc.parallelize(1 to 4,2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24 scala> val b = sc.parallelize(2 to 5,1) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24 scala> a.un union unpersist scala> a.union(b).collect res7: Array[Int] = Array(1, 2, 3, 4, 2, 3, 4, 5)复制代码

cartesian

计算两个RDD中每个对象的笛卡尔积

/*** Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of* elements (a, b) where a is in `this` and b is in `other`.*/def cartesian[U: ClassTag](other: RDD[U]): RDD[(T, U)] = withScope {new CartesianRDD(sc, this, other)}复制代码cala> val a = sc.parallelize(1 to 4,2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:24 scala> val b = sc.parallelize(2 to 5,1) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[6] at parallelize at <console>:24 scala> a.cartesian(b).collect res8: Array[(Int, Int)] = Array((1,2), (1,3), (1,4), (1,5), (2,2), (2,3), (2,4), (2,5), (3,2), (3,3), (3,4), (3,5), (4,2), (4,3), (4,4), (4,5))复制代码

groupBy

groupBy方法有三个重载方法,功能是讲元素通过map函数生成Key-Value格式,然后使用groupByKey方法对Key-Value进行聚合。

/*** Return an RDD of grouped items. Each group consists of a key and a sequence of elements* mapping to that key. The ordering of elements within each group is not guaranteed, and* may even differ each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.*/def groupBy[K](f: T => K)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {groupBy[K](f, defaultPartitioner(this))}/*** Return an RDD of grouped elements. Each group consists of a key and a sequence of elements* mapping to that key. The ordering of elements within each group is not guaranteed, and* may even differ each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.*/def groupBy[K](f: T => K,numPartitions: Int)(implicit kt: ClassTag[K]): RDD[(K, Iterable[T])] = withScope {groupBy(f, new HashPartitioner(numPartitions))}/*** Return an RDD of grouped items. Each group consists of a key and a sequence of elements* mapping to that key. The ordering of elements within each group is not guaranteed, and* may even differ each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.*/def groupBy[K](f: T => K, p: Partitioner)(implicit kt: ClassTag[K], ord: Ordering[K] = null): RDD[(K, Iterable[T])] = withScope {val cleanF = sc.clean(f)this.map(t => (cleanF(t), t)).groupByKey(p)}复制代码scala> val a = sc.parallelize(1 to 9,2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:24 scala> a.groupBy(x => {if(x % 2 == 0) "even" else "odd"}).collect res9: Array[(String, Iterable[Int])] = Array((even,CompactBuffer(2, 4, 6, 8)), (odd,CompactBuffer(1, 3, 5, 7, 9))) scala> def myfunc(a: Int):Int={| a % 2| } myfunc: (a: Int)Int scala> a.groupBy(myfunc).collect res10: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4, 6, 8)), (1,CompactBuffer(1, 3, 5, 7, 9))) scala> a.groupBy(myfunc(_), 1).collect res11: Array[(Int, Iterable[Int])] = Array((0,CompactBuffer(2, 4, 6, 8)), (1,CompactBuffer(1, 3, 5, 7, 9)))复制代码

filter

filter方法对输入元素进行过滤,参数是一个返回值为boolean的函数,如果函数对元素的运算结果为true,则通过元素,否则就将该元素过滤,不进入结果集。

/*** Return a new RDD containing only the elements that satisfy a predicate.*/def filter(f: T => Boolean): RDD[T] = withScope {val cleanF = sc.clean(f)new MapPartitionsRDD[T, T](this,(context, pid, iter) => iter.filter(cleanF),preservesPartitioning = true)}复制代码scala> val a = sc.parallelize(List("we", "are", "from", "China", "not", "from", "America")) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[16] at parallelize at <console>:24 scala> val b = a.filter(x => x.length >= 4) b: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[17] at filter at <console>:25 scala> b.collect.foreach(println) from China from America复制代码

distinct

distinct方法将RDD中重复的元素去掉,只留下唯一的RDD元素。

/*** Return a new RDD containing the distinct elements in this RDD.*/def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {map(x => (x, null)).reduceByKey((x, y) => x, numPartitions).map(_._1)}复制代码scala> val a = sc.parallelize(List("we", "are", "from", "China", "not", "from", "America")) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[18] at parallelize at <console>:24 scala> val b = a.map(x => x.length) b: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[19] at map at <console>:25 scala> val c = b.distinct c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[22] at distinct at <console>:25 scala> c.foreach(println) 5 4 2 3 7复制代码

subtract

subtract方法就是求集合A-B,即把集合A中包含集合B的元素都删除,返回剩下的元素。

/*** Return an RDD with the elements from `this` that are not in `other`.** Uses `this` partitioner/partition size, because even if `other` is huge, the resulting* RDD will be &lt;= us.*/def subtract(other: RDD[T]): RDD[T] = withScope {subtract(other, partitioner.getOrElse(new HashPartitioner(partitions.length)))}/*** Return an RDD with the elements from `this` that are not in `other`.*/def subtract(other: RDD[T], numPartitions: Int): RDD[T] = withScope {subtract(other, new HashPartitioner(numPartitions))}/*** Return an RDD with the elements from `this` that are not in `other`.*/def subtract(other: RDD[T],p: Partitioner)(implicit ord: Ordering[T] = null): RDD[T] = withScope {if (partitioner == Some(p)) {// Our partitioner knows how to handle T (which, since we have a partitioner, is// really (K, V)) so make a new Partitioner that will de-tuple our fake tuplesval p2 = new Partitioner() {override def numPartitions: Int = p.numPartitionsoverride def getPartition(k: Any): Int = p.getPartition(k.asInstanceOf[(Any, _)]._1)}// Unfortunately, since we're making a new p2, we'll get ShuffleDependencies// anyway, and when calling .keys, will not have a partitioner set, even though// the SubtractedRDD will, thanks to p2's de-tupled partitioning, already be// partitioned by the right/real keys (e.g. p).this.map(x => (x, null)).subtractByKey(other.map((_, null)), p2).keys} else {this.map(x => (x, null)).subtractByKey(other.map((_, null)), p).keys}}复制代码scala> val a = sc.parallelize(1 to 9, 2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[23] at parallelize at <console>:24 scala> val b = sc.parallelize(2 to 5, 4) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[24] at parallelize at <console>:24 scala> val c = a.subtract(b) c: org.apache.spark.rdd.RDD[Int] = MapPartitionsRDD[28] at subtract at <console>:27 scala> c.collect res14: Array[Int] = Array(6, 8, 1, 7, 9)复制代码

persist与cache

cache,缓存数据,把RDD缓存到内存中,以便下次计算式再次被调用。persist是把RDD根据不同的级别进行持久化,通过参数指定持久化级别,如果不带参数则为默认持久化级别,即只保存到内存中,与Cache等价。

sample

sample方法的作用是随即对RDD中的元素进行采样,或得一个新的子RDD。根据参数制定是否放回采样,子集占总数的百分比和随机种子。

/*** Return a sampled subset of this RDD.** @param withReplacement can elements be sampled multiple times (replaced when sampled out)* @param fraction expected size of the sample as a fraction of this RDD's size* without replacement: probability that each element is chosen; fraction must be [0, 1]* with replacement: expected number of times each element is chosen; fraction must be greater* than or equal to 0* @param seed seed for the random number generator** @note This is NOT guaranteed to provide exactly the fraction of the count* of the given [[RDD]].*/def sample(withReplacement: Boolean,fraction: Double,seed: Long = Utils.random.nextLong): RDD[T] = {require(fraction >= 0,s"Fraction must be nonnegative, but got ${fraction}")withScope {require(fraction >= 0.0, "Negative fraction value: " + fraction)if (withReplacement) {new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)} else {new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)}}}复制代码scala> val a = sc.parallelize(1 to 100, 2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:24 scala> val b = a.sample(false, 0.2, 0) b: org.apache.spark.rdd.RDD[Int] = PartitionwiseSampledRDD[32] at sample at <console>:25 scala> b.foreach(println) 5 19 20 26 27 29 30 57 40 61 45 68 73 50 75 79 81 85 89 99复制代码

键值对型transformation算子

groupByKey

类似于groupBy,将每一个相同的Key的Value聚集起来形成序列,可以使用默认的分区器和自定义的分区器。

/*** Group the values for each key in the RDD into a single sequence. Allows controlling the* partitioning of the resulting key-value pair RDD by passing a Partitioner.* The ordering of elements within each group is not guaranteed, and may even differ* each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.** @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any* key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.*/def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = self.withScope {// groupByKey shouldn't use map side combine because map side combine does not// reduce the amount of data shuffled and requires all map side data be inserted// into a hash table, leading to more objects in the old gen.val createCombiner = (v: V) => CompactBuffer(v)val mergeValue = (buf: CompactBuffer[V], v: V) => buf += vval mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2val bufs = combineByKeyWithClassTag[CompactBuffer[V]](createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)bufs.asInstanceOf[RDD[(K, Iterable[V])]]}/*** Group the values for each key in the RDD into a single sequence. Hash-partitions the* resulting RDD with into `numPartitions` partitions. The ordering of elements within* each group is not guaranteed, and may even differ each time the resulting RDD is evaluated.** @note This operation may be very expensive. If you are grouping in order to perform an* aggregation (such as a sum or average) over each key, using `PairRDDFunctions.aggregateByKey`* or `PairRDDFunctions.reduceByKey` will provide much better performance.** @note As currently implemented, groupByKey must be able to hold all the key-value pairs for any* key in memory. If a key has too many values, it can result in an `OutOfMemoryError`.*/def groupByKey(numPartitions: Int): RDD[(K, Iterable[V])] = self.withScope {groupByKey(new HashPartitioner(numPartitions))}复制代码scala> val a = sc.parallelize(List("mk", "zq", "xwc", "fig", "dcp", "snn"), 2) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[33] at parallelize at <console>:24 scala> val b = a.keyBy(x => x.length) b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[34] at keyBy at <console>:25 scala> b.groupByKey.collect res17: Array[(Int, Iterable[String])] = Array((2,CompactBuffer(mk, zq)), (3,CompactBuffer(xwc, fig, dcp, snn)))复制代码

combineByKey

comineByKey方法能够有效地讲键值对形式的RDD相同的Key的Value合并成序列形式,用户能自定义RDD的分区器和是否在Map端进行聚合操作。

/*** Generic function to combine the elements for each key using a custom set of aggregation* functions. This method is here for backward compatibility. It does not provide combiner* classtag information to the shuffle.** @see `combineByKeyWithClassTag`*/def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C,mergeCombiners: (C, C) => C,partitioner: Partitioner,mapSideCombine: Boolean = true,serializer: Serializer = null): RDD[(K, C)] = self.withScope {combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners,partitioner, mapSideCombine, serializer)(null)}/*** Simplified version of combineByKeyWithClassTag that hash-partitions the output RDD.* This method is here for backward compatibility. It does not provide combiner* classtag information to the shuffle.** @see `combineByKeyWithClassTag`*/def combineByKey[C](createCombiner: V => C,mergeValue: (C, V) => C,mergeCombiners: (C, C) => C,numPartitions: Int): RDD[(K, C)] = self.withScope {combineByKeyWithClassTag(createCombiner, mergeValue, mergeCombiners, numPartitions)(null)}复制代码scala> val a = sc.parallelize(List("xwc", "fig","wc", "dcp", "zq", "znn", "mk", "zl", "hk", "lp"), 2) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[36] at parallelize at <console>:24 scala> val b = sc.parallelize(List(1,2,2,3,2,1,2,2,2,3),2) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[37] at parallelize at <console>:24 scala> val c = b.zip(a) c: org.apache.spark.rdd.RDD[(Int, String)] = ZippedPartitionsRDD2[38] at zip at <console>:27 scala> val d = c.combineByKey(List(_), (x:List[String], y:String)=>y::x, (x:List[String], y:List[String])=>x::: y) d: org.apache.spark.rdd.RDD[(Int, List[String])] = ShuffledRDD[39] at combineByKey at <console>:25 scala> d.collect res18: Array[(Int, List[String])] = Array((2,List(zq, wc, fig, hk, zl, mk)), (1,List(xwc, znn)), (3,List(dcp, lp)))复制代码

上面的例子使用三个参数重载的方法,该方法的第一个参数createCombiner把元素V转换成另一类元素C,该例子中使用的参数是List(_),表示将输入元素放在List集合中;第二个参数mergeValue的含义是吧元素V合并到元素C中,该例子中使用的是(x:List[String],y:String)=>y::x,表示将y字符合并到x链表集合中;第三个参数的含义是讲两个C元素合并,该例子中使用的是(x:List[String], y:List[String])=>x:::y, 表示把x链表集合中的内容合并到y链表中。

reduceByKey

使用一个reduce函数来实现对想要的Key的value的聚合操作,发送给reduce前会在map端本地merge操作,该方法的底层实现是调用combineByKey方法的一个重载方法。

/*** Merge the values for each key using an associative and commutative reduce function. This will* also perform the merging locally on each mapper before sending results to a reducer, similarly* to a "combiner" in MapReduce.*/def reduceByKey(partitioner: Partitioner, func: (V, V) => V): RDD[(K, V)] = self.withScope {combineByKeyWithClassTag[V]((v: V) => v, func, func, partitioner)}/*** Merge the values for each key using an associative and commutative reduce function. This will* also perform the merging locally on each mapper before sending results to a reducer, similarly* to a "combiner" in MapReduce. Output will be hash-partitioned with numPartitions partitions.*/def reduceByKey(func: (V, V) => V, numPartitions: Int): RDD[(K, V)] = self.withScope {reduceByKey(new HashPartitioner(numPartitions), func)}/*** Merge the values for each key using an associative and commutative reduce function. This will* also perform the merging locally on each mapper before sending results to a reducer, similarly* to a "combiner" in MapReduce. Output will be hash-partitioned with the existing partitioner/* parallelism level.*/def reduceByKey(func: (V, V) => V): RDD[(K, V)] = self.withScope {reduceByKey(defaultPartitioner(self), func)}复制代码scala> val a = sc.parallelize(List("dcp","fjg","snn","wc", "za"), 2) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[0] at parallelize at <console>:24 scala> val b = a.map(x => (x.length,x)) b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[2] at map at <console>:25 scala> b.reduceByKey((a, b) => a + b ).collect res1: Array[(Int, String)] = Array((2,wcza), (3,dcpfjgsnn))复制代码

sortByKey

根据Key值对键值对进行排序,如果是字符,则按照字典顺序排序,如果是数组则按照数字大小排序,可通过参数指定升序还是降序。

scala> val a = sc.parallelize(List("dcp","fjg","snn","wc", "za"), 2) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[4] at parallelize at <console>:24 scala> val b = sc.parallelize(1 to a.count.toInt,2) b: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[5] at parallelize at <console>:26 scala> val c = a.zip(b) c: org.apache.spark.rdd.RDD[(String, Int)] = ZippedPartitionsRDD2[6] at zip at <console>:27 scala> c.sortByKey(true).collect res2: Array[(String, Int)] = Array((dcp,1), (fjg,2), (snn,3), (wc,4), (za,5))复制代码

cogroup

scala> val a = sc.parallelize(List(1,2,2,3,1,3),2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[10] at parallelize at <console>:24scala> val b = a.map(x => (x, "b")) b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[11] at map at <console>:25scala> val c = a.map(x => (x, "c")) c: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[12] at map at <console>:25scala> b.cogroup(c).collect res3: Array[(Int, (Iterable[String], Iterable[String]))] = Array((2,(CompactBuffer(b, b),CompactBuffer(c, c))), (1,(CompactBuffer(b, b),CompactBuffer(c, c))), (3,(CompactBuffer(b, b),CompactBuffer(c, c))))scala> val a = sc.parallelize(List(1,2,2,2,1,3),1) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[15] at parallelize at <console>:24scala> val b = a.map(x => (x, "b")) b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[16] at map at <console>:25scala> val c = a.map(x => (x, "c")) c: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[17] at map at <console>:25scala> b.cogroup(c).collect res4: Array[(Int, (Iterable[String], Iterable[String]))] = Array((1,(CompactBuffer(b, b),CompactBuffer(c, c))), (3,(CompactBuffer(b),CompactBuffer(c))), (2,(CompactBuffer(b, b, b),CompactBuffer(c, c, c))))复制代码

join

首先对RDD进行cogroup操作,然后对每个新的RDD下Key的值进行笛卡尔积操作,再返回结果使用flatmapValue方法。

scala> val a= sc.parallelize(List("fjg","wc","xwc"),2) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[20] at parallelize at <console>:24 scala> val c = sc.parallelize(List("fig", "wc", "sbb", "zq","xwc","dcp"), 2) c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[22] at parallelize at <console>:24scala> val d = c.keyBy(_.length) d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[23] at keyBy at <console>:25scala> val b = a.keyBy(_.length) b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[24] at keyBy at <console>:25scala> b.join(d).collect res6: Array[(Int, (String, String))] = Array((2,(wc,wc)), (2,(wc,zq)), (3,(fjg,fig)), (3,(fjg,sbb)), (3,(fjg,xwc)), (3,(fjg,dcp)), (3,(xwc,fig)), (3,(xwc,sbb)), (3,(xwc,xwc)), (3,(xwc,dcp)))复制代码

Action算子

collect

把RDD中的元素以数组的形式返回。

/*** Return an array that contains all of the elements in this RDD.** @note This method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.*/def collect(): Array[T] = withScope {val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)Array.concat(results: _*)}复制代码scala> val a = sc.parallelize(List("a", "b", "c"),2) a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[28] at parallelize at <console>:24 scala> a.collect res7: Array[String] = Array(a, b, c)复制代码

reduce

使用一个带两个参数的函数把元素进行聚集,返回一个元素的结果。该函数中的二元操作应该满足交换律和结合律,这样才能在并行计算中得到正确的计算结果。

/*** Reduces the elements of this RDD using the specified commutative and* associative binary operator.*/def reduce(f: (T, T) => T): T = withScope {val cleanF = sc.clean(f)val reducePartition: Iterator[T] => Option[T] = iter => {if (iter.hasNext) {Some(iter.reduceLeft(cleanF))} else {None}}var jobResult: Option[T] = Noneval mergeResult = (index: Int, taskResult: Option[T]) => {if (taskResult.isDefined) {jobResult = jobResult match {case Some(value) => Some(f(value, taskResult.get))case None => taskResult}}}sc.runJob(this, reducePartition, mergeResult)// Get the final result out of our Option, or throw an exception if the RDD was emptyjobResult.getOrElse(throw new UnsupportedOperationException("empty collection"))}复制代码scala> val a = sc.parallelize(1 to 10, 2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[29] at parallelize at <console>:24 scala> a.reduce((a, b) => a + b) res8: Int = 55复制代码

take

take方法会从RDD中取出前n个元素。先扫描一个分区,之后从分区中得到结果,然后评估该分区的元素是否满足n,若果不满足则继续从其他分区中扫描获取。

/*** Take the first num elements of the RDD. It works by first scanning one partition, and use the* results from that partition to estimate the number of additional partitions needed to satisfy* the limit.** @note This method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.** @note Due to complications in the internal implementation, this method will raise* an exception if called on an RDD of `Nothing` or `Null`.*/def take(num: Int): Array[T] = withScope {val scaleUpFactor = Math.max(conf.getInt("spark.rdd.limit.scaleUpFactor", 4), 2)if (num == 0) {new Array[T](0)} else {val buf = new ArrayBuffer[T]val totalParts = this.partitions.lengthvar partsScanned = 0while (buf.size < num && partsScanned < totalParts) {// The number of partitions to try in this iteration. It is ok for this number to be// greater than totalParts because we actually cap it at totalParts in runJob.var numPartsToTry = 1Lval left = num - buf.sizeif (partsScanned > 0) {// If we didn't find any rows after the previous iteration, quadruple and retry.// Otherwise, interpolate the number of partitions we need to try, but overestimate// it by 50%. We also cap the estimation in the end.if (buf.isEmpty) {numPartsToTry = partsScanned * scaleUpFactor} else {// As left > 0, numPartsToTry is always >= 1numPartsToTry = Math.ceil(1.5 * left * partsScanned / buf.size).toIntnumPartsToTry = Math.min(numPartsToTry, partsScanned * scaleUpFactor)}}val p = partsScanned.until(math.min(partsScanned + numPartsToTry, totalParts).toInt)val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p)res.foreach(buf ++= _.take(num - buf.size))partsScanned += p.size}buf.toArray}}复制代码scala> val a = sc.parallelize(1 to 10, 2) a: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[30] at parallelize at <console>:24 scala> a.take(5) res9: Array[Int] = Array(1, 2, 3, 4, 5)复制代码

top

top会采用隐式排序转换来获取最大的前n个元素。

/*** Returns the top k (largest) elements from this RDD as defined by the specified* implicit Ordering[T] and maintains the ordering. This does the opposite of* [[takeOrdered]]. For example:* {{{* sc.parallelize(Seq(10, 4, 2, 12, 3)).top(1)* // returns Array(12)** sc.parallelize(Seq(2, 3, 4, 5, 6)).top(2)* // returns Array(6, 5)* }}}** @note This method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.** @param num k, the number of top elements to return* @param ord the implicit ordering for T* @return an array of top elements*/def top(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {takeOrdered(num)(ord.reverse)}/*** Returns the first k (smallest) elements from this RDD as defined by the specified* implicit Ordering[T] and maintains the ordering. This does the opposite of [[top]].* For example:* {{{* sc.parallelize(Seq(10, 4, 2, 12, 3)).takeOrdered(1)* // returns Array(2)** sc.parallelize(Seq(2, 3, 4, 5, 6)).takeOrdered(2)* // returns Array(2, 3)* }}}** @note This method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.** @param num k, the number of elements to return* @param ord the implicit ordering for T* @return an array of top elements*/def takeOrdered(num: Int)(implicit ord: Ordering[T]): Array[T] = withScope {if (num == 0) {Array.empty} else {val mapRDDs = mapPartitions { items =>// Priority keeps the largest elements, so let's reverse the ordering.val queue = new BoundedPriorityQueue[T](num)(ord.reverse)queue ++= collectionUtils.takeOrdered(items, num)(ord)Iterator.single(queue)}if (mapRDDs.partitions.length == 0) {Array.empty} else {mapRDDs.reduce { (queue1, queue2) =>queue1 ++= queue2queue1}.toArray.sorted(ord)}}}复制代码scala> val c = sc.parallelize(Array(1,2,3,5,3,8,7,97,32),2) c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[31] at parallelize at <console>:24 scala> c.top(3) res10: Array[Int] = Array(97, 32, 8)复制代码

count

count方法计算返回RDD中元素的个数。

/*** Return the number of elements in the RDD.*/def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum复制代码scala> val c = sc.parallelize(Array(1,2,3,5,3,8,7,97,32),2) c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[33] at parallelize at <console>:24 scala> c.count res11: Long = 9复制代码

takeSample

返回一个固定大小的数组形式的采样子集,此外还会把返回元素的顺序随机打乱。

/*** Return a fixed-size sampled subset of this RDD in an array** @param withReplacement whether sampling is done with replacement* @param num size of the returned sample* @param seed seed for the random number generator* @return sample of specified size in an array** @note this method should only be used if the resulting array is expected to be small, as* all the data is loaded into the driver's memory.*/def takeSample(withReplacement: Boolean,num: Int,seed: Long = Utils.random.nextLong): Array[T] = withScope {val numStDev = 10.0require(num >= 0, "Negative number of elements requested")require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),"Cannot support a sample size > Int.MaxValue - " +s"$numStDev * math.sqrt(Int.MaxValue)")if (num == 0) {new Array[T](0)} else {val initialCount = this.count()if (initialCount == 0) {new Array[T](0)} else {val rand = new Random(seed)if (!withReplacement && num >= initialCount) {Utils.randomizeInPlace(this.collect(), rand)} else {val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,withReplacement)var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()// If the first sample didn't turn out large enough, keep trying to take samples;// this shouldn't happen often because we use a big multiplier for the initial sizevar numIters = 0while (samples.length < num) {logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()numIters += 1}Utils.randomizeInPlace(samples, rand).take(num)}}}}复制代码scala> val c = sc.parallelize(Array(1,2,3,5,3,8,7,97,32),2) c: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[34] at parallelize at <console>:24 scala> c.takeSample(true,3, 1) res14: Array[Int] = Array(1, 3, 7)复制代码

saveAsTextFile

将RDD存储为文本文件,一次存一行

countByKey

类似count,但是countByKey会根据Key计算对应的Value个数,返回Map类型的结果。

/*** Count the number of elements for each key, collecting the results to a local Map.** @note This method should only be used if the resulting map is expected to be small, as* the whole thing is loaded into the driver's memory.* To handle very large results, consider using rdd.mapValues(_ => 1L).reduceByKey(_ + _), which* returns an RDD[T, Long] instead of a map.*/def countByKey(): Map[K, Long] = self.withScope {self.mapValues(_ => 1L).reduceByKey(_ + _).collect().toMap}复制代码scala> val c = sc.parallelize(List("fig", "wc", "sbb", "zq","xwc","dcp"), 2) c: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[36] at parallelize at <console>:24 scala> val d = c.keyBy(_.length) d: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[37] at keyBy at <console>:25 scala> d.countByKey res15: scala.collection.Map[Int,Long] = Map(2 -> 2, 3 -> 4)复制代码

aggregate

/*** Aggregate the elements of each partition, and then the results for all the partitions, using* given combine functions and a neutral "zero value". This function can return a different result* type, U, than the type of this RDD, T. Thus, we need one operation for merging a T into an U* and one operation for merging two U's, as in scala.TraversableOnce. Both of these functions are* allowed to modify and return their first argument instead of creating a new U to avoid memory* allocation.** @param zeroValue the initial value for the accumulated result of each partition for the* `seqOp` operator, and also the initial value for the combine results from* different partitions for the `combOp` operator - this will typically be the* neutral element (e.g. `Nil` for list concatenation or `0` for summation)* @param seqOp an operator used to accumulate results within a partition* @param combOp an associative operator used to combine results from different partitions*/def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U = withScope {// Clone the zero value since we will also be serializing it as part of tasksvar jobResult = Utils.clone(zeroValue, sc.env.serializer.newInstance())val cleanSeqOp = sc.clean(seqOp)val cleanCombOp = sc.clean(combOp)val aggregatePartition = (it: Iterator[T]) => it.aggregate(zeroValue)(cleanSeqOp, cleanCombOp)val mergeResult = (index: Int, taskResult: U) => jobResult = combOp(jobResult, taskResult)sc.runJob(this, aggregatePartition, mergeResult)jobResult}复制代码

fold

/*** Aggregate the elements of each partition, and then the results for all the partitions, using a* given associative function and a neutral "zero value". The function* op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object* allocation; however, it should not modify t2.** This behaves somewhat differently from fold operations implemented for non-distributed* collections in functional languages like Scala. This fold operation may be applied to* partitions individually, and then fold those results into the final result, rather than* apply the fold to each element sequentially in some defined ordering. For functions* that are not commutative, the result may differ from that of a fold applied to a* non-distributed collection.** @param zeroValue the initial value for the accumulated result of each partition for the `op`* operator, and also the initial value for the combine results from different* partitions for the `op` operator - this will typically be the neutral* element (e.g. `Nil` for list concatenation or `0` for summation)* @param op an operator used to both accumulate results within a partition and combine results* from different partitions*/def fold(zeroValue: T)(op: (T, T) => T): T = withScope {// Clone the zero value since we will also be serializing it as part of tasksvar jobResult = Utils.clone(zeroValue, sc.env.closureSerializer.newInstance())val cleanOp = sc.clean(op)val foldPartition = (iter: Iterator[T]) => iter.fold(zeroValue)(cleanOp)val mergeResult = (index: Int, taskResult: T) => jobResult = op(jobResult, taskResult)sc.runJob(this, foldPartition, mergeResult)jobResult}复制代码

转载于:https://juejin.im/post/5cfdf178e51d454d1d6284e8

总结

以上是生活随笔为你收集整理的Spark学习之Spark RDD算子的全部内容,希望文章能够帮你解决所遇到的问题。

如果觉得生活随笔网站内容还不错,欢迎将生活随笔推荐给好友。