【spark】常用转换操作:keys 、values和mapValues
1.keys
功能:
返回所有键值对的key
示例
val list = List("hadoop","spark","hive","spark") val rdd = sc.parallelize(list) val pairRdd = rdd.map(x => (x,1)) pairRdd.keys.collect.foreach(println)
结果
hadoop spark hive spark list: List[String] = List(hadoop, spark, hive, spark) rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[142] at parallelize at command-3434610298353610:2 pairRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[143] at map at command-3434610298353610:3
2.values
功能:
返回所有键值对的value
示例
val list = List("hadoop","spark","hive","spark") val rdd = sc.parallelize(list) val pairRdd = rdd.map(x => (x,1)) pairRdd.values.collect.foreach(println)
结果
1 1 1 1 list: List[String] = List(hadoop, spark, hive, spark) rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[145] at parallelize at command-3434610298353610:2 pairRdd: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[146] at map at command-3434610298353610:3
3.mapValues(func)
功能:
对键值对每个value都应用一个函数,但是,key不会发生变化。
示例
val list = List("hadoop","spark","hive","spark") val rdd = sc.parallelize(list) val pairRdd = rdd.map(x => (x,1)) pairRdd.mapValues(_+1).collect.foreach(println)//对每个value进行+1
结果
(hadoop,2) (spark,2) (hive,2) (spark,2)
文章来自:https://www.cnblogs.com/zzhangyuhang/p/9001608.html