Spark2 DataFrame数据框常用操作(三)
import org.apache.spark.sql.functions._
// 对整个DataFrame的数据去重
data.distinct()
data.dropDuplicates()
// 对指定列的去重
val colArray=Array("affairs", "gender")
data.dropDuplicates(colArray)
//data.dropDuplicates("affairs", "gender")
val df=data.filter("gender==‘male‘ ")
// data与df的差集
data.except(df).show
+-------+------+----+------------+--------+-------------+---------+----------+------+
|affairs|gender| age|yearsmarried|children|religiousness|education|occupation|rating|
+-------+------+----+------------+--------+-------------+---------+----------+------+
| 0.0|female|32.0| 15.0| yes| 1.0| 12.0| 1.0| 4.0|
| 0.0|female|32.0| 1.5| no| 2.0| 17.0| 5.0| 5.0|
| 0.0|female|32.0| 15.0| yes| 4.0| 16.0| 1.0| 2.0|
| 0.0|female|22.0| 0.75| no| 2.0| 12.0| 1.0| 3.0|
| 0.0|female|27.0| 4.0| no| 4.0| 14.0| 6.0| 4.0|
+-------+------+----+------------+--------+-------------+---------+----------+------+
// data与df的交集
data.intersect(df)
文章来自:http://www.cnblogs.com/wwxbi/p/6102085.html