Explain below transformations respective to Dataframe?
Narrow Dependency Transformation
Wide Dependency Transformation
Narrow Dependency Transformation | Wide Dependency Transformation |
---|---|
Narrow Dependency Transformation : A transformation performed independently on a single partition to produce valid results. | Wide Dependency Transformation:A transformation that requires data from other partitions to produce correct results.GrouBy,Join,OrderBy are example of Wide Dependency Transformation. |
Df3 = Df2.filter(Df2.Gender == ‘Male’) | Df3 = Df2.groupBy(country) |
How to read data from column?
from pyspark.sql.functions import *
airlinesDF.select(column(”Origin”), col(’Dest”), airlinesDF.Distance,"year" ).show(1O)
- we can use column,col object ,double quotes within column.
- Consider we have one table,in that there are 3 columns.Day,month,year. Create one table with one column having DATE as in format YYYY-MM-DD.
using expr function
a1r1nesDF.setect(”0r1g1n, Dest’, “Distance”, expr(”to_date(concat(Year,Month,DayofMonth),’yyyyMMdd) as FlightDate”)) .show(1)I
origin dest distance flightdate
1 san sf0 4471 1987-10-14
a1r1nesDF.setect(”0r1g1n, Dest’, “Distance”, to_date(concat("Year","Month","DayofMonth"),’yyyyMMdd').alias("FlightDate”)) .show(1)I
origin dest distance flightdate
1 san sf0 4471 1987-10-14
Convert year in proper format like 81 -> 1981, 11->2011
+—–+—+—–+—-+———–+
| name|day|month|year| id|
+—–+—+—–+—-+———–+
|vikas| 23| 5| 81| 0|
| Ravi| 28| 1|2002| 8589934592|
|vikas| 23| 5| 81| 8589934593|
| John| 12| 12| 6|17179869184|
| kamal| 7| 8| 63|17179869185|
+—–+—+—–+—-+———–+
df3 = df1.withColumn("year", expr("""
case when year < 21 then cast(year as int) + 2000
when year < 100 then cast(year as int) + 1900
else year
end"""))
df3.show()
+-----+---+-----+----+-----------+
| name|day|month|year| id|
+-----+---+-----+----+-----------+
|vikas| 23| 5|1981| 0|
| Ravi| 28| 1|2002| 8589934592|
|vikas| 23| 5|1981| 8589934593|
| John| 12| 12|2006|17179869184|
| kamal| 7| 8|1963|17179869185|
+-----+---+-----+----+-----------+
OR
df4 = df1.withColumn("year", expr("""
case when year < 21 then year + 2000
when year < 100 then year + 1900
else year
end""").cast(IntegerType()))
df4.show()
+-----+---+-----+----+-----------+
| name|day|month|year| id|
+-----+---+-----+----+-----------+
|vikas| 23| 5|1981| 0|
| Ravi| 28| 1|2002| 8589934592|
|vikas| 23| 5|1981| 8589934593|
| John| 12| 12|2006|17179869184|
| kamal| 7| 8|1963|17179869185|
+-----+---+-----+----+-----------+
OR
df5 = df1.withColumn("day", col("day").cast(IntegerType())) \
.withColumn("month", col("month").cast(IntegerType())) \
.withColumn("year", col("year").cast(IntegerType()))
df6 = df5.withColumn("year", expr("""
case when year < 21 then year + 2000
when year < 100 then year + 1900
else year
end"""))
df6.show()
+-----+---+-----+----+-----------+
| name|day|month|year| id|
+-----+---+-----+----+-----------+
|vikas| 23| 5|1981| 0|
| Ravi| 28| 1|2002| 8589934592|
|vikas| 23| 5|1981| 8589934593|
| John| 12| 12|2006|17179869184|
| kamal| 7| 8|1963|17179869185|
+-----+---+-----+----+-----------+
OR
df7 = df5.withColumn("year", \
when(col("year") < 21, col("year") + 2000) \
.when(col("year") < 100, col("year") + 1900) \
.otherwise(col("year")))
df7.show()
+-----+---+-----+----+-----------+
| name|day|month|year| id|
+-----+---+-----+----+-----------+
|vikas| 23| 5|1981| 0|
| Ravi| 28| 1|2002| 8589934592|
|vikas| 23| 5|1981| 8589934593|
| John| 12| 12|2006|17179869184|
| kamal| 7| 8|1963|17179869185|
+-----+---+-----+----+-----------+
Add DOB column at the end of table.
df8 = df7.withColumn(“dob”, expr(“to_date(concat(day,’/’,month,’/’,year), ‘d/M/y’)”))
df8.show()
+—–+—+—–+—-+———–+———-+
| name|day|month|year| id| dob|
+—–+—+—–+—-+———–+———-+
|vikas| 23| 5|1981| 0|1981-05-23|
| Ravi| 28| 1|2002| 8589934592|2002-01-28|
|vikas| 23| 5|1981| 8589934593|1981-05-23|
| John| 12| 12|2006|17179869184|2006-12-12|
| kamal| 7| 8|1963|17179869185|1963-08-07|
+—–+—+—–+—-+———–+———-+