Week 5 Homework

In this homework we'll put what we learned about Spark into practice.

For this homework we will be using the FHV 2019-10 data, found here: FHV Data

Question 1:

Install Spark and PySpark

Install Spark
Run PySpark
Create a local Spark session
Execute spark.version.

What's the output?

Answer:

3.5.1
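
The steps above can be sketched roughly as follows. This is a minimal example, assuming PySpark is already installed; the function name `create_local_session` and the application name are hypothetical, chosen only for illustration.

```python
def create_local_session(app_name="homework-week5"):
    """Create (or reuse) a Spark session that runs on the local machine."""
    # Imported inside the function so this file can be loaded
    # even on a machine where PySpark is not installed yet.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")  # run locally, using all available cores
        .appName(app_name)   # hypothetical application name
        .getOrCreate()
    )
```

Calling `spark = create_local_session()` and then `print(spark.version)` prints the installed Spark version, which is the value asked for in this question.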

Question 2:

FHV October 2019

Read the October 2019 FHV data into a Spark DataFrame with a schema, as we did in the lessons.

Repartition the DataFrame to 6 partitions and save it to Parquet.

What is the average size, in MB, of the Parquet files that were created (those ending with the .parquet extension)? Select the answer that most closely matches.
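
One possible way to approach this is sketched below. The helper names, the paths, and the `schema` argument are all assumptions for illustration: the schema is expected to be the `StructType` defined in the lessons, and "MB" is taken here to mean 1024 * 1024 bytes, which may differ slightly from what a file browser reports.

```python
from pathlib import Path


def write_repartitioned(spark, csv_path, out_dir, schema, partitions=6):
    """Read the FHV CSV with an explicit schema and save it as Parquet."""
    df = (
        spark.read
        .option("header", "true")  # assumes the CSV has a header row
        .schema(schema)            # schema supplied by the caller
        .csv(csv_path)
    )
    # Repartition, then write one Parquet file per partition.
    df.repartition(partitions).write.parquet(out_dir, mode="overwrite")


def average_parquet_size_mb(out_dir):
    """Average size in MB of the files ending in .parquet in out_dir."""
    # Files such as _SUCCESS or *.crc do not end in .parquet and are skipped.
    sizes = [p.stat().st_size for p in Path(out_dir).glob("*.parquet")]
    if not sizes:
        raise ValueError(f"no .parquet files found in {out_dir}")
    return sum(sizes) / len(sizes) / (1024 * 1024)
```

After `write_repartitioned(spark, "fhv_tripdata_2019-10.csv", "fhv/2019/10", schema)`, where both paths are placeholders, `average_parquet_size_mb("fhv/2019/10")` gives the number to compare against the answer options.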