Using PySpark and an interactive Jupyter notebook on Amazon Clusters
13:30-16:30 January 26

½ day
Intermediate level

@ 3BC


Working with Big Data sometimes requires access to remote distributed systems such as Amazon or Google Cloud services. In this workshop, I will show how to set up PySpark on Amazon Elastic MapReduce (EMR) and do interactive data processing and machine learning on EMR from a Jupyter notebook on your local computer.


At the end of the workshop, participants will be able to use PySpark for data processing and machine learning on Amazon EMR. They will also learn how to set up an interactive Jupyter notebook that connects to Amazon EMR clusters.


  • Important: Please create an AWS account before the workshop. Although you will get some free usage, you need to provide your credit card information during AWS registration. If you are a student, you can register for an AWS Educate account, in which case you will get more free usage and may not need to provide credit card information. Please be aware that verification of student accounts may take up to 48 hours.
  • Know how to use PySpark (or have already participated in the PySpark: Big Data Processing and Machine Learning with Python workshop)
  • Please download all data from here to save time during the workshop


Hamed Razavi

Scientist, EPFL