PySpark: Big Data Processing and Machine Learning with Python
09:00-12:00 January 26

½ day
Intermediate level


While the combination of Pandas and Scikit-learn is very popular and easy to use for data processing and machine learning in Python, it cannot handle Big Data: Pandas can only handle as much data as fits in RAM. What if the data is bigger than that (even just on the order of gigabytes)? In that case, PySpark is a great solution. PySpark is the Python API for Spark. It can process data with SQL-like queries, and it comes with machine learning libraries as well.


At the end of the workshop, participants will be able to use PySpark for Big Data processing and machine learning.


  • familiarity with the Pandas and Scikit-learn libraries of Python
  • a laptop with Jupyter Notebook or JupyterLab already installed
  • it is highly recommended that participants install PySpark in advance (the recommended way of installation is via Anaconda)


Hamed Razavi

Scientist, EPFL