While the combination of Pandas and Scikit-learn is very popular and easy to use for data processing and machine learning in Python, it does not scale to Big Data. Pandas holds a dataset in the memory of a single machine, so it can only handle as much data as the RAM can hold. What if the data is bigger than that, even if only on the order of gigabytes? Then PySpark is a great solution. PySpark is the Python API for Apache Spark. It can process data with SQL-like queries, and it comes with machine learning libraries as well.
By the end of the workshop, participants will be able to use PySpark for Big Data processing and machine learning.