Master Statistique Science des Données (SSD)

Introduction to Python for Data Science


The proposed projects are described below; you are also free to set up your own. The project descriptions are deliberately vague, and initiative is expected: you should look for information, tips, etc. on your own.

For all projects, you have to (and will be graded on these points):

  • Gather and preprocess data (Python code or notebook)
  • Extract information by data analysis or machine learning (Python code or notebook)
  • Generate visuals illustrating your findings (plots)
  • Present these results (notebook or PowerPoint/Beamer)

In practice:

  • Groups of at most 4
  • 6h of preparation in class (8/11, 15/11)
  • Presentation (5 min + 5 min questions) 22/11
  • Notebook/code handout 22/11

Project 1: Tree-based classifiers

Tree-based classifiers are classification procedures that determine a class by a succession of tests. For that reason, they are widely used in industry. However, they raise a number of questions in terms of learning performance.

Example of goals:

  • Investigate a tree-based classifier on the iris dataset then on bigger multi-class decision problems
  • Produce images of the obtained decision trees
  • Produce a tutorial notebook on tuning these classifiers
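
As a possible starting point for the first two goals, the following minimal sketch fits a shallow decision tree on iris with scikit-learn and prints the learned succession of tests as text. The depth limit (`max_depth=3`), the 70/30 split and the fixed seeds are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Load iris and hold out 30% of the samples for evaluation (assumed split).
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# A shallow tree keeps the succession of tests short and readable.
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out samples.
accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.3f}")

# Text rendering of the learned tests, one line per node.
rules = export_text(clf, feature_names=list(iris.feature_names))
print(rules)
```

Varying `max_depth` and comparing training and test accuracy is one simple way to start investigating the learning performance questions mentioned above.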

Project 2: Large Scale learning with Spark

Spark is a computing framework for working on computer clusters. It is based on the fact that most operations can be decomposed into local operations on the data (maps) whose results are brought together by some function (reduce). The interest of this framework is that it is quite simple to write and test code on a personal laptop that will then scale to a large cluster. Investigate how similar the machine learning library of Spark, MLlib, is to Scikit-Learn.

Example of goals:

  • Install Spark on your personal computer
  • Solve again some problems tackled in the course using Spark's MLlib and compare the coding practice to Scikit-Learn
  • Produce a tutorial notebook on Spark basics
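
The map/reduce decomposition described above can be illustrated in plain Python, without Spark. The sketch below computes a mean over data split into chunks: each chunk is mapped to a local (sum, count) pair, and the pairs are then reduced by addition. The chunking and the function names are hypothetical, and this is not the Spark API.

```python
from functools import reduce

# Hypothetical dataset, split into 4 chunks as if spread over 4 workers.
data = list(range(1, 101))
partitions = [data[i:i + 25] for i in range(0, len(data), 25)]


def local_stats(chunk):
    """Map step: summarize one chunk using only its own data."""
    return (sum(chunk), len(chunk))


def combine(a, b):
    """Reduce step: merge two partial summaries into one."""
    return (a[0] + b[0], a[1] + b[1])


partials = list(map(local_stats, partitions))
total, count = reduce(combine, partials)
mean = total / count
print(mean)
```

Because `combine` is associative and commutative, the partial results can be merged in any order, which is what makes this pattern easy to distribute over a cluster.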

Project 3: MultiClass Prediction

Classifying into more than two classes may be far more involved than the binary case. Worse, the imbalance between the number of examples in each class may become a serious problem (see the Shuttle binary classification dataset and DMOZ in the multiclass setting). Compare different strategies such as pipelining One-versus-All classifiers, tree-based approaches, and the PD-Sparse algorithm.

Example of goals:

  • Study the different scores on multiclass classification
  • Compare different methods for different merits
  • Explain your findings in a notebook
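
To see why the choice of score matters under class imbalance, here is a small sketch in plain Python. The labels are made up for illustration (90/8/2 examples per class), and the "classifier" is a hypothetical one that always predicts the majority class.

```python
from collections import Counter

# Hypothetical imbalanced labels and a majority-class predictor.
y_true = [0] * 90 + [1] * 8 + [2] * 2
y_pred = [0] * 100


def accuracy(y_true, y_pred):
    """Fraction of examples that are correctly classified."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_recall(y_true, y_pred):
    """Recall of each class: correct predictions / examples of that class."""
    support = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: hits[c] / support[c] for c in sorted(support)}


acc = accuracy(y_true, y_pred)
recalls = per_class_recall(y_true, y_pred)
# Macro-averaging gives every class the same weight, whatever its size.
macro_recall = sum(recalls.values()) / len(recalls)

print(f"accuracy     = {acc:.3f}")
print(f"recalls      = {recalls}")
print(f"macro recall = {macro_recall:.3f}")
```

Accuracy looks high here because it is dominated by the majority class, while macro-averaged recall exposes that two of the three classes are never predicted. Scikit-Learn offers comparable scores, for example `recall_score` with `average="macro"`.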

Project 4: NetFlix recommendation

Movie recommendation based on user reviews through matrix factorization has been very popular, notably since the NetFlix prize. But sometimes, good results can be obtained by looking at the reviews made by similar users.

Example of goals:

  • Implement a recommender system using matrix factorization, and one using similar users
  • Compare their performances and their running times on typical datasets
  • Lay out your conclusions in a notebook
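
As a rough illustration of the matrix factorization half of the first goal, the sketch below learns user and item factors by stochastic gradient descent on the squared rating error with L2 regularization. This is one common formulation, assumed here for illustration; the toy ratings, the number of factors and the hyperparameters are all hypothetical. The similar-users approach is not covered by this sketch.

```python
import math
import random

# Hypothetical toy data: (user, item, rating) triples on a 1-5 scale.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 2),
]
n_users, n_items, n_factors = 4, 4, 2

# Small random initial factors; the seed is fixed for reproducibility.
rng = random.Random(0)
P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]


def predict(u, i):
    """Predicted rating: dot product of user and item factors."""
    return sum(pu * qi for pu, qi in zip(P[u], Q[i]))


def rmse():
    """Root mean squared error over the observed ratings."""
    return math.sqrt(
        sum((r - predict(u, i)) ** 2 for u, i, r in ratings) / len(ratings)
    )


initial_rmse = rmse()
learning_rate, reg = 0.02, 0.02  # assumed values, not tuned
for epoch in range(500):
    for u, i, r in ratings:
        err = r - predict(u, i)
        for f in range(n_factors):
            pu, qi = P[u][f], Q[i][f]
            # Gradient step on the regularized squared error.
            P[u][f] += learning_rate * (err * qi - reg * pu)
            Q[i][f] += learning_rate * (err * pu - reg * qi)
final_rmse = rmse()

print(f"training RMSE: {initial_rmse:.3f} -> {final_rmse:.3f}")
# Prediction for a pair that is absent from the training triples.
print(f"predicted rating of user 0 for item 3: {predict(0, 3):.2f}")
```

The error reported here is measured on the training ratings only; a fair comparison between methods would hold out some ratings for evaluation.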

Project 5: Deep Learning with TensorFlow/Keras

In recent years, deep learning methods have become increasingly popular, especially as they reached impressive accuracy on machine learning tasks such as image classification. Most neural network frameworks have a Python interface, the most popular being Google's TensorFlow and Keras. A typical good use case for neural networks is digit recognition.

Example of goals:

  • Install TensorFlow/Keras
  • Write code for learning a digit classifier using deep learning architectures
  • Produce a tutorial notebook on how to do that

Project 6: Kaggle

Take a Kaggle problem you find interesting and try to reach a good score. For instance, is fairly doable.

Example of goals:

  • Investigate the problem and the evaluation procedure
  • Produce your own solution and try to improve your score
  • Produce a Kernel showing what you did and how you reached that score