Master Statistique Science des Données (SSD)

Introduction to Python for Data Science


The proposed projects are described below; you are also free to set up your own. The project descriptions are deliberately vague, and initiative is expected: you should look for information, tips, etc. on your own.

For all projects, you have to (and will be graded on these points):

  • Gather and preprocess data (Python code or notebook)
  • Extract information by data analysis or machine learning (Python code or notebook)
  • Generate visuals illustrating your findings (plots)
  • Present these results (notebook or slides)

In practice:

  • Groups of 2/3 students
  • 6h of preparation in class
  • Presentation (5 min + 5 min questions)
  • Notebook/code handout
  • Several groups can take the same project (try not to overlap or collaborate with each other)

Project 1: Tree-based classifiers

Tree-based classifiers are classification procedures that determine a class by a succession of tests. For that reason, they are widely used in industry. However, they raise a number of questions in terms of learning performance.

Example of goals:

  • Investigate a tree-based classifier on the iris dataset then on bigger multi-class decision problems
  • Produce images of the obtained decision trees
  • Produce a tutorial notebook on tuning these classifiers
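
As a possible starting point for Project 1, here is a minimal sketch that fits a small decision tree on the iris dataset and prints its tests as text. It assumes scikit-learn is installed; the depth limit, the split ratio, and the random seeds are arbitrary illustrative choices, not recommended settings.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# Hold out 30% of the samples to estimate generalization (arbitrary choice).
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0, stratify=iris.target
)

# max_depth is one of the main parameters to tune; 3 is only an example value.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

test_accuracy = tree.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")

# Text rendering of the succession of tests learned by the tree.
tree_rules = export_text(tree, feature_names=list(iris.feature_names))
print(tree_rules)
```

To produce an image rather than text, `sklearn.tree.plot_tree` draws the same tree with matplotlib. Comparing the training and test accuracy for several values of `max_depth` is one way to start the tuning tutorial.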

Project 2: MultiClass Prediction

Classifying into more than two classes can be much more involved than the binary case. Worse, the imbalance between the number of examples in each class may become a serious problem (see the Shuttle binary classification dataset and DMOZ in the multiclass setting). Compare different strategies such as one-versus-all pipelines, tree-based approaches, etc.

Example of goals:

  • Study the different scores on multiclass classification
  • Compare different methods for different merits
  • Explain your findings in a notebook
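
To illustrate why the choice of score matters for Project 2, the sketch below compares a one-versus-rest linear model and a decision tree on a synthetic imbalanced three-class problem. The dataset is generated with `make_classification` purely for illustration (it is not Shuttle or DMOZ), and the class weights and model settings are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, balanced_accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data: roughly 90% / 8% / 2% of samples per class.
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=6, n_classes=3,
    weights=[0.9, 0.08, 0.02], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

models = {
    "one-vs-rest logistic regression": OneVsRestClassifier(
        LogisticRegression(max_iter=1000)
    ),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    scores[name] = {
        # Accuracy is dominated by the majority class.
        "accuracy": accuracy_score(y_test, y_pred),
        # Balanced accuracy and macro F1 weight every class equally.
        "balanced_accuracy": balanced_accuracy_score(y_test, y_pred),
        "macro_f1": f1_score(y_test, y_pred, average="macro", zero_division=0),
    }

for name, model_scores in scores.items():
    print(name, {k: round(v, 3) for k, v in model_scores.items()})
```

On imbalanced data, a gap between plain accuracy and the class-averaged scores is a hint that the rare classes are poorly predicted.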

Project 3: Netflix recommendation

Movie recommendation based on user reviews through matrix factorization has been very popular, notably since the Netflix Prize. But sometimes, good results can be obtained simply by looking at the reviews made by similar users.

Example of goals:

  • Implement a recommender system using matrix factorization, and one using similar users
  • Compare their performances and their running times on typical datasets
  • Lay out your conclusions in a notebook
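
The two approaches of Project 3 can be sketched on a toy example as follows. The rating matrix is made up for illustration, and both functions (`factorize`, `predict_from_similar_users`) are hypothetical helpers written for this sketch: a plain gradient-descent factorization and a cosine-similarity weighted average, assuming only NumPy. Real datasets are much larger and sparser and call for more careful methods.

```python
import numpy as np

# Hypothetical toy ratings (rows: users, columns: movies); 0 means "not rated".
R = np.array([
    [5, 4, 0, 1, 0],
    [4, 0, 4, 1, 1],
    [1, 1, 0, 5, 4],
    [0, 1, 2, 4, 5],
    [5, 5, 4, 0, 1],
], dtype=float)
observed = R > 0


def factorize(R, observed, rank=2, steps=2000, lr=0.01, reg=0.05, seed=0):
    """Approximate R by U @ V.T using gradient descent on observed entries."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((R.shape[0], rank))
    V = 0.1 * rng.standard_normal((R.shape[1], rank))
    for _ in range(steps):
        # Errors are only counted where a rating exists.
        err = observed * (R - U @ V.T)
        U_step = err @ V - reg * U
        V_step = err.T @ U - reg * V
        U += lr * U_step
        V += lr * V_step
    return U, V


def predict_from_similar_users(R, observed, user, item):
    """Average the item's ratings, weighted by cosine similarity to `user`.

    Unrated entries count as 0 in the similarity, which is a simplification.
    """
    norms = np.linalg.norm(R, axis=1)
    sims = (R @ R[user]) / (norms * norms[user])
    sims[user] = 0.0  # do not use the user's own row
    weights = sims * observed[:, item]
    if weights.sum() == 0:
        return float(R[observed].mean())  # fall back to the global mean
    return float(weights @ R[:, item] / weights.sum())


U, V = factorize(R, observed)
R_hat = U @ V.T
train_rmse = float(np.sqrt(np.mean((R - R_hat)[observed] ** 2)))
neighbor_prediction = predict_from_similar_users(R, observed, user=0, item=2)

print(f"Training RMSE of the factorization: {train_rmse:.2f}")
print(f"Factorization prediction for user 0, movie 2: {R_hat[0, 2]:.2f}")
print(f"Similar-user prediction for user 0, movie 2: {neighbor_prediction:.2f}")
```

For a fair comparison, the error should be measured on held-out ratings rather than on the training entries, and running times can be measured with `time.perf_counter`.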

Project 4: Deep Learning with TensorFlow/Keras

In recent years, deep learning methods have become more and more popular, especially as they reached impressive accuracy on machine learning tasks such as image classification. Most frameworks for neural networks are interfaced with Python, the most popular being Google's TensorFlow and Keras. A typical good use case for neural networks is digit recognition.

Example of goals:

  • Install TensorFlow/Keras
  • Write code that trains a digit classifier using deep learning architectures
  • Produce a tutorial notebook on how to do that

Project 5: Kaggle

Take a Kaggle problem you find interesting and try to reach a good score. For instance:

are good choices

Example of goals:

  • Investigate the problem and the evaluation procedure
  • Produce your own solution and try to improve your score
  • Produce a Kernel showing what score you reached and how you reached it

Project 6: Cover a point of the course that has been mostly left out

There are points in the course that we did not cover in much detail. For instance:

  • Cross-validation of algorithm parameters
  • Fancy visualizations with seaborn
  • Pipelining
  • Working with textual data

are good choices

Example of goals:

  • Prepare a quick course on the topic
  • Demonstrate how to use it in practice
  • Show some nice examples
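
As an example of what a demonstration for Project 6 could look like, the sketch below combines two of the listed topics, pipelining and cross-validation of algorithm parameters, using scikit-learn. The choice of classifier, the parameter grid, and the dataset are assumptions made only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The pipeline chains preprocessing and the classifier, so the scaler is
# refit on the training part of every cross-validation fold.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("svm", SVC()),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.1, 1.0],
}

# 5-fold cross-validation over every combination in the grid.
search = GridSearchCV(pipeline, param_grid, cv=5)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Putting the scaler inside the pipeline, rather than scaling the whole dataset first, keeps information from the validation folds out of the preprocessing step.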