Course of the Data Science track of the MSIAM and MOSIG master's programs at Université Grenoble Alpes.
Team
- J. Malick
- F. Iutzeler
- Y. Vernaz
- A. Elrheddane
Contents
- Course
- Introduction to convex optimization: concepts in convex analysis (duality, proximal operators), how to identify potential difficulties in optimization problems. Illustrations in supervised learning (classification and regression problems) and in operations research (decomposition methods)
- Algorithms in convex optimization (gradient, proximal gradient, conditional gradient, ADMM)
- Introduction to distributed computation (architectures for computation, map-reduce scheme, MPI, Spark) + practical work
- Distributed optimization algorithms, stochastic algorithms, asynchronous methods
- Tutorials
- Introduction to Spark
- Take a look at what a tutorial notebook looks like: 00_outline and 01_preliminaries
- Go to the 'Tutorial' tab to get started
- Sparse logistic regression in high dimension
- Application to a recommendation system
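As an illustration of the proximal gradient method listed in the course contents, and of the kind of sparse model studied in the tutorials, here is a minimal sketch in pure Python. It is not taken from the course notebooks: the problem (a toy lasso least-squares problem rather than logistic regression), the function names soft_threshold and proximal_gradient_lasso, and the parameters lam, step and n_iter are all assumptions chosen for illustration.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def proximal_gradient_lasso(A, b, lam, step, n_iter=500):
    """Minimize (1/(2n)) * ||A x - b||^2 + lam * ||x||_1 by proximal gradient.

    A is a list of n rows of length d, b a list of length n.
    All names and defaults here are illustrative assumptions.
    """
    n, d = len(A), len(A[0])
    x = [0.0] * d
    for _ in range(n_iter):
        # Residuals r_i = <a_i, x> - b_i
        r = [sum(A[i][j] * x[j] for j in range(d)) - b[i] for i in range(n)]
        # Gradient of the smooth part: (1/n) * A^T r
        grad = [sum(A[i][j] * r[i] for i in range(n)) / n for j in range(d)]
        # Gradient step, then the prox of step * lam * ||.||_1 (soft-thresholding)
        x = [soft_threshold(x[j] - step * grad[j], step * lam) for j in range(d)]
    return x
```

For instance, with A the 2x2 identity, b = [3, 0.1], lam = 0.5 and step = 1, the iterates converge to (2, 0): the second coordinate is set exactly to zero, which is the sparsity effect of the l1 penalty. A step of at most 1/L, where L is the largest eigenvalue of (1/n) A^T A (here L = 0.5), is a safe choice.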
Grades
The final grade will be a convex combination of the grade for the report on the practical sessions and the grade for the presentation of a recent research article.
- Report on the practical sessions.
We would like you to write a report on the two sessions "sparse regression" and "matrix completion", in groups of 1, 2, or 3 students. The format of the report is free; we expect between 2 and 7 pages, presenting an overview of your work with a focus on one or several specific aspects.
We do not expect you to give all the answers, question by question, nor do we expect you to cover all the material of the two sessions. On the other hand, you can work out other developments.
You can emphasize any aspect of your work, depending on your personal interests and skills, for instance:
- implementation and numerical tests (further developments, more experiments, ...)
- applications in learning or statistics (interpretation of results, other models, other datasets, ...)
- theoretical or mathematical questions (convergence proofs of algorithms, convergence rates, advanced versions, theoretical analysis of special cases, ...)
The quality of the presentation and of the analysis will obviously matter for the grade.
- Presentation of research articles.
We would like you to present an article in groups of 1, 2, or 3 students. The article must be chosen from a list that we will give out on Monday, Dec. 12. The list contains various articles around the topics of the course: some are more theoretical, some are more algorithmic, others deal with applications. The presentation will be short: 8 minutes plus around 5 minutes of questions. In this short time, you can present an overview of the article or put an emphasis on a specific aspect that you find interesting. The slides (in PDF) will be projected from our machine (if you want to present an implementation or a script run, you should prepare slides on it). The presentation will be on Friday, Jan. 13. The presentation slides should be sent the day before (Jan. 12).
Subjects and material
- Introduction
- Introduction slides: spark_introduction.pdf
- Tutorial 1: Preliminaries
- Notebook: 01_preliminaries
- Datasets: data
- Tutorial 2: Logistic Regression for Binary Classification
- Notebook: 02_classification
- Datasets: data
- Tutorial 3: Matrix Factorization for Recommender Systems
- Notebook: 03_recommender
- Datasets: data
Setup
Recommended: Install Docker on your machine
(On Ubuntu, the package is called docker.io; check out the installation instructions.)
- Check your install by running
docker run hello-world
- Launch the image used in the tutorials
docker run -p 8888:8888 -p 4040:4040 -v $(pwd)/notebook:/home/jovyan/work agarg0/pyspark2-notebook:latest
- The first time you run this command, the image will be pulled, which takes some time to download (approx. 2 GB).
- A folder called notebook will be created; this will be your working space.
Depending on your install, you might have permission issues when writing files in the notebook folder; you can remedy that by running "sudo chmod -R 777 ./notebook".
- Download and extract the datasets and notebooks into this folder. Open your browser at localhost:8888 to open Jupyter; you should see the downloaded notebooks and be able to modify them and code inside them.
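If the browser does not show Jupyter, a quick way to tell whether the container is reachable is to test the two ports published by the docker command above. The following is a minimal sketch using only the Python standard library; the helper names port_is_open and check_tutorial_ports are hypothetical, and it assumes the container runs on the same machine. Note that port 4040 (the default port of the Spark web UI) is typically only open while a Spark context is running in a notebook.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, unreachable host, ...
        return False


def check_tutorial_ports(host="localhost", ports=(8888, 4040)):
    """Map each port published by the docker command to its reachability."""
    return {port: port_is_open(host, port) for port in ports}
```

Calling check_tutorial_ports() from a Python shell on the host returns a dictionary such as {8888: ..., 4040: ...}; if 8888 maps to False, the container is probably not running or the port mapping was not applied.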
At Ensimag:
- Enter "lance-vm-pyspark-notebook.sh" in a terminal.
This will launch a VirtualBox virtual machine with CentOS 7, which will be your environment for the labs. You can put the virtual machine in full screen by pressing Right Ctrl + F (same thing to get back to windowed mode).
- You will be automatically logged in as user ensimag with adequate permissions.
Notably, you will be in the sudoers and docker groups.
- Follow the same steps as above in the virtual machine:
- Launch Docker by running the following command in a terminal:
docker run -p 8888:8888 -p 4040:4040 -v $(pwd)/notebook:/home/jovyan/work agarg0/pyspark2-notebook:latest
- Once it is running, open another terminal and run "sudo chmod -R 777 ./notebook" to grant yourself permission in the notebook folder.
- Download and extract the datasets and notebooks into this folder. Open your browser at localhost:8888 to open Jupyter; you should see the downloaded notebooks and be able to modify them and code inside them.
- Warning: The virtual machine is non-persistent, meaning that the files, in particular the notebooks, will be erased at each reboot of the computer. Thus, unlike when working on your own machine, you will have to save your completed notebooks, e.g. by mailing them to yourself and your groupmates.
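One convenient way to save your work before leaving is to bundle the whole notebook folder into a single zip file that can be attached to an email. The sketch below uses only the Python standard library; the function name archive_notebooks and the default archive name notebooks_backup are assumptions made for illustration, while the folder name notebook is the one mounted by the docker command.

```python
import shutil
from pathlib import Path


def archive_notebooks(folder="notebook", archive_name="notebooks_backup"):
    """Zip the contents of `folder` into `<archive_name>.zip` and return its path."""
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"No such folder: {folder}")
    # shutil.make_archive appends the ".zip" extension itself
    return Path(shutil.make_archive(str(archive_name), "zip", root_dir=folder))
```

Run from the directory that contains the notebook folder, archive_notebooks() creates notebooks_backup.zip next to it. Datasets stored in the same folder are included as well, so you may prefer to move large data files elsewhere first.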
Alternative [Experimental]: Ask us for the USB key containing a VirtualBox image.
- In VirtualBox, create a machine for a Linux Ubuntu x64 system and use the image as its hard drive
- Username: cdo Password: LabUga2016
- The system is an Ubuntu distribution and Docker is installed. Proceed as above.
Article presentation
- Link to the article list: here
- Presentation of research articles.
We would like you to present an article in groups of 1, 2, or 3 students. The article must be chosen from a list that we will give out on Monday, Dec. 12. The list contains various articles around the topics of the course: some are more theoretical, some are more algorithmic, others deal with applications. The presentation will be short: 8 minutes plus around 5 minutes of questions. In this short time, you can present an overview of the article or put an emphasis on a specific aspect that you find interesting. The slides (in PDF) will be projected from our machine (if you want to present an implementation or a script run, you should prepare slides on it). The presentation will be on Friday, Jan. 13. The presentation slides should be sent the day before (Jan. 12) to cdo.grenoble@gmail.com.