Course Objectives
This course is intended for students who want to understand modern large-scale data analysis systems and database systems. It covers a wide range of topics and technologies, and will prepare students to build such systems as well as use them effectively to address analytics and data science challenges.
Go here for the web page of the Spring 2014 edition of this course.
Content
- Map-reduce/Hadoop, GFS/HDFS, Bigtable/HBase; Spark.
- SQL and relational algebra. Expressing advanced problems as queries. Data-parallel programming. Circuit complexity and its interpretation in data-parallel programming. Monad algebra. NESL, DryadLINQ, PigLatin. Data-flow parallelism vs. message passing. The bulk-synchronous parallel programming model: Pregel.
- Data locality. Memory hierarchies. New hardware. Sequential versus random access to secondary storage. Query operators – join, selection, projection, sorting. Join and sorting algorithms.
- Query optimization. Index selection. Physical database design. Database tuning.
- Parallel & distributed databases: Scaling, partitioning, replication, Bloom joins. Massively parallel joins. Theta-joins on map-reduce, handling skew; online map-reduce.
- Concurrency control (CC): transactions. SQL isolation levels. Anomalies. Serializability. 2-phase locking. Optimistic CC. Multiversion CC. Snapshot isolation. Distributed transactions. 2-phase commit.
- Eventual consistency. The CAP theorem. NoSQL systems. NewSQL systems.
- OLAP, data cubes. The data warehousing workflow, ETL. Data mining: Frequent itemsets (the Apriori algorithm), association rules. Clustering. Decision tree construction.
- Basics of big data machine learning.
- Realtime analytics: Data stream processing: DSMS and CEP systems. CQL. Window semantics and window joins. Load shedding. Sampling and approximating aggregates (no joins). Querying histograms. Maintaining histograms of streams. Incremental and online query processing: incremental view maintenance: materialized views, delta processing; online aggregation – sampling, ripple joins, error bounding.
Course Project
The project is in the space of large-scale data analysis and will draw together many of the ideas covered in the course. Projects will be worked on by teams of typically 7-10 students.
You are encouraged to propose your own project and recruit a team of students to work on it.
Not every project idea is suitable.
- First, the project should require about the right amount of effort. We can advise you on this.
- It has to be possible to distribute the work in a fair way among the team members.
- Projects will be of two kinds: systems/infrastructure projects (building a big data file system, analytics engine, or such; or a library of scalable machine learning algorithms) or big data apps (the majority of last year’s projects were of this kind).
- In the case of an app, you will need data, and it is an absolute requirement that you convince us that you will be able to get the data you need. In particular, having learned from last year’s project, we will not accept any project that requires access to Twitter data.
We will tell you at the beginning of the course how to propose your own project. If we consider the project proposal suitable for the course, we will do our best to support you.
In addition, we will make some project suggestions.
Every student has to join exactly one project team.
Project list
- AiiDA extension
by Roger Küng (team leader), Jeremy Rabasco, Artur Skonecki, Alexandre Carlessi, Jocelyn Boullier, Lukas Kellenberger, Martin Duhem, Allan Renucci, and Souleimane Drissi El Kamili
(https://github.com/BIGDATA2015-AIIDA-EXTENSION, wiki)
- Algorithmic Trading
by Kamil Bennani-Smires, Jakob Ehrl, Liu Fengyun, Dennis Meville Meier, Merlin Nimier-David (team leader), Arnaud Robert, Elias Schegg, and Jakub Sygnowski
(https://github.com/merlinND/TradingSimulation/, wiki)
- Crosswords
by Patrick Andrade, Johan Berdat (team leader), Timothée Emery, Matteo Filipponi, Grégory Maître, Vincent Mettraux, Utku Sirin, and Laurent Valette
(https://github.com/jojolebarjos/crosswords, wiki)
- Crowdsourcing
by Florian Chlan, François Farquet, Joachim Hugonot (team leader), Simon Rodriguez, Kristof Szabo, Florian Vessaz, Guo Xinyi, and Vincent Zellweger
(https://redmine.epfl.ch/projects/bd15_crowdsourced_queries/repository, wiki)
- DeepManuscripts
by Viviana Petrescu, Radu Ionescu, Arttu Voutilainen, Alexander Vostriakov, Benoît Seguin, Isinsu Katircioglu, Arun Balajee Vasudevan, Ashish Ranjen Jha, and Nikolaos Arvanitopoulos (team leader)
(https://github.com/arvanito/DeepManuscripts, final report)
- DevSearch
by Christian Zommerfelds (team leader), Damien Engels, Bastien Jacot-Guillarmod, Julien Graisse, Mateusz Golebiewski, Matthieu Rudelle, Nicolas Hubacher, Nicolas Voirol, Pascal Lau, and Pierre Walch
(https://github.com/devsearch-epfl, wiki)
This project was selected by the students of the course as an Audience Choice Award (co-)runner-up!
- Event detection
by Laurent Anadon (team leader), Antoine Bastien, Antoine Bodin, Matias Cerchierini, Nina Desnica, Louis Faucon, Damien Hilloulin, Christian Mouchet, and Sami Perrin
(https://github.com/ChristianMct/bigdata-event-stream-detection, wiki)
This project was selected by the students of the course for the Audience Choice Award!
- Hypercube
by Khue Vu (team leader), Khayyam Guliyev, Thanh Tam Nguyen, Zisi Wang, and Patrice Gueniat
(https://github.com/khuevu/squall, wiki)
- Linguistic drift
by Cynthia Oeschger (team leader), Farah Bouassida, Tao Lin, Jéremy Weber, Nicolas Bornand, Marc Schär, Gil Brechbühler, and Malik Bougacha
(https://github.com/AblionGE/A-study-of-linguistic-drift-on-Le-Temps-Newspaper-Corpus, wiki)
- Temporal profiles
by Sidney Bovet (co-team leader), Valentin Rutz (co-team leader), John Gaspoz, Zhivka Gucevska, Florian Christophe Junker, Mathieu Monney, Joanna Béatrice Salathé, Ana Manasovska, and Fabien Jolidon
(https://github.com/SidneyBovet/smargn, wiki)
- Transactional Key-Value Store
by Philémon Favrod (co-team leader), Andy Roulin, Egeyar Özlen Bagcioglu, Florian Vincent Depraz, Georgios Piskas (co-team leader), and Sachin Basil John
(https://github.com/philemonf/transactional-key-value-store-project, wiki)
- Weather: Activity recommendation platform based on weather predictions
by <anonymous>, Victor Constantin, Baptiste Vinh Mau, Frédéric Bonnand, Nemanja Drobnjak, Ivan Maslov, and Roman Shirochenko (team leader)
(https://github.com/BigData2015Team/bigWeatherData, wiki)
- WikiLynx (Wikileaks Data Exploration)
by Florian Briant, Frédéric Jan Ouwehand, Can Güzelhan, Deniz Taneli, Thierry Nyfeler, Martin Engilberge, Ketevani Zaridze, and Hussein Kassir (team leader)
(https://github.com/fouweric/wikileaks-data-analytics, wiki, http://wikilynx.org/ (under construction))
This project was selected by the students of the course as an Audience Choice Award (co-)runner-up!
Required Prior Knowledge
- A basic course on database systems (e.g. covering parts III, IV, and V of Ramakrishnan and Gehrke on storage and indexing, query processing, and concurrency control).
- You absolutely must master SQL, relational algebra, key and foreign key constraints, and the transaction concept.
- Solid programming skills in Java.
- Programming experience in a functional programming language (such as Scala) is strongly recommended.
- Familiarity with working on a Unix-style operating system.
- Basic knowledge of linear algebra (vector spaces, matrix multiplication), probability theory and statistics, and complexity theory (complexity classes, reductions, completeness, LOGSPACE, P, NP) is required.
Important Information
7 credits: 3 (lectures) + 2 (exercises) + 2 (project). This course is taught in English.
We use Moodle (go here for the course page). Please sign up in IS-Academia to take the course (this will automatically give you access to Moodle).
Plenary dates: Tuesdays 1:15-4pm in CM1. The first plenary will take place on Tuesday Feb. 17, 2015. Exercise dates: Wednesdays 10:15am-noon in INJ218. Project meetings: project teams decide on their own when to meet. (See also this page.)
It is important to attend the plenaries. Physically attending the exercises and project meetings is optional. We partially flip the classroom: some lectures are provided on video, and we use the plenaries for lectures as well as group work, discussions, and hands-on work.
Course staff: Christoph Koch (lecturer); Mohammad Dashti, Mohammed ElSeidy, Renata Khasanova, Amir Shaikhha, Immanuel Trummer, and Aleksandar Vitorovic (teaching assistants).
Office hours are by appointment (our email addresses are firstname.lastname@epfl.ch). Please use classes and the breaks between them to ask questions. Teaching assistants will be present in the exercise sessions.
Getting a Grade
This course uses in-course grading. Attendance of and active participation in the plenaries are very strongly encouraged. Attending the exercises is optional, but please keep in mind that the TAs spend a lot of time there, so please ask your questions there rather than requesting separate appointments. You have to arrange regular project meeting times within your project team; despite what the course catalog says, there is no specific day and time at which these meetings must take place.
The grade is determined based on
- 5 homeworks (5 * 4% = 20%). Homework has to be done individually (collaboration is considered cheating) and is to be submitted on Tuesdays by the start of the plenary. Homeworks will usually be due every second week. Tentatively, HW1 is posted Feb. 24 and due Mar. 10; HW2 is posted Mar. 10 and due Mar. 24; HW3 is posted Mar. 24 and due Apr. 14; HW4 is posted Apr. 14 and due Apr. 28; and HW5 is posted Apr. 28 and due May 12.
- A midterm exam (20%, March 31) and a final exam (30%, May 26). These are held in class (starting at 1:15pm, 120 minutes).
- The course project (30%). The project will be worked on in student teams.
The grade scale is as follows: 6: >= 95%; 5.5: >= 90%; 5: >= 80%; 4.5: >= 65%; 4: >= 50%. Failure below 50%.
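To make the arithmetic concrete, here is a minimal sketch of how the weights and the grade scale above combine. It assumes that every component is scored on a 0-100 scale and that no rounding is applied before the thresholds are compared; neither detail is stated on this page, and the function names are purely illustrative.

```python
from typing import Optional, Sequence

# Component weights in percent, as listed in the grading breakdown.
HOMEWORK_WEIGHT = 4    # each of the 5 homeworks
MIDTERM_WEIGHT = 20
FINAL_WEIGHT = 30
PROJECT_WEIGHT = 30

# (minimum total percentage, grade) pairs from the grade scale, highest first.
GRADE_SCALE = [(95, 6.0), (90, 5.5), (80, 5.0), (65, 4.5), (50, 4.0)]


def total_percentage(homeworks: Sequence[float], midterm: float,
                     final: float, project: float) -> float:
    """Combine component scores (each assumed to be in 0..100) into a total percentage."""
    if len(homeworks) != 5:
        raise ValueError("expected exactly 5 homework scores")
    weighted = (HOMEWORK_WEIGHT * sum(homeworks)
                + MIDTERM_WEIGHT * midterm
                + FINAL_WEIGHT * final
                + PROJECT_WEIGHT * project)
    return weighted / 100


def grade_for(percentage: float) -> Optional[float]:
    """Return the grade for a total percentage, or None for a failure (below 50%)."""
    for minimum, grade in GRADE_SCALE:
        if percentage >= minimum:
            return grade
    return None


if __name__ == "__main__":
    total = total_percentage([100, 100, 100, 100, 100], midterm=80, final=70, project=90)
    print(total, grade_for(total))
```

In this hypothetical example, full marks on all homeworks, 80 on the midterm, 70 on the final, and 90 on the project add up to 20 + 16 + 21 + 27 = 84%, which corresponds to a grade of 5.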
Academic Integrity and Group Work
Exams, homeworks, and, unless stated otherwise, project work are to be done individually. (Projects are carried out in teams, but every team member will have some deliverables which will have to be worked on individually.) Collaboration on these will be considered cheating.
All cases of cheating will be taken very seriously, and may lead to your expulsion from the university without graduation. We use plagiarism detection software that is not easy to trick.
In case of doubt in academic integrity matters, ask the instructor.