Big Data (Spring 2015)

Course Objectives

This course is intended for students who want to understand modern large-scale data analysis systems and database systems. It covers a wide range of topics and technologies, and will prepare students to build such systems as well as use them effectively to address analytics and data science challenges.

Go here for the web page of the Spring 2014 edition of this course.

Content

  1. Map-reduce/Hadoop, GFS/HDFS, Bigtable/HBASE; Spark.
  2. SQL and relational algebra. Expressing advanced problems as queries. Data-parallel programming. Circuit complexity and its interpretation in data-parallel programming. Monad algebra. NESL, DryadLINQ, PigLatin. Data-flow parallelism vs. message passing.  The bulk-synchronous parallel programming model: Pregel.
  3. Data locality. Memory hierarchies. New hardware. Sequential versus random access to secondary storage. Query operators – join, selection, projection, sorting. Join and sorting algorithms.
  4. Query optimization. Index selection. Physical database design. Database tuning.
  5. Parallel & distributed databases: Scaling, partitioning, replication, bloom joins. Massively parallel joins. Theta-joins on map-reduce, handling skew; online map-reduce.
  6. Concurrency control (CC): transactions. SQL isolation levels. Anomalies. Serializability. 2-phase locking. Optimistic CC. Multiversion CC. Snapshot isolation. Distributed transactions. 2-phase commit.
  7. Eventual consistency. The CAP theorem. NoSQL systems. NewSQL systems.
  8. OLAP, data cubes. The data warehousing workflow, ETL. Data mining: Frequent itemsets (the a-priori algorithm), association rules. Clustering. Decision tree construction.
  9. Basics of big data machine learning.
  10. Realtime analytics: Data stream processing: DSMS and CEP systems. CQL. Window semantics and window joins. Load shedding. Sampling and approximating aggregates (no joins). Querying histograms. Maintaining histograms of streams. Incremental and online query processing: incremental view maintenance: materialized views, delta processing; online aggregation – sampling, ripple joins, error bounding.

Course Project

The project is in the space of large-scale data analysis and will draw together many of the ideas covered in the course. Projects will be worked on by teams of typically 7-10 students.

You are encouraged to propose your own project and recruit a team of students to work on it.

Not every project idea is suitable.

  • First, the project should require about the right amount of effort. We can advise you on this.
  • It has to be possible to distribute the work in a fair way among the team members.
  • Projects will be of two kinds: systems/infrastructure projects (building a big data file system, analytics engine, or such; or a library of scalable machine learning algorithms) or big data apps (the majority of last year’s projects were of this kind).
  • In the case of an app, you will need data, and it is an absolute requirement that you convince us that you will be able to get the data you need. In particular, having learned from last year’s project, we will not accept any project that requires access to Twitter data.

We will tell you at the beginning of the course how to propose your own project. If we consider the project proposal suitable for the course, we will do our best to support you.

In addition, we will make some project suggestions.

Every student has to join exactly one project team.

Project list

  • Temporal profiles
    by Sidney Bovet (co-team leader), Valentin Rutz (co-team leader), John Gaspoz, Zhivka Gucevska, Florian Christophe Junker, Mathieu Monney, Joanna Béatrice Salathé, Ana Manasovska, and Fabien Jolidon
    (https://github.com/SidneyBovet/smargn, wiki)
  • Transactional Key-Value Store
    by Philémon Favrod (co-team leader), Andy Roulin, Egeyar Özlen Bagcioglu, Florian Vincent Depraz, Georgios Piskas (co-team leader), and Sachin Basil John
    (https://github.com/philemonf/transactional-key-value-store-project, wiki)
  • Weather: Activity recommendation platform based on weather predictions
    by <anonymous>, Victor Constantin, Baptiste Vinh Mau, Frédéric Bonnand, Nemanja Drobnjak, Ivan Maslov, Roman Shirochenko (team leader)
    (https://github.com/BigData2015Team/bigWeatherData, wiki)
  • WikiLynx (Wikileaks Data Exploration)
    by Florian Briant, Frédéric Jan Ouwehand, Can Güzelhan, Deniz Taneli, Thierry Nyfeler, Martin Engilberge, Ketevani Zaridze, and Hussein Kassir (team leader)
    (https://github.com/fouweric/wikileaks-data-analytics, wiki, http://wikilynx.org/ (under construction))
    This project was selected by the students of the course as an Audience Choice Award (co-)runner-up!
     

Required Prior Knowledge

  • A basic course on database systems (e.g. covering parts III, IV, and V of Ramakrishnan and Gehrke on storage and indexing, query processing, and concurrency control).
  • You absolutely must master SQL, relational algebra, key and foreign key constraints, and the transaction concept.
  • Solid programming skills in Java.
  • Programming experience in a functional programming language (such as Scala) is strongly recommended.
  • Familiarity with working on a Unix-style operating system.
  • Basic knowledge of linear algebra (vector spaces, matrix multiplication), probability theory and statistics, and complexity theory (complexity classes, reductions, completeness, LOGSPACE, P, NP) is required.

Important Information

7 credits: 3 (lectures) + 2 (exercises) + 2 (project). This course is taught in English.

We use Moodle (go here for the course page). Please sign up in IS-Academia to take the course (this will automatically give you access to Moodle).

Plenary dates: Tuesdays 1:15-4pm in CM1. The first plenary will take place on Tuesday Feb. 17, 2015. Exercise dates: Wednesdays 10:15am-noon in INJ218. Project meetings: project teams decide on their own when to meet. (See also this page.)

It is important to attend the plenaries. Physically attending the exercises and project meetings is optional. We partially reverse the classroom: some lectures are provided on video and we use the plenaries for lectures as well as group work, discussions, and hands-on work.

Course staff: Christoph Koch (lecturer); Mohammad Dashti, Mohammed ElSeidy, Renata Khasanova, Amir Shaikhha, Immanuel Trummer, and Aleksandar Vitorovic (teaching assistants).

Office hours are by appointment (our email addresses are firstname.lastname@epfl.ch). Please use classes and the breaks between them to ask questions. Teaching assistants will be present in the exercise sessions.

Getting a Grade

This course uses in-course grading. Attendance at and active participation in the plenaries are very strongly suggested. Attending the exercises is optional, but the TAs spend a lot of time there, so please ask your questions there rather than requesting separate appointments. You have to arrange regular project meeting times within your project team; despite what the course catalog says, there is no specific day and time at which these meetings must take place.

The grade is determined based on

  • 5 homeworks (5 * 4% = 20%). Homework has to be done individually (collaboration is considered cheating) and is to be submitted on Tuesdays by the start of the plenary. Homeworks will usually be due every second week. Tentatively, HW1 is posted Feb. 24 and due Mar. 10, HW2 is posted Mar. 10 and due Mar. 24, HW3 is posted Mar. 24 and due Apr. 14, HW4 is posted Apr. 14 and due Apr. 28, and HW5 is posted Apr. 28 and due May 12.
  • A midterm exam (20%, March 31) and a final exam (30%, May 26). These are held in class (starting at 1:15pm, 120 minutes).
  • The course project (30%). The project will be worked on in student teams.

The grade scale is as follows: 6: >= 95%; 5.5: >= 90%; 5: >= 80%; 4.5: >= 65%; 4: >= 50%. Failure below 50%.
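
As an illustration of how the weights and the grade scale above combine, here is a minimal Python sketch. The function names and the input format (one percentage per component, with the five homeworks averaged into a single homework percentage) are assumptions made for this example, not part of the official grading procedure; in particular, the page does not say how borderline totals are rounded, so the sketch compares the unrounded total against the thresholds.

```python
# Weights in percentage points, from the grading breakdown above:
# homeworks 5 * 4% = 20%, midterm 20%, final 30%, project 30%.
WEIGHTS = {"homework": 20, "midterm": 20, "final": 30, "project": 30}

# Grade scale from the course page, as (minimum total percentage, grade),
# listed from the highest threshold to the lowest.
GRADE_SCALE = [(95, 6.0), (90, 5.5), (80, 5.0), (65, 4.5), (50, 4.0)]


def total_percentage(scores):
    """Weighted total in percent.

    `scores` maps each component name to a percentage in [0, 100]
    (hypothetical input format chosen for this sketch).
    """
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


def grade(total):
    """Return the grade for a total percentage, or None for a failure (< 50%)."""
    for minimum, value in GRADE_SCALE:
        if total >= minimum:
            return value
    return None  # below 50%: the page only states "failure", no numeric grade


example = {"homework": 90, "midterm": 80, "final": 70, "project": 100}
print(total_percentage(example), grade(total_percentage(example)))
```

For the example scores, the weighted total is 18 + 16 + 21 + 30 = 85%, which falls in the ">= 80%" band of the scale.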

Academic Integrity and Group Work

Exams and homeworks, and, unless stated otherwise, project work are to be done individually. (Projects are carried out in teams, but every team member will have some deliverables which will have to be worked on individually.)  Collaboration on these will be considered cheating.

All cases of cheating will be taken very seriously, and may lead to your expulsion from the university without graduation. We use plagiarism detection software that is not easy to trick.

In case of doubt in academic integrity matters, ask the instructor.