CS54100: Database Systems

Instructor: Mohammad Sadoghi
E-mail: [email protected]
Office: Lawson 2116N
Office Hours: By email appointments

Teaching Fellow: Spencer S Pearson
Email: [email protected]
Office: HAAS G50
Office Hours: 3:00-4:00pm (Tue) & 10:30-11:30am (Fri)

Time: MWF 9:30 am - 10:20 am
Location: Lawson Computer Science Building, Room LWSN-1106


Overview

This graduate course provides an introduction to the design and development of fundemental concepts in relational database management systems (DBMS). You will learn the theory behind database systems, the issues that affect their functionality and performance, and most importantly, what it takes to build the engine of a relational database management system and to explore the role of modern data processing platforms, e.g., Apache Spark/Hadoop. There is a significant component of Java/SQL/Scala development. There will also be hands-on exercises using the Oracle DBMS and Spark/Hadoop/HDFS. The list of topics covered in this course includes:

  • DBMS Concepts and Architecture
  • Relational Databases, Relational Algebra and Calculus
  • Query Languages (SQL)
  • Storage and Indexing
  • Query Evaluation and Optimization
  • Database Design, the E-R Model, Normalization, and Tuning
  • Database Security
  • Management, Concurrency Control, and Recovery
  • Parallel and Distributed Databases
  • New Trends in Data Systems (e.g., NoSQL Databases)

Syllabus

Textbooks:

Required:
  • "Database Management Systems", 3rd Edition. Raghu Ramakrishnan and J. Gehrke. McGraw Hill, 2003, ISBN 0-07-246563-8.
  • Additionally, a list of research papers will be added later as part of the required reading for the advanced topics.
Optional:
  • "Oracle 10g Programming: A Primer". R. Sunderraman, Addison-Wesley. 2006
  • "Readings in Database Systems", 4th Edition. Joseph M. Hellerstein, Michael Stonebraker. 2005
  • "The Elements of Style", 4th Edition. William Strunk Jr. and E. B. White. 1999
  • "Style: Toward Clarity and Grace". Joseph M. Williams. 1995

Workload:

There will be around four-five programming assignments. In each assignment, you will be asked to implement one component of a simple database management system. There will be an SQL programming assignment using Oracle. There will be around three homeworks, a final project, and a final exam.

In the final project, you will be asked to implement your favorite optimization strategies to improve query execution on Spark and/or Hadoop platforms. Additionally, you are required to submit a report that describes the employed techniques, evaluation methodology, and experimental results.

Programming Assignment:

The programming assignments are an important part of the course, and will involve a significant amount of Java programming. The first programming assignment will be SQL programming assignments using Oracle. The remaining assignments will be performed in teams of two. The purpose is for each team to build parts of a working single-user relational database management system. You will start almost from scratch -- a few basic components may be provided to you. By the end of the course, you will have built a simple DBMS by completing four separate assignments. In most of the assignments, you will be given Java class definitions with templates. You will need to actually implement the functions. Implementing the various interfaces involves several hundred lines of code. At the end of the programming assignments, you will understand the basic relational DBMS concepts because you have implemented them. Each of your assignments builds upon the code written in the previous assignments.

Unless extremely necessary, you should not change partners for programming assignments. You need to have the instructor’s prior approval to switch partners.

Final Project:

The final project is a crucial part of the course that prepares to undertake a small research project on modern database systems. It will provide an opportunity for you to apply the lessons that your learned throughout the course to a research problem of your choice. Thus, the final project is open-ended, but, it must* contain the following core elements:

  • studying and comparing the performance of Hadoop and Spark based on a standard database benchmark (TPC-H or BigBench) in order to understand the role of disk vs. main memory for query performance
  • exploring different query processing strategies over Spark and/or Hadoop (such as data partitioning/replication, materialized view, join reordering, and index selection)
  • writing a detailed (but concise) report that describes your approach, evaluation methodology, and experimental results (i.e., similar to a short research paper).
Note that the final project will involve a moderate amount of SQL/Java/Scala programming. The projects will be performed in teams of two. You may choose a different partner for the final project or you may continue working with the same partner that you had in the programming assignments. Unless extremely necessary, you should not change partners for the final project. You need to have the instructor’s prior approval to switch partners.

You may start your final project at anytime (it is recommened that you start as early as possible). The report for the final project is due on December 1, 2016.

*You may discuss with the instructor the opportunity to propose an alternative final project that is directly related to modern database systems. You need to have the instructor’s prior approval before start working on your own-defined final project proposal.

Grading (tentative):

The final grade will be based upon the following:
  • Homework Assignments: 20%
  • Programming Assignments: 30%
  • Final Project: 20%
  • Final Exam: 30%
    (Extra-credit can be earned in the final project)

Course Policy:

In this class, we adopt the following policy: http://spaf.cerias.purdue.edu/cpolicy.html

In addition, late submission will result in a 10% penalty for each day late. Students are strongly advised that any act of cheating will result in a score of 0 for the entire assignment and repeat offences will be reported to the Office of the Dean of Students and will result in an automatic F grade. You are encouraged to discuss problems and ideas but the final solution or code must be your own. In the event of a major campus emergency, course requirements, deadlines, and grading percentages are subject to changes that may be necessitated by a revised semester calendar. If such unusual circumstances arise, students may determine any such changes by contacting me via email ([email protected]rdue.edu).


Contents

Class Schedule (tentative):

  1. Introduction to Database Systems (Chapter 1)
    • The Entity-Relationship Model (Chapter 2)
    • The Relational Model (Chapter 3)
    • Advanced Topics: Overview of Modern Distributed Data Processing Systems (Hadoop, Spark)
  2. Database Query Languages
  3. Storage and Indexing
  4. Query Evaluation
  5. Database Design & Tuning
  6. Transaction Management
    • Transaction Properties (Chapter 18)
    • Concurrency Control (Chapter 19)
    • Crash Recovery (Chapter 20)
    • Advanced Topics: In-memory Databases (Microsoft Hekaton: Paper)
    • Advanced Topics: Multi-version Optimistic Concurrency Control (2VCC: Paper)
    • Advanced Topics: Distributed Deterministic Concurrency Control (Calvin: Paper, Distributed Workflow: Paper)
    • Advanced Topics: Avoiding Distributed Coordination Using Partitioning & Replication (Schism: Paper)
  7. Index & Storage Management
    • Advanced Topics: Index Maintenance (Indirection: Paper)
    • Advanced Topics: Resilient Distributed Datasets (RDDs: Paper, Slides)
    • Advanced Topics: Lineage-based Storage Architecture (L-Store: Paper, DL-Store: Paper)


Announcements

December 12, 2016: Just a friendly reminder, the final exam will be tomorrow on December 13 from 10:30-12:30pm in the room LWSN B155. Good luck everyone!

December 12, 2016: It is my pleasure to announce that we had an exciting and diverse set of final projects this year. The topics spanned

    • Deep Embedded Join (John Moore and Evan Hanau)
    • Supporting Semantic Queries in Spark using Word Embedding (Alina Nesen)
    • Analysis of TPC-C Benchmark using RedBlue Consistency Model (Fei Wang and Xingang Wang)
    • Fine-grained, Mutable Access on Taychon (Aman Preet Singh and Nikita Gupta)
    • Extending L-Store to Support Temporal and Dynamic Graphs (Chih-Hao Fang and I-Ta Lee)
    • Data Skipping Using Synopsis/Bitmaps in Spark (Anshu Maheshwari, Ishan Chawla, and Madhav Kapoor)
    • Data Skipping Using Synopsis in Spark (Aakanksha Choudhary and Vedant Mishra)
    • Sparkyon: Cost-based View Materialization in Spark through Tacyhon (Prashant Ravi and Henry Jebasingh Elilarasu)
    • Query Optimization in SparkSQL (Anil Kumar Reddy Pulakanti and Sai Chowdary Samineni)
    • Materialized View Selection Using Integer Programming (Tao Jiang and Yu Qiao)
    • Materialized View Selection Using Dynamic Programming (Chengzhang Li and Zhongjie Ma)
    • Accelerating SparkSQL and Hive Using Materialized Views (Samodya Abeysiriwardane and Karan Prabhu)
    • Evaluating Join Re-ordering Strategies in Hive (Emily Kossler and Ayush Parolia)
    • Building a Query Optimizer for Map-Reduce (Sowmya Rupa Siddareddy and Sneha Balasubramanian)
    • Hadoop vs. Spark: A Comparative Study (Alex Chirayath and Kartik Killawala)
    • Analyzing Hive and SparkSQL using TPC-H Benchmark (Guangtong Shen and Haizhou Mo)
November 17, 2016: The deadline for the final project has been extended to Friday, December 9 at the beginning of class (a firm deadline). Late projects will not be accepted! Please submit a hardcopy of your final project report (in class) and email an electronic version (including all your code, test cases, evaluation, report, etc.) to the instructor.

November 17, 2016: Programming Assignment 4 is out. It is due on Wednesday, December 7 on Blackboard (electronic submission). There will be a 10% penalty for each late day. After 5 late days, the homework will not be accepted.

November 11, 2016: Homework 3 is out. It is due on Monday, December 2 at the beginning of class. There will be a 10% penalty for each late day. After 5 late days, the homework will not be accepted.

October 31, 2016: Programming Assignment 3 is out. It is due on Wednesday, November 16 on Blackboard (electronic submission). There will be a 10% penalty for each late day. After 5 late days, the homework will not be accepted.

October 14, 2016: Homework 2 is out. It is due on Monday, October 31 at the beginning of class. There will be a 10% penalty for each late day. After 5 late days, the homework will not be accepted.

October 5, 2016: Programming Assignment 2 is out. It is due on Tuesday, October 25 on Blackboard (electronic submission). There will be a 10% penalty for each late day. After 5 late days, the homework will not be accepted.

October 3, 2016: To further help you progress on the final project, we are providing a detailed handout that describes how to run a subset of TPC-H queries on Hive and SparkSQL. To access our Hadoop/Spark cluster, please read the cluster setup handout.

September 23, 2016: Please submit a one-page final project proposal (as a team) by Friday, October 7 at the beginning of the class.

September 9, 2016: Programming Assignment 1 is out. It is due on Friday, September 30 on Blackboard (electronic submission). There will be a 10% penalty for each late day. After 5 late days, the homework will not be accepted.

September 9, 2016: Homework 1 is out. It is due on Friday, September 23 at the beginning of class. There will be a 10% penalty for each late day. After 5 late days, the homework will not be accepted.

September 7, 2016: To help you get started on the final project, we have set up a Hadoop/Spark cluster. For more details, please read the cluster setup handout. The final project deadline is on December 1, 2016.

September 4, 2016: We will have a guest lecture by ExxonMobil on Wednesday, September 6 titled "ExxonMobil Databases and Analytics – Unleash your Data Potential".

August 22, 2016: Welcome to CS 541.

Handouts

Course materials/grades will be made available on your Blackboard account.