Hasso-Plattner-Institut
Prof. Dr. Felix Naumann
 

Data Integration

Data integration is the merging of heterogeneous information from various data sources into a homogeneous, clean dataset. Despite research and development over the past 40 years, collecting and integrating data from multiple sources remains an important and challenging task in any data-oriented or data science project. This lecture covers the basic technologies, such as distributed database architectures, techniques for virtual and materialized integration, data profiling, and data cleansing technologies. It thus combines the previous foundational lectures on information integration and data profiling to lay a foundation for handling unknown data.

Further Information:

  • Lectures will be given in English.
  • Please enroll in our Moodle course by April 18, because we will use it to coordinate the exercises.

Exercises:

  • The exercises are led by Sebastian Schmidl.
  • You must pass the exercises to be admitted to the exam.
  • In the exercises, you will work in teams of two on four different assignments: two about data profiling (multivalued dependencies) and two about data cleaning (duplicate detection).
  • We will probably use five of the normal lecture slots to introduce and discuss the exercises.

Schedule

The course takes place on Mondays and Thursdays at 13:30 in L-E.03. Some lecture slots will be used for exercises.

Date                   Topic
Mon 8.4.2024           Introduction
Thu 11.4.2024          Introduction
Mon 15.4.2024          Distribution, autonomy, and heterogeneity
Thu 18.4.2024          Exercise 1: Data Profiling - Validation of Multivalued Dependencies (Publication Sheet 1)
Mon 22.4.2024          Adornments and Data Structures
Thu 25.4.2024          Data Profiling Introduction
Mon 29.4.2024          Unique Column Combinations and Keys
Wed 1.5.2024 (23:59)   Deadline Sheet 1, Publication Sheet 2
Thu 2.5.2024           Integration architectures
Mon 6.5.2024           Exercise 2: Data Profiling - Discovery of Multivalued Dependencies
Thu 9.5.2024           Ascension
Mon 13.5.2024          Integration architectures (Dr. Fabian Panse)
Thu 16.5.2024
Mon 20.5.2024          Pentecost
Thu 23.5.2024
Sun 26.5.2024 (23:59)  Deadline Sheet 2, Publication Sheet 3?
Mon 27.5.2024
Thu 30.5.2024
Mon 3.6.2024           Exercise 3?
Thu 6.6.2024
Mon 10.6.2024          *
Thu 13.6.2024          *
Sun 16.6.2024 (23:59)  Deadline Sheet 3, Publication Sheet 4?
Mon 17.6.2024
Thu 20.6.2024
Mon 24.6.2024          Exercise 4?
Thu 27.6.2024
Mon 1.7.2024
Thu 4.7.2024
Sun 7.7.2024 (23:59)   Deadline Sheet 4
Mon 8.7.2024
Thu 11.7.2024
Mon 15.7.2024          Exercise 5: Results and Exam Prep?
Thu 18.7.2024

The exam date is still to be determined (TBD).

Literature

Throughout the lecture, I will refer to various scientific papers that serve as in-depth references.

Exam

The course grade is based 100% on the written exam (approx. 3h), which takes place after the end of the teaching period. The requirements for admission to the exam are:

  • "Passing" all four exercises
  • At least one short presentation of an exercise solution