09-15, 14:20–14:50 (Europe/Amsterdam), Bar
Back in 2018, a blog post titled "Data's Inferno: 7 circles of data testing hell with Airflow" presented a layered approach to data quality checks in data applications and pipelines. Now, five years later, this talk looks back at Data's Inferno and surveys what has changed, and what hasn't, in the space of ensuring high data quality.
Five years ago, a blog post called "Data's Inferno" (https://medium.com/wbaa/datas-inferno-7-circles-of-data-testing-hell-with-airflow-cef4adff58d8) was written about how to ensure high data quality with Apache Airflow. It suggested using different types of tests as layers to catch issues lurking within the data. These layers included tests for Airflow DAG integrity, mock data pipelines, production data tests, and more. Combining these layers made for a reliable way to filter out incorrect data. Despite the blog post's age, its ideas are still relevant today. In the meantime, new tools have been developed to help improve data quality, along with new best practices. In this talk, we'll review the layers of Data's Inferno and how they contributed to improving data quality. We'll also look at how newer tools address the same concerns. Finally, we'll discuss how we expect and hope the data quality landscape to evolve in the future.
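To illustrate the layered idea in the abstract, here is a minimal, hypothetical sketch in plain Python: each layer is a cheap check, and a record must pass every layer before it is accepted. The field names and rules are illustrative assumptions, not taken from the original blog post or tied to any specific Airflow setup.

```python
# A minimal sketch of layered data quality checks (illustrative only).
# Each layer is a predicate; a record must pass all layers to be accepted.

def schema_layer(record):
    """Layer 1: the record has the expected fields and types."""
    return (isinstance(record.get("user_id"), int)
            and isinstance(record.get("amount"), (int, float)))

def sanity_layer(record):
    """Layer 2: values fall within plausible bounds."""
    return record["amount"] >= 0

# Ordered from cheap/structural to semantic; all() short-circuits,
# so later layers can safely assume earlier ones passed.
LAYERS = [schema_layer, sanity_layer]

def passes_all_layers(record):
    return all(layer(record) for layer in LAYERS)

records = [
    {"user_id": 1, "amount": 9.99},   # valid
    {"user_id": 2, "amount": -5},     # fails the sanity layer
    {"user_id": "x", "amount": 1.0},  # fails the schema layer
]
clean = [r for r in records if passes_all_layers(r)]
```

In practice, such layers would run at different stages of a pipeline (CI, staging, production) rather than in a single function, which is the structure the talk revisits.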
Previous knowledge expected
Daniel van der Ende is a Data Engineer at Xebia Data. He enjoys working on high-performance distributed computation with Spark, empowering data scientists by helping them run their models efficiently on very large datasets. He is an Apache Spark and Apache Airflow contributor and a speaker at conferences and meetups.