PyData Amsterdam 2023

Tables as Code: The Journey from Ad-hoc Scripts to Maintainable ETL Workflows at Booking.com
09-14, 12:00–12:30 (Europe/Amsterdam), Foo (main)

Until a few years ago, data science & engineering at Booking.com had grown largely in an ad-hoc manner. This growth led to a labyrinth of unrelated scripts representing Extract-Transform-Load (ETL) processes. Without options for quickly testing cross-application interfaces, maintenance and contribution grew unwieldy, and debugging in production was common practice.

Over the past several years, we’ve spearheaded a transition from isolated workflows to a well-structured, community-maintained monorepo, a task that required not just technical adaptation but also a cultural shift.

Central to this transformation is the adoption of the concept of "tables as code", an approach that has changed the way we write ETL. Our lightweight PySpark extension represents table metadata as a Python class, exposing data to code, and enabling efficient unit test setup and validation.
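To make the idea concrete, here is a minimal sketch of what "table metadata as a Python class" could look like. It is not the Booking.com extension, whose API is not shown in this abstract: the names `Column`, `Table`, `Reservations`, `example_row`, and `validate` are all hypothetical, and the sketch uses only the standard library rather than PySpark so that it runs anywhere.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, ClassVar


@dataclass(frozen=True)
class Column:
    """Hypothetical column descriptor: name, Python type, nullability."""
    name: str
    dtype: type
    nullable: bool = False


class Table:
    """Hypothetical base class; subclasses declare metadata as class attributes."""
    database: ClassVar[str]
    name: ClassVar[str]
    columns: ClassVar[tuple[Column, ...]]

    @classmethod
    def full_name(cls) -> str:
        return f"{cls.database}.{cls.name}"

    @classmethod
    def example_row(cls, **overrides: Any) -> dict[str, Any]:
        """Build a valid row for tests; callers override only what they care about."""
        defaults = {int: 0, str: "", float: 0.0, bool: False}
        row = {c.name: defaults[c.dtype] for c in cls.columns}
        unknown = set(overrides) - set(row)
        if unknown:
            raise KeyError(f"unknown columns: {sorted(unknown)}")
        row.update(overrides)
        return row

    @classmethod
    def validate(cls, row: dict[str, Any]) -> list[str]:
        """Return a list of problems; an empty list means the row matches the schema."""
        errors = [f"unexpected column: {n}"
                  for n in sorted(set(row) - {c.name for c in cls.columns})]
        for col in cls.columns:
            if col.name not in row:
                errors.append(f"missing column: {col.name}")
            elif row[col.name] is None:
                if not col.nullable:
                    errors.append(f"null in non-nullable column: {col.name}")
            elif not isinstance(row[col.name], col.dtype):
                errors.append(f"wrong type for {col.name}: expected {col.dtype.__name__}")
        return errors


class Reservations(Table):
    """Illustrative table definition; the schema is invented for this example."""
    database = "demo"
    name = "reservations"
    columns = (
        Column("reservation_id", int),
        Column("hotel_name", str),
        Column("price_eur", float),
        Column("promo_code", str, nullable=True),
    )


if __name__ == "__main__":
    row = Reservations.example_row(reservation_id=7, hotel_name="Canal View")
    print(Reservations.full_name(), Reservations.validate(row))
```

Because the schema lives in code, a unit test can build a valid input row in one line and check a transformation's output against the declared columns, without touching a production table. In a PySpark setting the same class could plausibly also generate the schema for a test DataFrame, though that part is left out here.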

In this talk, we walk you through the “tables as code” design and complementary tools such as efficient unit testing, robust telemetry, and automated builds using Bazel. We also cover the transformation process, including how we enabled people with non-engineering backgrounds to create fully tested and maintainable ETL. This includes internal training, maintainers, and support strategies aimed at fostering a community knowledgeable in best practices.


This talk is aimed at ETL-adjacent data science practitioners, ideally those who have been wondering how to push code quality forward at a data-centric organization.

Introduction (5 minutes): We begin by shedding light on the infrastructure that hosted the old scripts, and discuss our motivation for change. It’s worth mentioning that this transformative decision emerged from individual product teams, not from an executive mandate.
Tables as Code (10 minutes): We'll then introduce the concept of 'tables as code', detailing how this approach enables efficient testing.
Monorepo Transformation (10 minutes): Building on this foundation, we'll explore how 'tables as code' grew into a vast monorepo with thousands of tests. We'll discuss how we scaled our processes and nurtured this project as a community effort.
Community Growth and Future Plans (5 minutes): In our closing segment, we'll share insights gained from growing this project as a community, highlight strategies for orchestrating training and community support, and finally share our future plans both within and outside our organization.


Prior Knowledge Expected

Previous knowledge expected

Bram van den Akker is a Senior Machine Learning Scientist at Booking.com with a background in Computer Science and Artificial Intelligence from the University of Amsterdam. At Booking.com, Bram has been one of the founders of bkng-data, an internal collection of Python tools aimed at improving code quality, testing, and streamlining CI/CD for data practitioners.

Aside from bkng-data, Bram's work focuses on bridging the gap between applied research and practical requirements for Bandit Feedback all across Booking.com. Previously, Bram held positions at Shopify, Panasonic & Eagle Eye Networks. He has contributed peer-reviewed papers and tutorials to conferences and workshops such as TheWebConf (WWW), RecSys, and KDD, including a best-paper award.

Jon Smith is a Senior Machine Learning Scientist at Booking.com, where he has worked in fraud detection and performance marketing. In these areas, he focusses on strengthening software practices within critical ML systems by evangelising code quality and unit testing.

He studied Mathematics and Computer Science at Acadia University and Simon Fraser University in Canada, and spent some time as a Machine Learning Engineer at the Canadian Broadcasting Corporation.