PyData Amsterdam 2023

08:00
60min
Registration & Breakfast
Foo (main)
08:00
60min
Registration & Breakfast
Bar
08:00
60min
Registration & Breakfast
Qux
08:00
60min
Registration & Breakfast
Hello, World! (Tutorials)
09:00
30min
Opening notes
Foo (main)
09:30
50min
Keynote "Build and keep your context window"
Vicki Boykis

What can we learn from engineering, the history of machine learning, fantasy books, the early 1990s internet, and art history about how to be successful engineers in the modern-day data landscape? We’ll learn together in this talk.

Foo (main)
10:30
80min
Designing a Machine Learning System
Jetze Schuurmans, Roy van Santen

Are you a machine learning practitioner struggling with designing, reasoning, and communicating about ML systems? Then this session is for you! With the industry moving towards end-to-end ML teams to enable them to implement MLOps practices, it is paramount for you to understand ML from a systems perspective. In this hands-on session, you will gain a thorough understanding of the technical intricacies of designing valuable, reliable and scalable ML systems.

Hello, World! (Tutorials)
10:30
30min
Extend your scikit-learn workflow with Hugging Face and skorch
Benjamin Bossan

Discover how to bridge the gap between traditional machine learning and the rapidly evolving world of AI with skorch. This package integrates the Hugging Face ecosystem while adhering to the familiar scikit-learn API. We will explore fine-tuning of pre-trained models, creating our own tokenizers, accelerating model training, and leveraging Large Language Models.

Bar
10:30
30min
Forecasting Customer Lifetime Value (CLTV) for Marketing Campaigns under Uncertainty with PySTAN
Raphael de Brito Tamaki

In this talk, we discuss how we can use the Python package PySTAN to estimate the Lifetime Value (LTV) of the users that can be acquired from a marketing campaign, and how to use this estimate to find the optimal bidding strategy when the LTV estimate itself carries uncertainty. Throughout the presentation, we highlight the benefits of using Bayesian modeling to estimate LTV, and the potential pitfalls when forecasting LTV. By the end of the presentation, attendees will have a solid understanding of how to use PySTAN to estimate LTV, optimize their marketing campaign bidding strategies, and implement the best Bayesian modelling solution. All of the contents and numbers in this presentation can be found in the shared Git repository.

Qux
10:30
30min
What the PDEP? An overview of some upcoming pandas changes
Joris Van den Bossche

Last year, the pandas community adopted a new process for making significant changes to the library: the Pandas Enhancement Proposals, aka PDEPs. In the meantime, several of those proposals have been proposed and discussed, and some already accepted. This talk will give an overview of some of the behavioural changes you can expect as a pandas user.

Foo (main)
11:00
20min
Coffee Break
Foo (main)
11:00
20min
Coffee Break
Bar
11:00
20min
Coffee Break
Qux
11:20
30min
Causal Inference Libraries: What They Do, What I'd Like Them To Do
Kevin Klein

This talk will explore the Python tooling and ecosystem for estimating conditional average treatment effects (CATEs) in a Causal Inference setting. Using real world-examples, it will compare and contrast the pros and cons of various existing libraries as well as outline desirable functionalities not currently offered by any public library.

Qux
11:20
30min
From Vision to Action: Designing and Deploying Effective Computer Vision Pipelines
Wesley Boelrijk, Jeroen Rombouts

In the world of computer vision, the focus is often on cutting-edge neural network architectures. However, the true impact usually lies in designing a robust system around the model to solve real-world business challenges. In this talk, we guide you through the process of building practical computer vision pipelines that leverage techniques such as segmentation, classification, and object tracking, demonstrated by our predictive maintenance application at Port of Rotterdam. Whether you're an experienced expert seeking production-worthy pipelines or a novice with a background in data science or engineering eager to dive into image and video processing, we will explore the use of open-source tools to develop and deploy computer vision applications.

Bar
11:20
30min
Turning your Data/AI algorithms into full web applications in no time with Taipy
Florian Jacta, Alexandre Sajus

Numerous packages exist within the Python open-source ecosystem for algorithm building and data visualization. However, a significant challenge persists, with over 85% of Data Science Pilots failing to transition to the production stage.

This talk introduces Taipy, an open-source Python library for front-end and back-end development. It enables Data Scientists and Python Developers to create pilots and production-ready applications for end-users.

Its syntax facilitates the creation of interactive, customizable, and multi-page dashboards with augmented Markdown. Without the need for web development expertise (no CSS or HTML), users can generate highly interactive interfaces.

Additionally, Taipy is engineered to construct robust and tailored data-driven back-end applications. Intuitive components like pipelines and data flow orchestration empower users to organize and manage data effectively. Taipy also introduces a unique Scenario Management functionality, facilitating "what-if" analysis for data scientists and end-users.

Foo (main)
12:00
30min
Reliable and Scalable ML Serving: Best Practices for Online Model Deployment
Ziad Al Moubayed

Having worked on ML serving for a couple of years, we have learned a lot. In this talk, I share a set of best practices and lessons learned with the community.

Bar
12:00
30min
Staggered Difference-in-Differences in Practice: Causal Insights from the Music Industry
Nazli M. Alagoz

The Difference-in-Differences (DiD) methodology is a popular causal inference method utilized by leading tech firms such as Microsoft Research, LinkedIn, Meta, and Uber. Yet recent studies suggest that traditional DiD methods may have significant limitations when treatment timings differ. An effective alternative is the implementation of the staggered DiD design. We exemplify this by investigating an interesting question in the music industry: Does featuring a song in TV shows influence its popularity, and are there specific factors that could moderate this impact?
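The 2x2 version of the estimator behind this approach is simple enough to sketch in a few lines of plain Python; the numbers below are invented for illustration and are not from the talk:

```python
# Canonical 2x2 difference-in-differences: the treatment effect is the
# change in the treated group minus the change in the control group.
# Illustrative numbers only (e.g., weekly streams of featured vs.
# non-featured songs).

def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Return the 2x2 DiD estimate of the average treatment effect."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Featured songs grew by 50 streams, non-featured by 10: effect = 40.
effect = did_estimate(treated_pre=100, treated_post=150,
                      control_pre=80, control_post=90)
print(effect)  # 40
```

The staggered designs the talk covers generalize this by handling units that receive treatment at different times, where the naive two-period comparison breaks down.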

Qux
12:00
30min
Tables as Code: The Journey from Ad-hoc Scripts to Maintainable ETL Workflows at Booking.com
Bram van den Akker, Jon Smith

Until a few years ago, data science & engineering at Booking.com had grown largely in an ad-hoc manner. This growth has led to a labyrinth of unrelated scripts representing Extract-Transform-Load (ETL) processes. Without options for quickly testing cross-application interfaces, maintenance and contribution grew unwieldy, and debugging in production was a common practice.

Over the past several years, we’ve spearheaded a transition from isolated workflows to a well-structured community-maintained monorepo - a task that required not just technical adaptation, but also a cultural shift.

Central to this transformation is the adoption of the concept of "tables as code", an approach that has changed the way we write ETL. Our lightweight PySpark extension represents table metadata as a Python class, exposing data to code, and enabling efficient unit test setup and validation.

In this talk, we walk you through “tables as code” design and complementary tools such as efficient unit testing, robust telemetry, and automated builds using Bazel. Moreover, we will cover the transformation process, including enabling people with non-engineering backgrounds to create fully tested and maintainable ETL. This includes internal training, maintainers, and support strategies aimed at fostering a community knowledgeable in best practices.
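As a rough illustration of the "tables as code" idea (all names here are invented; this is not Booking.com's actual PySpark extension), table metadata can live in a plain Python class so ETL code and unit tests reference columns by attribute instead of hard-coded strings:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str
    dtype: str
    nullable: bool = True

@dataclass(frozen=True)
class BookingsTable:
    # Hypothetical table definition: metadata is data the code can see.
    database: str = "reservations"
    name: str = "bookings"
    columns: tuple = (
        Column("booking_id", "bigint", nullable=False),
        Column("checkin_date", "date"),
        Column("total_price", "decimal(10,2)"),
    )

    def qualified_name(self) -> str:
        return f"{self.database}.{self.name}"

    def column_names(self) -> list:
        return [c.name for c in self.columns]

table = BookingsTable()
print(table.qualified_name())  # reservations.bookings
print(table.column_names())    # ['booking_id', 'checkin_date', 'total_price']
```

With the schema available as an object, a test can build an empty in-memory frame from `columns` and validate a transformation against it without touching production data.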

Foo (main)
12:00
60min
Uncertainty visualization with ArviZ
Oriol Abril Pla

Learn how to visualize uncertainty in parameters or predictions using multiple visualizations adapted to your data and task

Hello, World! (Tutorials)
12:30
60min
Lunch
Foo (main)
12:30
60min
Lunch
Bar
12:30
60min
Lunch
Qux
13:00
30min
Lunch
Hello, World! (Tutorials)
13:30
30min
Don’t judge a book by its cover: Using LLM created datasets to train models that detect literary features
Wessel Sandtke

Existing book recommendation systems like Goodreads are based on correlating the reading habits of people. But what if you want a humorous book? Or a book that is set in 19th century Paris? Or a thriller, but without violence?
We build book recommendation systems for Dutch libraries based on more than a dozen features from historical setting, to writing style, to main character characteristics. This allows us to tailor each recommendation to individual readers.

Qux
13:30
30min
Graph Neural Networks for Real World Fraud Detection
Feng Zhao, Tingting Qiao

Fraud is a major problem for financial services companies. As fraudsters change tactics, our detection methods need to get smarter. Graph neural networks (GNNs) are a promising model to improve detection performance. Unlike traditional machine learning models or rule-based engines, GNNs can effectively learn from subtle relationships by aggregating neighborhood information in the financial transaction networks. However, it remains a challenge to adopt this new approach in production.

The goal of this talk is to share best practices for building a production ready GNN solution and hopefully spark your interest to apply GNNs to your own use cases.

Foo (main)
13:30
30min
In-Process Analytical Data Management with DuckDB
Hannes Mühleisen, Mark Raasveldt

DuckDB is a novel analytical data management system. DuckDB supports complex queries, has no external dependencies, and is deeply integrated into the Python ecosystem. Because DuckDB runs in the same process, no serialization or socket communication has to occur, making data transfer virtually instantaneous. For example, DuckDB can directly query Pandas data frames faster than Pandas itself. In our talk, we will describe the user values of DuckDB, and how it can be used to improve their day-to-day lives through automatic parallelization, efficient operators and out-of-core operations.

Bar
13:30
90min
Probabilistic predictions: probabilistic forecasting with sktime and probabilistic regression with skpro
sktime community

Probabilistic predictions are predictions that include some statements about uncertainty of the prediction, e.g., prediction intervals that make statements about a likely range of values that a prediction can take.
This workshop gives an introduction on making probabilistic predictions with the sktime and skpro python packages, for forecasting and supervised regression. Both packages are sklearn-compatible, built using skbase, with composable and modular interfaces.
The presentation includes a practical primer of different types of probabilistic predictions, algorithms and estimators, and evaluation workflows, with python code examples.

Hello, World! (Tutorials)
14:10
30min
Declarative data manipulation pipeline with Dagster
Riccardo Amadio

Bored of your old pipeline orchestrator? Finding it hard to tell whether your data is up to date? Struggling with the development workflow of your data pipelines?
Dagster, an open-source tool, offers a unique paradigm that simplifies the orchestration and management of data pipelines.
By adopting declarative principles, data engineers and data scientists can build scalable, maintainable, and reliable pipelines effortlessly.
We will begin with an introduction to Dagster, covering its fundamental concepts to ensure a comprehensive understanding of the material.
Subsequently, we will explore practical scenarios and use cases, also bringing in dbt to harness the power of SQL.

Minutes 0-5: The design-pattern problems of current data pipeline frameworks.
Minutes 5-15: Introduction to Dagster and its core concepts.
Minutes 15-25: Practical examples of building declarative data pipelines with Dagster, including dbt integration via Dagster's gRPC server.
Minutes 25-30: Q&A and conclusion.

Bar
14:10
30min
Mind the language: how to monitor NLP and LLM in production
Emeli Dral

How can you evaluate your production models when the data is not structured and you have no labels? To start, by tracking patterns and changes in the input data and model outputs. In this talk, I will give an overview of the possible approaches to monitor NLP and LLM models: from embedding drift detection to using regular expressions.
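On the simplest end of that spectrum, regex-based checks need nothing beyond the standard library; the patterns below are illustrative examples, not the speaker's:

```python
import re

# Minimal sketch of regex-based monitoring for LLM outputs: track how
# often generations match patterns you care about (refusals, leaked
# email addresses, etc.). Patterns are illustrative, not exhaustive.

PATTERNS = {
    "refusal": re.compile(r"\b(I cannot|I can't|as an AI)\b", re.IGNORECASE),
    "email_leak": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
}

def monitor(outputs):
    """Return the fraction of outputs matching each pattern."""
    counts = {name: 0 for name in PATTERNS}
    for text in outputs:
        for name, pattern in PATTERNS.items():
            if pattern.search(text):
                counts[name] += 1
    return {name: count / len(outputs) for name, count in counts.items()}

rates = monitor([
    "As an AI, I cannot help with that.",
    "Contact support at help@example.com.",
    "Your order ships tomorrow.",
])
print(rates)
```

Tracking these fractions over time gives a label-free signal: a sudden jump in the refusal rate is worth investigating even before anyone files a ticket.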

Qux
14:10
30min
Promptly Evaluating Prompts with Bayesian Tournaments
Andy Kitchen

Pick your next hot LLM prompt using a Bayesian tournament! Get a quick LLM dopamine hit with a side of decision theory vegetables. It's Bayesian Thunderdome: many prompts enter, one prompt leaves.
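One way to read "Bayesian tournament" (a guess at the mechanics, not necessarily the speaker's method) is Thompson sampling over per-prompt win-rate posteriors; a stdlib sketch with a simulated judge standing in for pairwise LLM evaluation:

```python
import random

random.seed(42)

# Each prompt keeps a Beta(wins + 1, losses + 1) posterior over its win
# rate; Thompson sampling picks which prompt to evaluate next.
TRUE_WIN_RATE = {"prompt_a": 0.7, "prompt_b": 0.5, "prompt_c": 0.3}
wins = {p: 0 for p in TRUE_WIN_RATE}
losses = {p: 0 for p in TRUE_WIN_RATE}

def judge(prompt):
    # Stand-in for a real LLM judge; returns True on a "win".
    return random.random() < TRUE_WIN_RATE[prompt]

for _ in range(300):
    # Draw from each posterior, evaluate the prompt with the best draw.
    draws = {p: random.betavariate(wins[p] + 1, losses[p] + 1) for p in wins}
    chosen = max(draws, key=draws.get)
    if judge(chosen):
        wins[chosen] += 1
    else:
        losses[chosen] += 1

# Declare the winner by posterior mean win rate.
best = max(wins, key=lambda p: (wins[p] + 1) / (wins[p] + losses[p] + 2))
print(best)
```

The bandit framing means the evaluation budget concentrates on promising prompts automatically: many prompts enter, most get only a handful of trials, one prompt leaves.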

Foo (main)
14:40
30min
Snack Break
Foo (main)
14:40
30min
Snack Break
Bar
14:40
30min
Snack Break
Qux
15:10
30min
Power Users, Long Tail Users, and Everything In Between: Choosing Meaningful Metrics and KPIs for Product Strategy
Alon Nir, Dror A. Guldin

Data scientists in industry often have to wear many hats. They must navigate statistical validity, business acumen and strategic thinking, while also representing the end user. In this talk, we will talk about the pillars that make a metric the right one for a job, and how to choose appropriate Key Performance Indicators (KPIs) to drive product success and strategic gains.

Foo (main)
15:10
30min
PyIceberg: Tipping your toes into the petabyte data-lake
Fokko Driesprong

With Apache Iceberg, you store your big data in the cloud as files (e.g., Parquet), but then query it as if it’s a plain SQL table. You enjoy the endless scalability of the cloud, without having to worry about how to store, partition, or query your data efficiently. PyIceberg is the Python implementation of Apache Iceberg that loads your Iceberg tables into PyArrow (pandas), DuckDB, or any of your preferred engines for doing data science. This means that with PyIceberg, you can tap into big data easily by only using Python. It’s time to say goodbye to the ancient Hadoop-based frameworks of the past! In this talk, you'll learn why you need Iceberg, how to use it, and why it is so fast.

Bar
15:10
30min
Revealing the True Motives of News Readers
Jurriaan Nagelkerke, Vincent Smeets

Every news consumer has needs and in order to build a true bond with your customer it is vital to meet these, sometimes, diverse needs. To achieve this, first of all, it is important to identify the overarching needs of users; the reason why they read news. The BBC conducted research to determine these needs and identified six distinct categories: Update me, Keep me on trend, Give me perspective, Educate me, Divert me, and Inspire me. Their research showed that an equal distribution of content across these user needs will lead to higher customer engagement and loyalty. To apply this concept within DPG Media, we started building our own user needs model. Through various iterations of text labelling, text preparation, model building, fine-tuning and evaluation, we have arrived at a BERT model that is capable of determining the associated user needs based solely on the article text.

Qux
15:10
30min
Unconference: Interviews: Tips and Stories from Both Sides
Kemal Tugrul Yesilbek

You are a data scientist or a machine learning engineer, and you applied for a position. You thought your interview went well, but still got a negative response... What might have gone wrong? In this talk, we will explore how things can go wrong from both the applicant's and the interviewer's side, and what you can do about it.

Hello, World! (Tutorials)
15:50
40min
Keynote - Python for Imaging and Artificial Intelligence in Cultural Heritage
Robert G. Erdmann

For many people, a museum is the last place they would expect to find cutting-edge data science, but the world of cultural heritage is full of fascinating challenges for imaging and computation.  The availability of high-resolution imaging, high-speed internet, and modern computational tools allows us to image cultural heritage objects in staggering detail and with a wide array of techniques.  The result, though, is a data deluge: studying single objects like Rembrandt's Night Watch can generate terabytes of data, and there are millions of objects in the world's museums.  

The huge Python ecosystem enables us to build tools to process, analyze, and visualize these data.  Examples include creating the 717 gigapixel (!) image of the Night Watch and reconstructing the painting's long-lost missing pieces using AI; controlling a camera and automated turntable in Jupyter for 3D object photography; revealing hidden watermarks in works on paper using a hybrid physics and deep learning-based ink-removal model; using chemical imaging and convolutional neural networks to see the hidden structure of Rembrandt and Vermeer paintings; and using a webcam or smartphone camera to do real-time similarity search over a database of 2.3 million open-access cultural heritage images at 4 frames per second.

These and several other live demonstrations show how Python is essential in our work to help the world access, preserve, and understand its cultural heritage.

Foo (main)
16:30
45min
Lightning talks


Foo (main)
17:30
150min
Kickstart AI sponsored drinks [at Lowlander Botanical Bar]

Drinks hosted by Kickstart AI at Lowlander Botanical Bar and Kitchen [located next to the venue]. Note that there’s only limited spots available so make sure to sign up to secure your spot! Sign up via https://meetup-september14.kickstartai-events.org/

Foo (main)
08:00
60min
Breakfast
Foo (main)
08:00
60min
Breakfast
Bar
08:00
60min
Breakfast
Qux
08:00
60min
Breakfast
Hello, World! (Tutorials)
09:00
50min
Keynote "AI Without Dystopia"
Katharine Jarmul

Many of us have heard terms like Data for Good, Ethical Machine Learning, Human-Centric Product Design, but those words also bring forward questions -- if we need "Ethical ML" what is the rest of machine learning? The current conversation around AI Doom paints a picture where AI goes hand-in-hand with dystopian outcomes. In this keynote, we'll explore what AI could look like if at the core, it was led by these ideals. What if distributed, communal machine learning were a central focus? What if privacy and user choice were a part of our everyday machine learning frameworks? What if aid organizations, governments, coalitions helped shape the problems for AI research? Let's ponder these questions and their outcomes together, imagining AI without the potential for dystopia.

Foo (main)
10:00
30min
A data-driven approach for distributing scarce goods within the REWE retail supply chain
Robert Grimm, Niklas Amberg

Global political circumstances and unpredictable crises such as the Covid-19 pandemic can cause a scarcity of grocery goods within the retail supply chain. We present a data-driven approach to ensure a fair and replicable distribution from the supplier to the retail warehouses at REWE, one of the largest grocery chains in Germany.

Bar
10:00
30min
Innovation in the Age of Regulation: Federated Learning with Flower
Krishi Sharma

With the rise of data privacy concerns around AI in the EU, how can we innovate using AI capabilities despite regulations around consumer data? What tools and features are available to help us build AI in regulated industries? This talk will discuss how we can leverage diverse datasets to build better AI models without ever having to touch the datasets by using a Python library called Flower.
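The aggregation step at the heart of this approach can be sketched without any framework; this is a plain-Python illustration of FedAvg-style weighted averaging, not Flower's actual API:

```python
# In federated learning, raw data never leaves the clients: each client
# trains locally and sends only model weights, which the server averages
# weighted by client dataset size. Pure-Python stand-in, no real training.

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client weight vectors."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            averaged[i] += w * size / total
    return averaged

# Two clients: one with 100 samples, one with 300.
global_weights = fedavg([[1.0, 2.0], [2.0, 4.0]], [100, 300])
print(global_weights)  # [1.75, 3.5]
```

The regulatory appeal is exactly this data flow: the server sees parameter updates, never the underlying consumer records.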

Qux
10:00
30min
Personalization at Uber scale via causal-driven machine learning
Okke van der Wal

In this talk, we outline how we introduced causality into our machine learning models within the core checkout and onboarding experiences globally, thereby strongly improving our key business metrics. We discuss case studies, where experimental data were combined with machine learning in order to create value for our users and personalize their experiences, and we share our lessons learned with the goal to inspire attendees to start incorporating causality into their machine learning solutions. Additionally, we explain how the open source Python package developed at Uber, CausalML, can help others in successfully making the transition from correlation-driven machine learning to causal-driven machine learning.

Foo (main)
10:00
80min
Unlocking the Black Box: A practical guide to finding an alibi for machine learning models
Ramon Perez

Knowledge work is undergoing a transformative journey with machine learning (ML) but the interpretability of the models we interact with is still lagging behind the coolness and hype of the technologies using ML. This workshop seeks to address the gap between the speed at which we use and adopt ML and the pace at which we understand it. During the workshop, we will cover fundamental concepts and techniques of interpretable machine learning, and explore various explainability methods supported by the Alibi Explain library so that you can get started explaining your models. If you've been meaning to dive deeper into the field of interpretable ML, add interpretability to your workflows, find an alibi for your models, or are simply curious about the field, come and join us for a fun 90-minute interactive session on interpretable ML.

Hello, World! (Tutorials)
10:30
20min
Coffee Break
Foo (main)
10:30
20min
Coffee Break
Bar
10:30
20min
Coffee Break
Qux
10:50
30min
Achieving developer autonomy on on-premise data clusters using Kubernetes.
Jorrick Sleijster

Maintaining on-premise clusters poses quite a few challenges. One of these challenges is achieving developer autonomy, where developers can deploy applications themselves. This talk will cover how we set up Kubernetes to achieve exactly that.

Qux
10:50
30min
Balancing the electricity grid with multi-level forecasting models
Rik van der Vlist

Join us as we explore the complexities of balancing the electricity grid amidst the rise of renewable energy sources. We’ll discover the challenges in forecasting electricity consumption from diverse industrial resources and the modelling techniques employed by Sympower to achieve accurate forecasts. Gain insights into the trade-offs involved in aggregating data at different hierarchical levels in time series forecasting.

Bar
10:50
30min
Harnessing uncertainty: the role of probabilistic time series forecasting in the renewable energy transition
Alexander Backus

How can probabilistic forecasting accelerate the renewable energy transition? The rapid growth of non-steerable and intermittent wind and solar power requires accurate forecasts and the ability to plan under uncertainty. In this talk, we will make a case for using probabilistic forecasts over deterministic forecasts. We will cover methods for generating and evaluating probabilistic forecasts, and discuss how probabilistic price and wind power forecasts can be combined to derive optimal short-term power trading strategies.
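One standard way to evaluate a probabilistic forecast is the pinball (quantile) loss; a minimal sketch with invented wind-power numbers (not from the talk):

```python
# Pinball loss scores a predicted quantile against the realized value,
# penalizing under- and over-prediction asymmetrically: at the 0.9
# quantile, being too low costs 9x more than being too high.

def pinball_loss(y_true, y_pred, quantile):
    if y_true >= y_pred:
        return quantile * (y_true - y_pred)
    return (1 - quantile) * (y_pred - y_true)

# Forecast the 0.9 quantile of wind power as 80 MW; 70 MW is realized.
print(pinball_loss(70, 80, 0.9))  # ≈ 1.0
# Under-prediction at the 0.9 quantile is penalized more heavily:
print(pinball_loss(90, 80, 0.9))  # ≈ 9.0
```

Averaging this loss over many forecasts and quantiles gives a single score for comparing probabilistic models, which is what makes them usable inside a trading strategy.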

Foo (main)
11:30
60min
Building a personal search engine with llama-index
Judith van Stegeren, Yorick van Pelt

Wouldn’t it be great to have a Google-like search engine, but then for your own text files and completely private? In this tutorial we’ll build a small personal search engine using open source library llama-index.

Hello, World! (Tutorials)
11:30
30min
Let’s exploit pickle, and `skops` to the rescue!
Adrin

Pickle files can be evil: simply loading them can run arbitrary code on your system. This talk presents why that is, how it can be exploited, and how skops is tackling the issue for scikit-learn/statistical ML models. We go through some of the lower-level pickle-related machinery, and go into detail on how the new format works.
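The core of the danger is easy to demonstrate with the standard library alone; here the payload is a harmless os.getcwd call, but it could be any callable:

```python
import os
import pickle

# __reduce__ lets a pickled object specify an arbitrary callable to run
# at load time. Unpickling this blob never creates a Payload at all: it
# just calls the function named in the pickle stream.

class Payload:
    def __reduce__(self):
        return (os.getcwd, ())  # executed during pickle.loads

blob = pickle.dumps(Payload())
result = pickle.loads(blob)   # runs os.getcwd() on the loading machine
print(result == os.getcwd())  # True
```

Swap os.getcwd for os.system and the loaded "model file" runs shell commands, which is why formats like skops avoid executing arbitrary callables on load.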

Qux
11:30
30min
Multimodal Product Demand Forecasting: From pixels on your screen to a meal on your plate
Maarten Sukel

The customers of Picnic use images and texts of products to decide if they like our products, so why not include those data streams in our Temporal Fusion Transformers that we use for Product Demand Forecasting?

Join us for a thrilling journey through convolutional, graph-based, and transformer-based architectures. Learn about methods to turn images, texts, and geographical information into features for other applications as we did for product demand forecasting. Discover how Picnic Technologies uses state-of-the-art multimodal approaches for demand forecasting to prevent food waste and keep our customers happy!

Bar
11:30
30min
The proof of the pudding is in the (way of) eating: quasi-experimental methods of causal inference and their practical pitfalls
Jakob Willisch

Data scientists and analysts are using quasi-experimental methods to make recommendations based on causality instead of randomized control trials. While these methods are easy to use, their assumptions can be complex to explain. This talk will explain these assumptions for data scientists and analysts without in-depth training in causal inference, so they can use and explain these methods more confidently to change people's minds using data.

Foo (main)
12:00
60min
Lunch Break
Foo (main)
12:00
60min
Lunch Break
Bar
12:00
60min
Lunch Break
Qux
12:30
30min
Lunch break
Hello, World! (Tutorials)
13:00
90min
Mastering Knowledge Graph Modeling with Neo4j: A Practical Tutorial
Panos Alexopoulos

This hands-on tutorial introduces participants to knowledge graph modeling using Neo4j, a popular graph database. Suitable for beginners and those seeking to enhance their knowledge, the tutorial will help attendees to learn the fundamentals of knowledge graphs, gain insights into Neo4j's modeling capabilities, and acquire practical skills in designing effective knowledge graph models.

Hello, World! (Tutorials)
13:00
30min
Mastering Recommendation Systems Evaluation: An A/B Testing Approach with Insights from the Industry
Ildar Safilo

Recommendation systems shape personalized experiences across various sectors, but evaluating their effectiveness remains a significant challenge. Drawing on experiences from industry leaders such as Booking.com, this talk introduces a robust, practical approach to A/B testing for assessing the quality of recommendation systems. The talk is designed for data scientists, statisticians, and business professionals, offering real-world insights and industry tricks on setting up A/B tests, interpreting results, and circumventing common pitfalls. While basic familiarity with recommendation systems and A/B testing is beneficial, it's not a prerequisite.

Qux
13:00
30min
Minimizing the Data Mesh Mess
Cor Zuurmond

This talk delves into the topic of minimizing the Data Mesh mess. We will explore practical strategies and a data platform architecture for effectively governing and managing data within a decentralized data setup. We can balance decentralization and maintaining data quality by imposing a few constraints. The takeaways of this talk are drawn from the data platform at Enza Zaden.

Bar
13:00
30min
Ok, doomer
Laura Summers

AI won't end the world, but it can and is making life miserable for plenty of folks. Instead of engaging with the AI overlords, let's explore a pragmatic set of design choices that all Data Scientists and ML devs can implement right now, to reduce the risks of deploying AI systems in the real world.

Foo (main)
13:40
30min
Encrypted Computation: What if decryption wasn't needed?
Katharine Jarmul

If you are curious about the field of cryptography and what it has to offer data science and machine learning, this talk is for you! We'll dive into the field of encrypted computation, where decryption isn't needed in order to perform calculations, transformations and operations on the data. You'll learn some of the core mathematical theory behind why and how this works, as well as the differences between approaches like homomorphic encryption and secure multi-party computation. At the end, you'll get some pointers and open-source library hints on where to go next and how to start using encrypted computation for problems you are solving the hard way (or not solving at all).
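One of the approaches mentioned, secure multi-party computation, builds on additive secret sharing; a minimal stdlib sketch (the prime modulus and party count are arbitrary choices for illustration):

```python
import random

# Additive secret sharing: a secret is split into random shares that sum
# to it modulo a prime. No single share reveals anything, yet parties can
# add their shares of two secrets locally and reconstruct only the sum.

P = 2**61 - 1  # a Mersenne prime as the modulus

def share(secret, n_parties=3):
    shares = [random.randrange(P) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    return sum(shares) % P

a, b = 42, 100
shares_a, shares_b = share(a), share(b)
# Each party adds its two shares locally; no one ever sees a or b.
sum_shares = [(sa + sb) % P for sa, sb in zip(shares_a, shares_b)]
print(reconstruct(sum_shares))  # 142
```

This is the "computation without decryption" idea in miniature: the addition happened entirely on shares, and only the final result was ever revealed.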

Bar
13:40
30min
LLM Agents 101: How I Gave ChatGPT Access to My To-Do List
Jordi Smit

ChatGPT is a fantastic assistant, but it cannot do everything yet. For example, it cannot automatically manage my calendar, update my to-do list, or do anything that requires it to perform actions. However, what would it take to make this a reality? I decided to put it to the test by allowing ChatGPT to manage my to-do list for me.

During this presentation, I will tell how I gave ChatGPT access to my to-do list. Along the way, I will introduce you to the concepts behind LLM-based agents and how they work. Of course, I will also give a demo of the final result. After this demo, we will dive into clever engineering solutions and tricks I discovered to solve problems such as handling hallucinations, parsing actions, etc.

This talk is for people who want to learn how to build their first LLM-based agent. Familiarity with Python, Pydantic, and LLMs is nice to have during this presentation but not essential. As long as you love overengineered solutions to a basic to-do list, you will like this presentation.

Foo (main)
13:40
30min
To One-Hot or Not: A guide to feature encoding and when to use what
Ana Chaloska

Have you ever struggled with a multitude of columns created by One Hot Encoder? Or decided to look beyond it, but found it hard to decide which feature encoder would be a good replacement?

Good news, there are many encoding techniques that have been developed to address different types of categorical data. This talk will provide an overview on various encoding methods available in data science, and a guidance on decision making about which one is appropriate for the data at hand.

Join this talk if you would like to hear about the importance of feature encoding and why it is important to not default to One Hot Encoding in every scenario. It will start with commonly used approaches and will progress into more advanced and powerful techniques which can help extract meaningful information from the data.

For each presented encoder, after this talk you will know:
- When to use it
- When NOT to use it
- Important considerations specific to the encoder
- Python library that offers a built-in method with the encoder, facilitating easy integration into feature engineering pipelines.
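The contrast between the two most common choices can be sketched in plain Python (libraries such as scikit-learn and category_encoders provide production-ready encoders):

```python
# One-hot vs. ordinal encoding of a categorical feature, by hand.

categories = ["red", "green", "blue"]

def one_hot(value):
    """One column per category; no order implied, width grows with cardinality."""
    return [1 if value == c else 0 for c in categories]

def ordinal(value, order=("red", "green", "blue")):
    """A single integer column; implies an ordering that may not exist."""
    return order.index(value)

print(one_hot("green"))  # [0, 1, 0]
print(ordinal("green"))  # 1
```

The trade-off in miniature: one-hot multiplies columns with cardinality, while ordinal stays compact but invents an order, which is why the more advanced encoders in the talk exist.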

Qux
14:20
30min
Cumulative Index Max in pandas
James Powell

How do we speed up a critical missing operation in pandas, the cumulative index max, and what does this tell us about the compromises and considerations we must bring to optimizing our code?
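Assuming "cumulative index max" means the index of the running maximum (the talk's exact definition may differ), a pure-Python reference implementation looks like this:

```python
# pandas ships cummax() (the running maximum value) but no built-in
# cumulative *index* max, i.e. the position of the running maximum.
# A straightforward O(n) reference implementation:

def cumulative_idxmax(values):
    """For each position, the index of the largest value seen so far."""
    result = []
    best_idx = 0
    for i, v in enumerate(values):
        if v > values[best_idx]:
            best_idx = i
        result.append(best_idx)
    return result

print(cumulative_idxmax([3, 1, 4, 1, 5, 9, 2]))  # [0, 0, 2, 2, 4, 5, 5]
```

A vectorized version is where the interesting compromises begin: expressing this index-tracking recurrence without a Python loop is exactly the kind of optimization question the talk digs into.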

Foo (main)
14:20
30min
Return to Data's Inferno: are the 7 layers of data testing hell still relevant?
Daniel van der Ende

Back in 2018, a blogpost titled "Data's Inferno: 7 circles of data testing hell with Airflow" presented a layered approach to data quality checks in data applications and pipelines. Now, 5 years later, this talk looks back at Data's Inferno and surveys what has changed but also what hasn't in the space of ensuring high data quality.

Bar
14:20
30min
Survival Analysis: a deep dive
Danial Senejohnny

Survival analysis was initially introduced to handle the data analysis required in use cases revolving around death and treatment in health care. Due to its merit, this method has spread to many other domains, such as finance, economics, sociology and engineering, for analyzing and modeling data where the outcome is the time until an event of interest occurs.

This talk aims at unraveling the potential of survival analysis with examples from different domains. A taxonomy of the existing descriptive and predictive analytics algorithms in survival analysis is presented. The concepts behind some candidate algorithms from each group are explained in detail, along with an example and implementation guidelines using the right open-source framework.
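As a taste of the descriptive side, here is a minimal sketch of the Kaplan-Meier estimator in plain Python; libraries such as lifelines provide full implementations:

```python
# Kaplan-Meier: the survival probability at each event time is the
# running product of (1 - deaths / at-risk), and censored observations
# count toward the at-risk set without counting as events.

def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each observed event time.

    durations: time until event or censoring; events: 1 = event, 0 = censored.
    """
    survival = 1.0
    curve = []
    for t in sorted(set(d for d, e in zip(durations, events) if e)):
        at_risk = sum(1 for d in durations if d >= t)
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

curve = kaplan_meier([2, 3, 3, 5, 7], [1, 1, 0, 1, 0])
print(curve)  # survival drops to 0.8, then 0.6, then 0.3 (up to float rounding)
```

The same estimator reads naturally in other domains: replace "death" with churn, default, or component failure and the mechanics are unchanged.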

Qux
14:40
14:40
30min
Unconference: How to Host a DEI Unconference
Arliss Collins

Lorem ipsum dolor

Hello, World! (Tutorials)
14:50
14:50
30min
Snack Break
Foo (main)
14:50
30min
Snack Break
Bar
14:50
30min
Snack Break
Qux
15:20
15:20
30min
Deep look into Deepfakes: Mastering Creation, Impact, and Detection
Maryam Miradi

Deepfakes, a form of synthetic media in which a person's image or video is seamlessly replaced using generative AI such as GANs, have received significant attention. This talk aims to provide a comprehensive exploration of deepfakes, covering their creation process, positive and negative effects, development pace, and tools for detection. By the end of the presentation, attendees will know how to create and detect deepfakes and will have a deep understanding of the technology and its impact.

Qux
15:20
30min
MLOps on the fly: Optimizing a feature store with DuckDB and ArrowFlight
Fabio Buso, Till Döhmen

Feature Stores are a vital part of the MLOps stack for managing machine learning features and ensuring data consistency. This talk introduces Feature Stores and the underlying data management architecture. We’ll then discuss the challenges and learnings of integrating DuckDB and Arrow Flight into our Feature Store platform, and share benchmarks showing up to 30x speedups compared to Spark/Hive. Discover how DuckDB and ArrowFlight can also speed up your data management and machine learning pipelines.

Foo (main)
15:20
30min
import full-focus as ff – How to reduce stress and pressure as a data specialist.
Maarten Oude Rikmanspoel

Data science, IT and software development are becoming more and more complex and are subject to increasing requirements and fast-paced business demand. Higher complexity, a higher pace and higher quality requirements put more pressure on our fellow data engineers and data scientists.

More pressure, but are we resilient enough to withstand it? You have probably already seen the outcome: unhappiness, stress or even burn-out among co-workers, instead of cool code, great solutions and a better world built with your skills.

How can you change the pressure and stress you perceive as a data scientist, data engineer or ML engineer? How can you ensure that your brain’s frontal lobe returns to a problem-solving and decision-making state?

Bar
16:00
16:00
50min
Keynote "Processing billions of tokens for training Large Language Models, tools and knowledge"
Thomas Wolf, Julien Launay, Guilherme Penedo, Alessandro Cappelli

Keynote by Thomas Wolf. He will be accompanied on stage by Alessandro Cappelli, Julien Launay & Guilherme Penedo, all members of the Hugging Face team in Amsterdam working on large model training.

Foo (main)
16:50
16:50
10min
Closing notes day 2
Foo (main)
17:00
17:00
60min
Social event at venue
Foo (main)
08:00
08:00
60min
Breakfast
Foo (main)
08:00
60min
Breakfast
Bar
08:00
60min
Breakfast
Qux
08:00
60min
Breakfast
Hello, World! (Tutorials)
09:00
09:00
50min
Keynote "Natural Intelligence is All You Need [tm]"
Vincent Warmerdam

In this talk I will try to show you what might happen if you allow yourself the creative freedom to rethink and reinvent common practices once in a while. As it turns out, in order to do that, natural intelligence is all you need. And we may start needing a lot of it in the near future.

Foo (main)
10:00
10:00
30min
Distillation Unleashed: Domain Knowledge Transfer with Compact Neural Networks
Hadi Abdi Khojasteh

This talk explores distillation learning, a powerful technique for compressing and transferring knowledge from larger neural networks to smaller, more efficient ones. It delves into its core components and various applications, such as model compression and transfer learning. The speaker aims to simplify the topic for all audiences and provides an implementation demonstrating how to apply distillation learning in real scenarios. Attendees will gain insights into developing efficient neural networks by reviewing various examples of compressing a complex model. The material will be available online for convenient access.
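
The core of the technique is small enough to sketch: the student is trained against the teacher's temperature-softened output distribution. A minimal NumPy version of the soft-target loss (my sketch; the talk's own examples may differ):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Softmax with temperature T; higher T gives softer targets."""
    z = np.asarray(logits, dtype=float) / T
    z -= z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the teacher's and the student's
    temperature-softened distributions (Hinton-style soft targets)."""
    p_teacher = softmax(teacher_logits, T)
    p_student = softmax(student_logits, T)
    return -np.sum(p_teacher * np.log(p_student + 1e-12))
```

In practice this term is combined with the ordinary hard-label loss, weighted by a mixing coefficient.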

Qux
10:00
80min
Generating Data Frames for your tests - using Pandas strategies in Hypothesis
Cheuk Ting Ho

Do you test your data pipeline? Do you use Hypothesis? In this workshop, we will use Hypothesis - a property-based testing framework to generate Pandas DataFrame for your tests, without involving any real data.

Hello, World! (Tutorials)
10:00
30min
Our journey using data and AI to help monitor wildlife in parks in Africa
Maël Deschamps, Simone Gayed Said

An exploration of the intersection between data, AI, and environmental conservation. In this talk, we will share our experiences and practical insights from our journey developing a system that uses Python, camera traps and data-driven techniques to help detect poachers in Africa.

Bar
10:00
30min
Transfer Learning in Boosting Models
Busra Cikla, Paul Zhutovsky

Did you know that you can do transfer learning on boosted forests too? Even today, we face business cases where the modelling sample is very small. This introduces uncertainty into the modelling results and, in some cases, makes modelling impossible. To counter this, we investigated transfer learning approaches for boosting models. In this talk, we will show the methods used and results from a real example in the credit risk domain.

Foo (main)
10:30
10:30
20min
Coffee Break
Foo (main)
10:30
20min
Coffee Break
Bar
10:30
20min
Coffee Break
Qux
10:50
10:50
30min
Data Contracts in action powered by Python open source ecosystem
Alyona Galyeva

This informative talk aims to close the gap between the theory of data contracts and their real-life implementations. It contains a few Python code snippets and is aimed primarily at data and software engineers. However, it could be food for thought for machine learning engineers, data scientists, and other data consumers.
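
Stripped of tooling, a data contract is an agreed schema that records are validated against before they cross a team boundary. A deliberately minimal stdlib-only sketch (mine, not from the talk):

```python
# A toy contract: field names and the types the consumer relies on.
CONTRACT = {
    "order_id": int,
    "amount_eur": float,
    "customer_id": str,
}

def violations(record: dict, contract: dict = CONTRACT) -> list:
    """Return human-readable contract violations (empty list = valid)."""
    problems = []
    for field, expected in contract.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            problems.append(f"{field}: expected {expected.__name__}, "
                            f"got {type(record[field]).__name__}")
    return problems
```

Real implementations add versioning, semantic checks and enforcement at pipeline boundaries, which is where the open-source ecosystem discussed in the talk comes in.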

Qux
10:50
30min
Enhancing Economic Outcomes: Leveraging Business Metrics for Machine Learning Model Optimization
Felipe Moraes

Optimizing machine learning models using regular metrics is a common practice in the industry. However, aligning model optimization with business metrics is closely tied to the objectives of the business and is highly valued by product managers and other stakeholders. This talk delves into the process of training machine learning models based on business metrics in order to enhance economic outcomes. With a primary focus on data scientists and machine learning practitioners, this talk explores techniques, methodologies, and real-world applications that harness the power of business metrics to propel machine learning models and foster business success. We will present a specific case study that demonstrates how we utilized business metrics at Booking.com to achieve significant impact on model performance and business outcomes. Specifically, we will discuss our approaches to leveraging business metrics for hyperparameter tuning and reducing model complexity, which instill greater confidence within our team when deploying improved models to production.

Foo (main)
10:50
30min
Standby detection with a human in the loop
Lieke Kools

In the Netherlands a large share of energy is used by industry. By measuring the energy usage of individual machines in real time it is possible to pinpoint when machines are operating inefficiently and help factories take measures to reduce energy waste. It turns out that in most factories, the biggest source of energy waste comes from idling machines. To be able to give valuable insights and provide relevant alerts to our customers, we set up a machine learning system for standby detection with a “human in the loop”. In this talk we will go over the considerations that go into setting up a machine learning system with a human in the loop and showcase our approach to the problem. No background knowledge is required for this talk.

Bar
11:30
11:30
30min
Building true Machine Learning MVPs: Validating the value chain as a product data scientist
Azamat Omuraliev

Some say machine learning projects fail because they live in notebooks.

But I would bet that even more of them fail because they solve a problem that doesn’t exist, or use an interface that isn’t feasible. In other words, they fail because they don’t validate their underlying assumptions.

Product analytics helps build models that solve real problems. In my time at ING, I’ve been dealing with a lot of the latter, and I’ll be sharing my thoughts on how to find problems worth solving with data science.

Bar
11:30
30min
Lets do the time warp again: time series machine learning with distance functions
Tony Bagnall

Many algorithms for machine learning from time series are based on measuring the distance or similarity between series. The most popular distance measure is dynamic time warping, which attempts to optimally realign two series to compensate for offset. There are many others, though. We present an overview of the most popular time-series-specific distance functions and describe their speed-optimised implementations in aeon, a scikit-learn compatible time series machine learning toolkit. We demonstrate their application to clustering, classification and regression on a real-world case study and highlight some of the latest distance-based time series machine learning tools available in aeon.
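
Dynamic time warping itself fits in a short dynamic program. A plain-Python reference version (an illustrative sketch; aeon's implementations are heavily optimised):

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences,
    using squared pointwise differences; O(len(a) * len(b))."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: DTW distance between the prefixes a[:i] and b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]
```

Because the alignment may repeat points, a series compared with a stretched copy of itself has distance zero, which is exactly the offset compensation the abstract describes.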

Qux
11:30
30min
Polars and a peek in the expression engine
Ritchie Vink

In this talk we will see why the expression engine in Polars is so versatile and fast.
We will look at expressions from the perspective of the optimizer as well as the physical engine.

Foo (main)
11:30
80min
There are no bad labels, only happy accidents
Ines Montani, Vincent Warmerdam

Are you 100% sure that you can trust your labels?
Imagine spending a company credit card's worth of compute on getting the best model statistics ever. Would that be money well spent if your dataset has some labeling issues?
More often than not, "bad labels" are great because they can tell you how to improve the machine learning model before even training it. But it only works if you actually spend time being confronted with your own dataset. In this workshop, we'll annotate our own data while leveraging techniques to find happy accidents. To solve specific problems, you don't need loads of data anymore – you just need good data.

Hello, World! (Tutorials)
12:10
12:10
30min
Bayesian ranking for tennis players in PyMC
Francesco Bruzzesi

In this talk, we will explore the Bayesian Bradley-Terry model implemented in PyMC. We will focus on its application to ranking tennis players, demonstrating how this probabilistic approach can provide accurate and robust rankings, arguably better than the ATP ranking itself and the Elo rating system.

By leveraging the power of Bayesian statistics, we can incorporate prior knowledge, handle uncertainty, and make better inferences about player abilities. Join us to learn how to implement the Bayesian Bradley Terry model in PyMC and discover its advantages for ranking tennis players.
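
The model's core is a single formula: player i beats player j with probability given by the logistic function of their latent strength gap. A tiny sketch without PyMC (illustrative only; the talk fits the full posterior):

```python
import math

def bt_win_prob(theta_i: float, theta_j: float) -> float:
    """Bradley-Terry: P(i beats j) from latent log-strengths."""
    return 1.0 / (1.0 + math.exp(theta_j - theta_i))

# With just two players, the maximum-likelihood strength gap is the
# log-odds of the observed win rate: here, 7 wins out of 10.
wins_i, games = 7, 10
gap = math.log(wins_i / (games - wins_i))
```

The Bayesian version replaces this point estimate with a posterior over all players' strengths, which is what makes the rankings robust to small sample sizes.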

Bar
12:10
30min
Using AI to make Amsterdam greener, safer and more accessible
Shayla Jansen, Niek IJzerman

In this talk, we would like to introduce you to the urban challenges that the City of Amsterdam is trying to solve using AI. We will walk you through the technical details behind one of our projects and invite you to join us in the ethical development of cool AI applications for social good.

Foo (main)
12:50
12:50
15min
Closing notes
Foo (main)
14:00
14:00
300min
Open source sprints @ Xebia Data, read more at https://amsterdam.pydata.org/sprints
Foo (main)