09-16, 11:30–12:50 (Europe/Amsterdam), Hello, World! (Tutorials)
Are you 100% sure that you can trust your labels?
Imagine spending a company credit card's worth of compute on getting the best model statistics ever. Would that be money well spent if your dataset has labeling issues?
More often than not, "bad labels" are great because they can tell you how to improve your machine learning model before you even train it. But this only works if you actually spend time being confronted with your own dataset. In this workshop, we'll annotate our own data while leveraging techniques to find these happy accidents. To solve specific problems, you don't need loads of data anymore – you just need good data.
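To give a flavor of the kind of technique we mean, here is a minimal sketch (this is an illustration, not the workshop's actual material): use out-of-fold predictions from a simple scikit-learn model to surface examples where the model confidently disagrees with the assigned label. Those examples are prime candidates for a manual look.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# A toy text classification dataset; any labeled dataset works here.
data = fetch_20newsgroups(subset="train", categories=["sci.med", "sci.space"])
X = TfidfVectorizer().fit_transform(data.data)
y = np.array(data.target)

# Out-of-fold probabilities: every example is scored by a model
# that never saw that example during training.
probs = cross_val_predict(
    LogisticRegression(max_iter=1000), X, y, cv=5, method="predict_proba"
)

# Rank examples by how little probability the model assigns to their
# own label; the top of this list is where labeling issues tend to hide.
suspicion = 1 - probs[np.arange(len(y)), y]
for idx in np.argsort(-suspicion)[:10]:
    print(f"{suspicion[idx]:.2f}  label={data.target_names[y[idx]]}")
```

A high-suspicion example isn't proof of a bad label, but it's a cheap, model-assisted way to decide where to spend your own annotation attention first.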
This workshop is split up into two segments.
First, we will dive into data quality issues in some well-known datasets. We will have prepared these datasets along with code, tricks, and Python tools to help you understand their data quality (or the lack thereof). Along the way, we will also discuss some of the theory behind annotator agreement metrics and show how they might help you make decisions.
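As a concrete preview of such a metric, here is a minimal sketch (using scikit-learn, which is an assumption on our part; the workshop's own tooling may differ) of Cohen's kappa for two hypothetical annotators. Kappa corrects raw agreement for the agreement you'd expect by chance alone, which is why it can be noticeably lower than the raw percentage.

```python
from sklearn.metrics import cohen_kappa_score

# Two hypothetical annotators labeling the same ten examples.
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]
annotator_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]

# Raw agreement: the fraction of examples where both annotators agree.
raw = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)

# Cohen's kappa: agreement corrected for chance, given each
# annotator's label distribution.
kappa = cohen_kappa_score(annotator_a, annotator_b)

print(f"raw agreement: {raw:.2f}")   # 0.80
print(f"Cohen's kappa: {kappa:.2f}") # ~0.58, lower once chance is accounted for
```

The gap between the two numbers is exactly the kind of signal that can inform decisions: a task where annotators barely beat chance agreement probably needs clearer guidelines before you spend compute on it.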
Once we have some experience with these techniques, we will annotate some data as a group, after which we will share the dataset so that everyone can analyze and be confronted with their own annotations.
The goal of the workshop is to give people the ability to detect a "data smell". Through hands-on experimentation with data quality challenges, you'll learn how to better reason about and iterate on your own data. Data quality really matters, and confronting the labeling process yourself can help you improve your machine learning pipeline and evaluation process.
No previous knowledge expected
Ines Montani is a developer specializing in tools for AI and NLP technology. She’s the co-founder and CEO of Explosion and a core developer of spaCy, a popular open-source library for Natural Language Processing in Python, and Prodigy, a modern annotation tool for creating training data for machine learning models.
Vincent D. Warmerdam is a software developer and senior data person. He currently works at Explosion on data quality tools for developers. He's also known for creating calmcode.io as well as a bunch of open-source projects. You can check out his blog at koaning.io to learn more about those.