PyData Amsterdam 2023

To One-Hot or Not: A guide to feature encoding and when to use what
09-15, 13:40–14:10 (Europe/Amsterdam), Qux

Have you ever struggled with the multitude of columns created by One Hot Encoding? Or decided to look beyond it, but found it hard to choose a feature encoder that would be a good replacement?

Good news: many encoding techniques have been developed to address different types of categorical data. This talk will provide an overview of the encoding methods available in data science, and guidance on deciding which one is appropriate for the data at hand.

Join this talk if you would like to hear why feature encoding matters and why you should not default to One Hot Encoding in every scenario. It will start with commonly used approaches and progress to more advanced and powerful techniques that can help extract meaningful information from the data.

For each presented encoder, after this talk you will know:
- When to use it
- When NOT to use it
- Important considerations specific to the encoder
- Which Python library offers a built-in implementation of the encoder, for easy integration into feature engineering pipelines


I will explore different feature encoding approaches and provide guidance for decision-making. I will cover simpler methods like Label, One Hot, and Frequency encoding, progressing to powerful techniques like Target and Rare Label encoding. Finally, I will explain more complex approaches like Weight of Evidence, Hash, and CatBoost encoding. I will close the talk by summarizing the key takeaways.
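
To make the simpler methods named above concrete, here is a minimal sketch using only pandas. It is not taken from the talk: the toy data, the column names (`city`, `churned`), and the variable names are assumptions made up for illustration, and the target encoding shown is the plain per-category mean without smoothing.

```python
import pandas as pd

# Toy data invented for illustration; not from the talk.
df = pd.DataFrame({
    "city": ["Amsterdam", "Utrecht", "Amsterdam", "Rotterdam", "Amsterdam", "Utrecht"],
    "churned": [1, 0, 0, 1, 1, 0],
})

# Label encoding: one arbitrary integer per category
# (pandas assigns codes in sorted category order).
label_encoded = df["city"].astype("category").cat.codes

# One Hot encoding: one indicator column per category.
one_hot = pd.get_dummies(df["city"], prefix="city")

# Frequency encoding: each category is replaced by its share of rows.
frequency_encoded = df["city"].map(df["city"].value_counts(normalize=True))

# Target encoding (unsmoothed): each category is replaced by the mean
# of the target within that category.
target_encoded = df.groupby("city")["churned"].transform("mean")

print(one_hot.shape)  # three categories give three columns
```

Note that One Hot encoding adds one column per distinct category, which is the column growth the abstract refers to, while the other three encoders each produce a single column. In practice, a target encoding computed this way should be fitted on training data only, since it uses the label and would otherwise leak it into the features.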

Target Audience:
Data scientists and anyone interested in feature encoding

Previous experience with feature encoders can be useful but is not mandatory to follow the talk.


Prior Knowledge Expected

No previous knowledge expected

Ana is a data scientist experienced in the payments industry with a focus on the risk domain. With a background in information and data science, she has contributed to building ML solutions for mitigating customer risk and optimizing customer monitoring processes.