09-14, 15:10–15:40 (Europe/Amsterdam), Bar
With Apache Iceberg, you store your big data in the cloud as files (e.g., Parquet), but query it as if it were a plain SQL table. You enjoy the endless scalability of the cloud, without having to worry about how to store, partition, or query your data efficiently. PyIceberg is the Python implementation of Apache Iceberg that loads your Iceberg tables into PyArrow, pandas, DuckDB, or whichever engine you prefer for data science. This means that with PyIceberg, you can tap into big data using nothing but Python. It's time to say goodbye to the ancient Hadoop-based frameworks of the past! In this talk, you'll learn why you need Iceberg, how to use it, and why it is so fast.
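To give a flavor of what "only using Python" looks like, here is a minimal sketch; the catalog name and the table identifier are made up for illustration and would depend on your own setup:

from pyiceberg.catalog import load_catalog

# Load a catalog configured for your environment (the name "default" is an assumption)
catalog = load_catalog("default")

# Load an Iceberg table by its identifier (hypothetical namespace and table name)
table = catalog.load_table("examples.nyc_taxi")

# Materialize the table into the engine of your choice
arrow_table = table.scan().to_arrow()                  # PyArrow
df = table.scan().to_pandas()                          # pandas
con = table.scan().to_duckdb(table_name="nyc_taxi")    # DuckDB connection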
Description: Working with high volumes of data has always been complex and challenging. Querying data with Spark requires you to know how the data is partitioned; otherwise, your query performance suffers tremendously. The Apache Iceberg open table format solves this at the storage layer, instead of pushing the burden onto end users. Iceberg originated at Netflix and provides a cloud-native table layer on top of your data files. It solves long-standing correctness issues by supporting safe concurrent reads and writes to the same table. Iceberg dramatically improves performance by collecting column-level metrics on the data, making it easy to repartition your data, and compacting the underlying files. Finally, it supports time travel, so the dataset your model trains on doesn't change underneath you when new data is added. After this talk, you'll be comfortable using Apache Iceberg.
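A hedged sketch of the filtering and time travel described above, using PyIceberg (the table identifier and column name are illustrative, not part of the talk material):

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")
table = catalog.load_table("examples.nyc_taxi")

# Row filters are pushed down: Iceberg's per-file column metrics let PyIceberg
# skip data files whose min/max statistics cannot match the predicate.
long_trips = table.scan(row_filter="trip_distance > 10.0").to_pandas()

# Time travel: pin the scan to an earlier snapshot, so the training data stays
# stable even after new data has been appended to the table.
first_snapshot_id = table.history()[0].snapshot_id
old_version = table.scan(snapshot_id=first_snapshot_id).to_pandas()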
Minutes 0-5: History and why we need a table format
Minutes 5-15: Overview of Iceberg, and how it works under the hood
Minutes 15-30: Introduction to PyIceberg with code and real examples (notebook!!)
No previous knowledge expected
Open source enthusiast. Committer on Apache Avro, Parquet, Druid, Airflow, and Iceberg. Apache Software Foundation member.