PyData Amsterdam 2023

Don’t judge a book by its cover: Using LLM created datasets to train models that detect literary features
09-14, 13:30–14:00 (Europe/Amsterdam), Qux

Existing book recommendation systems like Goodreads are based on correlating the reading habits of people. But what if you want a humorous book? Or a book that is set in 19th century Paris? Or a thriller, but without violence?
We build book recommendation systems for Dutch libraries based on more than a dozen features, ranging from historical setting to writing style to the characteristics of the main character. This allows us to tailor each recommendation to individual readers.


Recent developments in LLMs offer an interesting way for us to improve our recommendations. However, running LLMs in production is not always feasible. The associated costs may be too high, and running third-party code in your daily pipeline may be undesirable. And then there’s data privacy (or, in our case, intellectual property) to consider as well.

So how can you reap the benefits of an LLM, without exposing yourself or your company to some of these major downsides?

We used LLMs to generate tailor-made datasets on which to train our literary feature detection models. This allowed us to benefit from the high performance of large language models without continued reliance on external parties such as OpenAI or Google.
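
As a rough illustration of the idea (not our production setup), the sketch below labels a few passages once with an LLM and then trains a small local classifier on those labels. Everything in it is an assumption made for the example: `llm_label_humor` is a hypothetical stand-in that uses a keyword rule so the code runs offline, the passages are invented, and the Naive Bayes classifier is just a simple placeholder for whatever model you would actually train.

```python
import math
from collections import Counter, defaultdict


def llm_label_humor(text: str) -> str:
    """Hypothetical stand-in for a one-off LLM labelling call.

    A real pipeline would send the passage to an LLM here; a keyword
    rule is used instead so that the sketch runs without any network.
    """
    funny_words = {"hilarious", "witty", "absurd", "joke"}
    return "humorous" if funny_words & set(text.lower().split()) else "serious"


def tokenize(text: str) -> list[str]:
    return text.lower().split()


def train_naive_bayes(texts, labels):
    """Count labels and per-label word frequencies."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def predict(model, text: str) -> str:
    """Return the most likely label, using add-one smoothing."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented example passages; in practice these would be book excerpts.
passages = [
    "a hilarious romp through a village of absurd neighbours",
    "a witty detective cracks a joke at every crime scene",
    "a grim account of famine and loss in winter",
    "a solemn meditation on grief and memory",
]

# Step 1: the (stand-in) LLM labels the data once, offline.
labels = [llm_label_humor(p) for p in passages]

# Step 2: a small local model learns from those labels.
model = train_naive_bayes(passages, labels)

# Step 3: new text is classified locally, with no LLM call involved.
print(predict(model, "an absurd and witty tale of neighbours"))
print(predict(model, "a solemn account of grief"))
```

The point of the split is that the LLM is only needed while building the dataset; the daily pipeline runs the local model alone.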

While you may think LLMs are not as effective for languages other than English, we’ve seen major improvements in several of our models.

In this talk, we’ll highlight:
- A note on recommenders: Why does the Goodreads recommender not work for me, while Spotify’s Discover Weekly is so good?
- Different methods of getting data from books
- The iterative process of creating a dataset using an LLM and retraining our models
- Some notes on intellectual property and the evaluation of models


Prior Knowledge Expected

No previous knowledge expected

Typewriter repairman turned Machine Learning Engineer, now working for Bookarang, a Dutch startup that works with Dutch libraries to improve the recommendations for their members.
Wrote several picture books, but is not allowed to boost those in the recommendation system.