The rise of synthetic data with Florian Hönicke from Jina AI

Knowledge Distillation with Helen Byrne

Knowledge Distillation is the podcast that brings together a mixture of experts from across the Artificial Intelligence community.
We talk to the world’s leading researchers about their experiences developing cutting-edge models, as well as to the technologists taking AI tools out of the lab and turning them into commercial products and services.
Knowledge Distillation also takes a critical look at the impact of artificial intelligence on society – opting for expert analysis instead of hysterical headlines.
We are committed to featuring at least 50% female voices on the podcast – elevating the many brilliant women working in AI.

Host Helen Byrne is a VP at the British AI compute systems maker Graphcore, where she leads the Solution Architects team, helping innovators build their AI solutions using Graphcore’s technology.

Helen previously led AI Field Engineering and worked in AI Research, tackling problems in distributed machine learning.

Before landing in Artificial Intelligence, Helen worked in FinTech and as a secondary school teacher. Her background is in mathematics and she has an MSc in Artificial Intelligence.

Knowledge Distillation is produced by Iain Mackenzie.

January 29, 2024 • Helen Byrne • Season 1 • Episode 4

Duration: 40:27

Data is the fuel powering the AI revolution, but what do we do when there is simply not enough of it to satisfy the appetite of new model training?

In this episode, Florian Hönicke, Principal AI Engineer at Jina AI, discusses the use of LLMs to generate synthetic data to help solve the data bottleneck. He also addresses the potential risks associated with an over-reliance on synthetic data.

German startup Jina AI is one of the many exciting companies coming out of Europe, supporting the development and commercialisation of generative AI.

The team at Jina AI gained widespread attention in late 2023 for the release of the first open-source text embedding model with an 8192-token context length. Jina-embeddings-v2 achieves state-of-the-art performance on a range of embedding-related tasks and matches the performance of OpenAI's proprietary ada-002 model.

Watch the video of our interview: https://youtu.be/AP80hZajk5w