12 Jul 2024
58m

Benchmarks 201: Why Leaderboards > Arenas >> LLM-as-Judge

Latent Space: The AI Engineer Podcast

This episode traces how Hugging Face's evaluation practice developed, why leaderboards matter to the AI community, and where current evaluation methods are limited or biased. It stresses the need for reproducibility, unbiased metrics, and continuous benchmarking. The conversation also covers the challenges of human evaluation, the role of prompts in model evaluation, and the constraints imposed by limited compute. It closes with planned improvements to the leaderboard and the evaluations the speakers are looking forward to.
