10 Jun 2024
4h 29m

ICLR 2024 — Best Papers & Talks (Benchmarks, Reasoning & Agents) — ft. Graham Neubig, Aman Sanger, Moritz Hardt

Latent Space: The AI Engineer Podcast

This episode surveys projects and benchmarks for evaluating language model agents on realistic, web-based tasks. It covers the difficulties language models face in navigation, filtering, math, and social scenarios, and highlights the remaining gap between model and human performance. It also stresses why evaluation matters for understanding where language models are strong and where they are weak. Several evaluation benchmarks are introduced, including WebArena, Sotopia, SWE-bench, GAIA, and Dynabench, each targeting a different aspect of language model performance. The discussion then turns to code generation, dataset contamination, dataset artifacts, benchmarks in the polymorphic era, and dynamic benchmarks. The episode closes with frameworks and tools such as Self-RAG, MetaGPT, and DSPy, which aim to improve the performance, reliability, and versatility of language models.
