YouTube, 11 Feb 2025
18m

System Design for Next-Gen Frontier Models — Dylan Patel, SemiAnalysis


AI Engineer

This episode focuses on the challenges of large language model (LLM) inference, particularly for models with trillions of parameters. The speaker discusses the computational demands of the pre-fill and decode phases, highlighting the need for techniques like continuous batching and disaggregated pre-fill to improve efficiency and cost-effectiveness. He also explores the limitations of current open-source libraries and the need for advances in areas such as context caching, which reduces the cost of processing large input prompts. For example, the speaker notes that a 2,000-token prompt requires about a petaFLOP (10^15 floating-point operations) of compute, and that current pricing shows a 3-4x difference in cost between input and output tokens. The discussion concludes with a look at the massive scale of next-generation LLM training clusters and the associated hardware and reliability challenges.
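As a rough sanity check on the 2,000-token figure, the sketch below uses the common rule of thumb that a forward pass costs about 2 floating-point operations per active parameter per token. That rule, the function names, and the implied parameter count are assumptions for illustration; the summary itself does not say which model the speaker had in mind.

```python
# Back-of-the-envelope pre-fill cost, assuming ~2 FLOPs per active
# parameter per token (a rule of thumb, not a figure from the talk).

PETA = 1e15  # one petaFLOP worth of operations


def prefill_flops(prompt_tokens: int, active_params: float) -> float:
    """Approximate forward-pass FLOPs needed to process a prompt."""
    return 2.0 * active_params * prompt_tokens


def implied_active_params(total_flops: float, prompt_tokens: int) -> float:
    """Invert the rule of thumb: the active parameter count that would
    make a prompt of this length cost `total_flops`."""
    return total_flops / (2.0 * prompt_tokens)


if __name__ == "__main__":
    tokens = 2_000
    params = implied_active_params(PETA, tokens)
    print(f"Active parameters implied by 1 petaFLOP at {tokens} tokens: {params:.3g}")
    print(f"Pre-fill FLOPs at that size: {prefill_flops(tokens, params):.3g}")
```

Under this assumption, the quoted figure corresponds to a model with roughly 250 billion active parameters per token. Attention cost, which grows with context length, is ignored here.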

Outline

Part 1: LLM Landscape

Part 2: Inference Optimization

Part 3: Training Clusters and Scaling Challenges

Part 4: Future Outlook
