Part II: Understanding LLMs
The large language model ecosystem has grown at a breathtaking pace. Closed-source frontier models from OpenAI, Anthropic, and Google push the boundaries of capability, while open-weight releases from Meta, DeepSeek, Mistral, Alibaba, and Microsoft have democratized access to powerful models that anyone can download, fine-tune, and deploy. Meanwhile, a new class of reasoning models has emerged, shifting compute from training time to inference time through extended chains of thought, process reward models, and tree search over candidate solutions.
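To make the test-time compute idea concrete, here is a minimal best-of-N sketch: sample several candidate solutions, score each with a reward model, and keep the best. Everything here is a stand-in — `generate_candidates` and `reward_score` are hypothetical stubs for a model API and a (process) reward model, not any particular system's implementation.

```python
def generate_candidates(prompt: str, n: int) -> list[str]:
    # Stand-in for sampling n chain-of-thought completions from an LLM.
    # A real system would call a model API here with temperature > 0.
    return [f"{prompt} -> reasoning path {i}" for i in range(n)]

def reward_score(candidate: str) -> int:
    # Stand-in for a reward model scoring a candidate solution.
    # A process reward model would instead score each reasoning step.
    return sum(ord(c) for c in candidate) % 100

def best_of_n(prompt: str, n: int = 8) -> str:
    """Spend extra inference-time compute: sample n candidates,
    return the one the reward model scores highest."""
    candidates = generate_candidates(prompt, n)
    return max(candidates, key=reward_score)

print(best_of_n("Solve 17 * 24"))
```

The same select-the-best skeleton generalizes: tree search replaces the flat sampling loop with stepwise expansion, scoring partial reasoning paths rather than only finished ones.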
This chapter surveys the current landscape from four complementary perspectives. We begin with the closed-source frontier (Section 7.1), examining the capabilities, pricing, and publicly available architectural hints for GPT-4o, Claude, Gemini, and their competitors. Section 7.2 dives deep into open-source and open-weight models, with particular attention to architectural innovations like DeepSeek V3's Multi-head Latent Attention, FP8 training, and auxiliary-loss-free Mixture of Experts. Section 7.3 explores the paradigm shift toward reasoning models and test-time compute scaling. Finally, Section 7.4 addresses the multilingual and cross-cultural dimensions that determine whether these models serve a global audience or remain English-centric tools.