Module 24 · Section 24.7

Robotics, Embodied AI & Scientific Discovery

LLMs as robot planners, web automation agents, OS-level agents, AI for mathematics, and scientific literature mining
★ Big Picture

LLMs are extending from digital text into the physical world and the frontiers of science. In robotics, LLMs serve as high-level planners that translate natural language instructions into sequences of robot actions. In web and OS automation, they operate as agents that navigate interfaces, fill forms, and complete tasks on behalf of users. In scientific discovery, they mine literature, generate hypotheses, prove theorems, and design experiments. These applications represent the cutting edge of what LLMs can do when connected to real-world actuators and scientific knowledge.

1. LLMs as Robot Planners

The key insight behind using LLMs for robotics is that language models possess extensive world knowledge about objects, their properties, and how they relate to each other. A human saying "make me a sandwich" implies a sequence of actions (get bread, get ingredients, assemble, plate) that an LLM can decompose into steps. The challenge is grounding these steps in the robot's actual physical capabilities and environment.

SayCan: Grounding Language in Robot Actions

Google's SayCan combines an LLM's knowledge of what makes sense to do with a robot's learned affordances (what it can physically do). The LLM proposes candidate next actions, and a value function scores each action based on whether the robot can actually execute it in the current state. This product of "what should I do" (LLM) and "what can I do" (affordance model) produces grounded action plans that are both semantically correct and physically feasible.

[Figure: a human instruction ("clean the table") feeds an LLM planner ("what should I do?") and an affordance model ("what can I do?"); candidate actions are scored as P(useful) × P(can), and the robot executes pick/place/wipe actions, returning execution feedback to the planner.]
Figure 24.10: SayCan architecture. The LLM proposes actions scored by affordance models, and the robot executes the highest-scoring feasible action.
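The scoring scheme in Figure 24.10 can be sketched in a few lines. This is a toy illustration of the SayCan-style product of scores, not Google's implementation; the candidate actions and score values are invented for the example.

```python
# Conceptual sketch of SayCan-style action selection: combine an LLM
# usefulness score with an affordance (feasibility) score and pick the
# action maximizing their product.

def select_action(candidates, llm_score, affordance_score):
    """Return the candidate maximizing P(useful) * P(can)."""
    scored = {a: llm_score(a) * affordance_score(a) for a in candidates}
    return max(scored, key=scored.get)

# Toy scores for the instruction "clean the table":
llm_scores = {"pick sponge": 0.6, "wipe table": 0.9, "open fridge": 0.05}
# The gripper is empty, so "wipe table" is infeasible until it holds a sponge:
affordances = {"pick sponge": 0.95, "wipe table": 0.1, "open fridge": 0.9}

best = select_action(llm_scores, llm_scores.get, affordances.get)
print(best)  # "pick sponge": 0.6 * 0.95 beats 0.9 * 0.1
```

Note that the affordance model vetoes the semantically best action ("wipe table") until the physical precondition is met, which is exactly the grounding effect SayCan is after.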

RT-2: Vision-Language-Action Models

Google's RT-2 (Robotics Transformer 2) takes grounding further by training a single vision-language model that directly outputs robot actions. The model processes camera images and language instructions and outputs discretized action tokens (arm positions, gripper states). By co-training on both internet-scale vision-language data and robot demonstration data, RT-2 acquires emergent reasoning capabilities: it can follow instructions involving concepts never seen during robot training (like "move the object to the picture of a country" by recognizing flags).
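The core trick that lets a language model emit robot actions is discretization: continuous action dimensions are binned into a small token vocabulary. The sketch below illustrates the idea; the bin count, action ranges, and rounding scheme are assumptions for illustration, not RT-2's published tokenization.

```python
# Conceptual sketch of vision-language-action tokenization: continuous
# action values are binned so a language model can emit them as tokens.

NUM_BINS = 256

def to_token(value: float, low: float, high: float) -> int:
    """Discretize a continuous value in [low, high] into one of NUM_BINS bins."""
    value = max(low, min(high, value))
    return round((value - low) / (high - low) * (NUM_BINS - 1))

def from_token(token: int, low: float, high: float) -> float:
    """Recover the (approximate) continuous value from a bin index."""
    return low + token / (NUM_BINS - 1) * (high - low)

# A 3-DoF end-effector displacement in meters, each axis in [-0.1, 0.1]:
action = [0.05, -0.02, 0.0]
tokens = [to_token(v, -0.1, 0.1) for v in action]
decoded = [from_token(t, -0.1, 0.1) for t in tokens]
print(tokens, decoded)
```

The round trip loses at most half a bin width of precision, which is why a few hundred bins per dimension suffice for arm control.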

# Conceptual: LLM as robot task planner
from openai import OpenAI
import json

client = OpenAI()

def plan_robot_actions(
    instruction: str,
    available_actions: list,
    scene_description: str,
) -> list:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""You are a robot task planner.
Available primitive actions: {json.dumps(available_actions)}
Current scene: {scene_description}
Decompose the instruction into a sequence of available actions.
Return a JSON object with a "steps" array; each step has
'action' and 'target'."""},
            {"role": "user", "content": instruction},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["steps"]

plan = plan_robot_actions(
    instruction="Put the dirty dishes in the dishwasher",
    available_actions=["pick", "place", "open", "close", "navigate"],
    scene_description="Kitchen counter with 3 plates and 2 cups. Dishwasher closed.",
)
print(json.dumps(plan, indent=2))

2. Web Automation and Browser Agents

Web automation agents use LLMs to navigate websites, fill forms, click buttons, and complete tasks that normally require human interaction. These agents observe the page (through screenshots, accessibility trees, or DOM parsing), decide what action to take, execute it, and observe the result. This is the same agentic loop from Module 21, applied to browser environments.

# Conceptual: web automation agent using browser tools
def web_agent_step(task: str, page_state: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are a web automation agent.
Given the current page state, decide the next action to complete
the task. Actions: click(selector), type(selector, text),
navigate(url), scroll(direction), wait(), done(result).
Return JSON with 'thought', 'action', and 'params'."""},
            {"role": "user", "content": f"""Task: {task}
Page title: {page_state['title']}
Interactive elements: {json.dumps(page_state['elements'])}"""},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

OS-Level Agents and Computer Use

OS-level agents extend web automation to the entire desktop. Anthropic's Computer Use API, for example, lets Claude interact with a computer through screenshots and mouse/keyboard actions. The agent observes the screen, reasons about what it sees, and executes actions like clicking buttons, typing text, or switching between applications. This capability enables automation of tasks that span multiple applications (like copying data from a spreadsheet to an email) without requiring application-specific APIs.
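The execution side of such an agent is a dispatch loop that maps the model's structured actions onto mouse/keyboard handlers. The sketch below uses invented action names and stub handlers, not Anthropic's Computer Use API; a real agent would back each handler with a GUI toolkit.

```python
# Conceptual sketch of an OS-agent action dispatcher: the model returns
# {'action': ..., 'params': ...} dicts, which are routed to registered
# handlers; unknown actions are rejected rather than executed.

def make_dispatcher(handlers: dict):
    """Build a dispatch function over an explicit handler allowlist."""
    def dispatch(step: dict):
        action = step.get("action")
        if action not in handlers:
            raise ValueError(f"Unknown or disallowed action: {action}")
        return handlers[action](**step.get("params", {}))
    return dispatch

# Stub handlers that just record what would happen on a real desktop:
log = []
dispatch = make_dispatcher({
    "screenshot": lambda: log.append(("screenshot",)),
    "click": lambda x, y: log.append(("click", x, y)),
    "type": lambda text: log.append(("type", text)),
})

dispatch({"action": "screenshot", "params": {}})
dispatch({"action": "click", "params": {"x": 120, "y": 300}})
dispatch({"action": "type", "params": {"text": "quarterly report"}})
print(log)
```

Routing every action through an explicit allowlist is also the natural place to hang the safety checks discussed later in this section.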

| Agent type | Environment | Observation | Actions | Example |
|---|---|---|---|---|
| Robot planner | Physical world | Camera, sensors | Pick, place, navigate | SayCan, RT-2 |
| Web agent | Browser | DOM, screenshots | Click, type, navigate | WebArena, BrowserGym |
| OS agent | Desktop | Screenshots | Mouse, keyboard | Computer Use, OSWorld |
| Code agent | IDE / terminal | Files, outputs | Read, write, execute | Claude Code, Devin |
📘 Benchmarking Embodied Agents

Evaluating agents that interact with real environments requires specialized benchmarks. WebArena tests web agents on realistic tasks (managing e-commerce sites, forums). OSWorld benchmarks OS-level agents on desktop tasks across operating systems. SQA (Situated Question Answering) tests robot understanding of physical environments. These benchmarks reveal that current agents succeed at simple, well-defined tasks but struggle with multi-step sequences, error recovery, and tasks requiring spatial reasoning or common sense about the physical world.
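At their core, these benchmarks share a simple harness shape: run the agent on each task, apply a success checker, and report the fraction solved. The sketch below is a minimal illustration of that pattern; the task format and checker functions are assumptions, not the WebArena or OSWorld APIs.

```python
# Minimal sketch of an agent benchmark harness: each task comes with a
# success predicate, and any crash counts as a failure.

def evaluate(agent, tasks) -> float:
    """Run `agent` on each (task, checker) pair; return fraction solved."""
    solved = 0
    for task, checker in tasks:
        try:
            result = agent(task)
            solved += checker(result)
        except Exception:
            pass  # agent errors are scored as failures, not crashes
    return solved / len(tasks)

# Toy agent that only handles arithmetic-style tasks:
toy_agent = lambda task: eval(task)  # stand-in for a real agent loop
tasks = [
    ("2 + 2", lambda r: r == 4),
    ("10 * 3", lambda r: r == 30),
    ("book a flight", lambda r: False),  # agent crashes -> failure
]
print(evaluate(toy_agent, tasks))  # 2 of 3 tasks solved
```

Real benchmarks differ mainly in the checker: WebArena inspects site state after the episode, while OSWorld runs post-hoc scripts against the desktop environment.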

3. AI for Mathematics and Theorem Proving

LLMs are making significant inroads in mathematical reasoning and formal theorem proving. Google DeepMind's AlphaProof and AlphaGeometry reached silver-medal-level performance on International Mathematical Olympiad problems, using a combination of LLMs for informal reasoning and formal verification systems for proof checking. These systems represent a new paradigm in which AI augments mathematical discovery rather than just calculation.

# Using an LLM for mathematical reasoning
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": """You are a mathematical reasoning assistant.
Work through problems step by step. Show all reasoning.
When uncertain, explore multiple approaches. Verify your
answer by checking boundary cases and special values."""},
        {"role": "user", "content": """Prove that for any positive integer n,
the sum 1 + 2 + ... + n = n(n+1)/2.
Use mathematical induction."""},
    ],
)
print(response.choices[0].message.content)
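Alongside the symbolic proof, it is good practice to sanity-check the closed form numerically — the same "verify by checking special values" habit the system prompt above asks for, and which formal systems like Lean make rigorous:

```python
# Numeric sanity check of the closed form 1 + 2 + ... + n = n(n+1)/2:
# brute-force the sum for many n and compare against the formula.
for n in range(1, 1001):
    assert sum(range(1, n + 1)) == n * (n + 1) // 2
print("formula holds for n = 1..1000")
```

A numeric check cannot replace the induction proof (it covers finitely many cases), but it catches transcription errors in the formula cheaply.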

4. Scientific Literature Mining and Hypothesis Generation

The scientific literature grows by millions of papers per year, making it impossible for any researcher to stay current even in a narrow field. LLMs can mine this literature to identify connections between findings, generate novel hypotheses, and suggest experimental designs. Systems like Semantic Scholar's AI-powered features and specialized scientific LLMs (Galactica, SciBERT) demonstrate how language models can accelerate the scientific discovery process.

[Figure: literature (papers, preprints, patents, data) flows into a scientific LLM (extract, synthesize, connect findings), which produces hypotheses (novel connections) and experiment designs (protocols, controls); a researcher evaluates, tests, and validates the results.]
Figure 24.11: AI-assisted scientific discovery. LLMs mine literature, generate hypotheses, and suggest experiments for researcher evaluation.
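The literature-to-hypotheses step in Figure 24.11 can be prompted directly. The sketch below reuses the OpenAI client pattern from the planner example; the prompt wording and output schema ('hypotheses', 'statement', 'supporting_abstracts', 'suggested_experiment') are illustrative assumptions, not a standard API.

```python
import json

def format_corpus(abstracts: list) -> str:
    """Number abstracts so the model can cite them by index."""
    return "\n\n".join(f"[{i}] {a}" for i, a in enumerate(abstracts))

def generate_hypotheses(abstracts: list, topic: str) -> dict:
    from openai import OpenAI  # same client setup as the planner example above
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are a scientific research assistant.
Given numbered paper abstracts, identify connections between findings
and propose testable hypotheses. Return a JSON object with a
'hypotheses' array; each entry has 'statement', 'supporting_abstracts'
(list of abstract indices), and 'suggested_experiment'."""},
            {"role": "user",
             "content": f"Topic: {topic}\n\n{format_corpus(abstracts)}"},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```

Numbering the abstracts matters: asking the model to cite indices it can be checked against is a cheap guard against hypotheses unsupported by the supplied corpus.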
🔍 Key Insight

The common thread across robotics, web automation, and scientific discovery is the LLM's role as a reasoning and planning layer that sits above domain-specific execution systems. In robotics, the LLM plans while robot controllers execute. In web automation, the LLM decides while browser APIs act. In science, the LLM hypothesizes while experiments validate. This separation of reasoning (LLM strength) from execution (domain-specific strength) is a powerful architectural pattern that enables LLMs to extend into virtually any domain with appropriate grounding.

⚠ Safety in Embodied AI

When LLMs control physical systems (robots, industrial equipment) or have broad computer access (OS agents), the consequences of errors become physical and potentially dangerous. A misplanned robot action can break objects or injure people. An OS agent with unrestricted access can delete files or send unauthorized messages. Robust safety measures are essential: action validation before execution, restricted action spaces, human approval for irreversible actions, sandbox environments for testing, and continuous monitoring of agent behavior.
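The first two of those measures — action validation and restricted action spaces — can be sketched as a pre-execution gate. The allowlist, the set of irreversible actions, and the approval callback below are illustrative assumptions, not a standard safety API.

```python
# Minimal sketch of pre-execution action validation: block actions
# outside an allowlist, and require explicit human approval for
# irreversible ones.

ALLOWED = {"pick", "place", "open", "close", "navigate", "wipe"}
IRREVERSIBLE = {"delete", "send", "pour"}

def validate(step: dict, approve=lambda s: False) -> bool:
    """Return True iff the step may execute; irreversible steps need approval."""
    action = step.get("action")
    if action in IRREVERSIBLE:
        return approve(step)  # human-in-the-loop gate, deny by default
    return action in ALLOWED

print(validate({"action": "pick", "target": "plate"}))       # True
print(validate({"action": "pour", "target": "liquid"}))      # False: no approval
print(validate({"action": "pour"}, approve=lambda s: True))  # True: approved
```

Defaulting the approval callback to "deny" means an agent wired up without a human reviewer fails safe rather than executing irreversible actions silently.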

Knowledge Check

1. How does SayCan combine LLM knowledge with robot capabilities?
Show Answer
SayCan multiplies two scores for each candidate action: the LLM's assessment of how useful the action is for the current task (semantic grounding) and the affordance model's assessment of whether the robot can physically execute the action in the current state (physical grounding). The action with the highest combined score is selected, ensuring plans are both semantically sensible and physically feasible.
2. What makes RT-2 different from SayCan in its approach to robot planning?
Show Answer
SayCan uses separate models: an LLM for planning and affordance models for grounding, with actions selected from a predefined set. RT-2 is an end-to-end vision-language-action model that directly outputs robot action tokens from camera images and language instructions. By co-training on internet data and robot data, RT-2 can generalize to novel concepts without needing them to be in the predefined action set.
3. How do web automation agents observe and interact with web pages?
Show Answer
Web agents observe pages through multiple modalities: screenshots (visual understanding of layout), DOM/accessibility trees (structured representation of page elements), and HTML parsing (detailed element properties). They interact through browser actions: clicking elements, typing text, navigating URLs, scrolling, and waiting for page loads. The agent follows an observe-think-act loop, using the LLM to decide the next action based on the current page state and task progress.
4. How did AlphaProof and AlphaGeometry achieve mathematical reasoning?
Show Answer
AlphaProof uses an LLM for informal mathematical reasoning (generating proof ideas in natural language) combined with a formal verification system (Lean 4) that checks proof correctness rigorously. AlphaGeometry combines a neural language model for generating geometric constructions with a symbolic deduction engine that verifies each step. Both systems demonstrate that combining LLM creativity with formal verification produces stronger mathematical reasoning than either alone.
5. Why is the separation of reasoning and execution a powerful pattern for LLM applications?
Show Answer
Separating reasoning (what should be done) from execution (how to do it) lets LLMs leverage their strength in understanding, planning, and natural language while delegating precise actions to domain-specific systems optimized for execution. This pattern enables LLMs to extend into robotics, web automation, science, and other domains without needing to solve the full execution problem. It also improves safety, because execution systems can validate and constrain LLM-generated plans before acting.

Key Takeaways