LLMs are extending from digital text into the physical world and the frontiers of science. In robotics, LLMs serve as high-level planners that translate natural language instructions into sequences of robot actions. In web and OS automation, they operate as agents that navigate interfaces, fill forms, and complete tasks on behalf of users. In scientific discovery, they mine literature, generate hypotheses, prove theorems, and design experiments. These applications represent the cutting edge of what LLMs can do when connected to real-world actuators and scientific knowledge.
1. LLMs as Robot Planners
The key insight behind using LLMs for robotics is that language models possess extensive world knowledge about objects, their properties, and how they relate to each other. A human saying "make me a sandwich" implies a sequence of actions (get bread, get ingredients, assemble, plate) that an LLM can decompose into steps. The challenge is grounding these steps in the robot's actual physical capabilities and environment.
SayCan: Grounding Language in Robot Actions
Google's SayCan combines an LLM's knowledge of what makes sense to do with a robot's learned affordances (what it can physically do). The LLM proposes candidate next actions, and a value function scores each action based on whether the robot can actually execute it in the current state. This product of "what should I do" (LLM) and "what can I do" (affordance model) produces grounded action plans that are both semantically correct and physically feasible.
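The product of usefulness and feasibility can be sketched in a few lines. This is a hedged illustration of the SayCan scoring idea, not Google's implementation; the candidate actions and all scores below are invented for the example:

```python
# Hedged sketch of SayCan-style action selection: the LLM assigns each
# candidate action a usefulness score, an affordance model scores physical
# feasibility, and the product of the two ranks the actions.

def saycan_select(candidates, llm_score, affordance_score):
    """Pick the action maximizing P(useful | instruction) * P(feasible | state)."""
    scored = {a: llm_score[a] * affordance_score[a] for a in candidates}
    return max(scored, key=scored.get), scored

# "Bring me a soda": the LLM favors grabbing the can, but the affordance
# model knows the robot must first navigate to the counter.
candidates = ["pick up the can", "go to the counter", "open the drawer"]
llm_score = {"pick up the can": 0.7, "go to the counter": 0.25, "open the drawer": 0.05}
affordance = {"pick up the can": 0.1, "go to the counter": 0.9, "open the drawer": 0.8}

best, scores = saycan_select(candidates, llm_score, affordance)
print(best)  # → go to the counter
```

Note how the semantically most likely action ("pick up the can") loses once infeasibility discounts it; that discounting is the core of the grounding.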
RT-2: Vision-Language-Action Models
Google's RT-2 (Robotics Transformer 2) takes grounding further by training a single vision-language model that directly outputs robot actions. The model processes camera images and language instructions and outputs discretized action tokens (arm positions, gripper states). By co-training on both internet-scale vision-language data and robot demonstration data, RT-2 acquires emergent reasoning capabilities: it can follow instructions involving concepts never seen during robot training (like "move the object to the picture of a country" by recognizing flags).
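Discretized action tokens work by binning each continuous control dimension into a small integer vocabulary. The sketch below illustrates the idea only; the bin count, workspace range, and rounding scheme are assumptions, not RT-2's actual tokenizer:

```python
# Hedged sketch of action tokenization: map a continuous arm command into
# one of `bins` integer tokens, so the model can emit it like any other token.

def discretize(value, low, high, bins=256):
    """Map a continuous value in [low, high] to an integer token in [0, bins-1]."""
    clipped = min(max(value, low), high)
    return round((clipped - low) / (high - low) * (bins - 1))

def undiscretize(token, low, high, bins=256):
    """Recover the (quantized) continuous value a token represents."""
    return low + token / (bins - 1) * (high - low)

# A gripper x-position of 0.25 m in an assumed [-1, 1] m workspace:
token = discretize(0.25, -1.0, 1.0)
print(token, round(undiscretize(token, -1.0, 1.0), 3))  # → 159 0.247
```

The round trip loses a little precision (quantization error), which is the price of making actions look like text to the transformer.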
```python
# Conceptual: LLM as robot task planner
from openai import OpenAI
import json

client = OpenAI()

def plan_robot_actions(
    instruction: str,
    available_actions: list,
    scene_description: str,
) -> list:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"""You are a robot task planner.
Available primitive actions: {json.dumps(available_actions)}
Current scene: {scene_description}

Decompose the instruction into a sequence of available actions.
Return a JSON object with a 'steps' array; each step has 'action' and 'target'."""},
            {"role": "user", "content": instruction},
        ],
        # JSON mode requires the model to return an object, so the prompt
        # asks for an object wrapping the 'steps' array rather than a bare array.
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)["steps"]

plan = plan_robot_actions(
    instruction="Put the dirty dishes in the dishwasher",
    available_actions=["pick", "place", "open", "close", "navigate"],
    scene_description="Kitchen counter with 3 plates and 2 cups. Dishwasher closed.",
)
print(json.dumps(plan, indent=2))
```
2. Web Automation and Browser Agents
Web automation agents use LLMs to navigate websites, fill forms, click buttons, and complete tasks that normally require human interaction. These agents observe the page (through screenshots, accessibility trees, or DOM parsing), decide what action to take, execute it, and observe the result. This is the same agentic loop from Module 21, applied to browser environments.
```python
# Conceptual: web automation agent using browser tools
# (reuses the `client` and `json` imports from the planner example above)
def web_agent_step(task: str, page_state: dict) -> dict:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": """You are a web automation agent.
Given the current page state, decide the next action to complete the task.
Actions: click(selector), type(selector, text), navigate(url),
scroll(direction), wait(), done(result).
Return JSON with 'thought', 'action', and 'params'."""},
            {"role": "user", "content": f"""Task: {task}
Page title: {page_state['title']}
Interactive elements: {json.dumps(page_state['elements'])}"""},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```
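A single decision step only becomes an agent when wrapped in the observe-think-act loop. This driver is a minimal sketch: `get_page_state` and `execute_action` are hypothetical stand-ins for a real browser backend (e.g. Playwright), and the step budget guards against loops:

```python
# Hedged driver sketch: observe, decide, act, repeat until done() or the
# step budget runs out. Browser helpers are passed in so they can be stubbed.

def run_web_agent(task, get_page_state, agent_step, execute_action, max_steps=20):
    for _ in range(max_steps):
        decision = agent_step(task, get_page_state())
        if decision["action"] == "done":
            return decision["params"].get("result")
        execute_action(decision["action"], decision["params"])
    raise RuntimeError("Agent exceeded step budget without finishing")
```

Injecting the helpers keeps the loop testable: in tests `agent_step` can be a scripted stub instead of a live model call.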
OS-Level Agents and Computer Use
OS-level agents extend web automation to the entire desktop. Anthropic's Computer Use API, for example, lets Claude interact with a computer through screenshots and mouse/keyboard actions. The agent observes the screen, reasons about what it sees, and executes actions like clicking buttons, typing text, or switching between applications. This capability enables automation of tasks that span multiple applications (like copying data from a spreadsheet to an email) without requiring application-specific APIs.
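Because OS agents act through raw mouse and keyboard primitives, a thin validation layer between the model's output and the desktop is a common pattern. This sketch is illustrative only; the primitive names and screen bounds are assumptions, not Anthropic's actual Computer Use tool schema:

```python
# Hedged sketch: validate a model-proposed desktop action before executing it.
# Unknown primitives are rejected, and clicks must land on the screen.

SCREEN_W, SCREEN_H = 1920, 1080
PRIMITIVES = {"screenshot", "left_click", "type", "key", "scroll"}

def validate_os_action(action: dict) -> dict:
    """Return the action if it is well-formed; raise ValueError otherwise."""
    if action["name"] not in PRIMITIVES:
        raise ValueError(f"unknown action: {action['name']}")
    if action["name"] == "left_click":
        x, y = action["coordinate"]
        if not (0 <= x < SCREEN_W and 0 <= y < SCREEN_H):
            raise ValueError("click outside screen bounds")
    return action

ok = validate_os_action({"name": "left_click", "coordinate": [640, 400]})
```

A real harness would then dispatch validated actions to an input library and loop back with a fresh screenshot.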
| Agent Type | Environment | Observation | Actions | Example |
|---|---|---|---|---|
| Robot planner | Physical world | Camera, sensors | Pick, place, navigate | SayCan, RT-2 |
| Web agent | Browser | DOM, screenshots | Click, type, navigate | WebArena, BrowserGym |
| OS agent | Desktop | Screenshots | Mouse, keyboard | Computer Use, OSWorld |
| Code agent | IDE / terminal | Files, outputs | Read, write, execute | Claude Code, Devin |
Evaluating agents that interact with real environments requires specialized benchmarks. WebArena tests web agents on realistic tasks (managing e-commerce sites, forums). OSWorld benchmarks OS-level agents on desktop tasks across operating systems. SQA3D (Situated Question Answering in 3D Scenes) tests an agent's situated understanding of physical 3D environments. These benchmarks reveal that current agents succeed at simple, well-defined tasks but struggle with multi-step sequences, error recovery, and tasks requiring spatial reasoning or common sense about the physical world.
3. AI for Mathematics and Theorem Proving
LLMs are making significant inroads in mathematical reasoning and formal theorem proving. Google DeepMind's AlphaProof and AlphaGeometry 2 reached silver-medal level on problems from the 2024 International Mathematical Olympiad, pairing LLM-based informal reasoning with formal systems (such as the Lean proof assistant) that mechanically verify each proof step. These systems represent a new paradigm in which AI augments mathematical discovery rather than just calculation.
```python
# Using an LLM for mathematical reasoning
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": """You are a mathematical reasoning assistant.
Work through problems step by step. Show all reasoning.
When uncertain, explore multiple approaches.
Verify your answer by checking boundary cases and special values."""},
        {"role": "user", "content": """Prove that for any positive integer n,
the sum 1 + 2 + ... + n = n(n+1)/2. Use mathematical induction."""},
    ],
)
print(response.choices[0].message.content)
```
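A correct response to this prompt follows the standard induction pattern, which is worth spelling out since it is also what a verifier would check:

```latex
\textbf{Base case } (n=1):\quad 1 = \frac{1 \cdot 2}{2}. \\[4pt]
\textbf{Inductive step:} assume $\sum_{k=1}^{n} k = \frac{n(n+1)}{2}$. Then
\[
  \sum_{k=1}^{n+1} k
  \;=\; \frac{n(n+1)}{2} + (n+1)
  \;=\; \frac{(n+1)(n+2)}{2},
\]
which is the claimed formula with $n$ replaced by $n+1$, completing the induction.
```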
4. Scientific Literature Mining and Hypothesis Generation
The scientific literature grows by millions of papers per year, making it impossible for any researcher to stay current even in a narrow field. LLMs can mine this literature to identify connections between findings, generate novel hypotheses, and suggest experimental designs. Systems like Semantic Scholar's AI-powered features and specialized scientific language models (Meta's Galactica, SciBERT) demonstrate how language models can accelerate the scientific discovery process.
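A hypothesis-generation workflow typically reduces to careful prompt construction over a set of abstracts. The sketch below shows one plausible shape for such a prompt builder; the wording, the `n_hypotheses` parameter, and the falsification requirement are illustrative choices, not a published system's design:

```python
# Hedged sketch of a literature-mining prompt builder: number the abstracts,
# then ask the model for cross-paper hypotheses with cited sources and a
# falsifying experiment for each.

def build_hypothesis_prompt(abstracts: list[str], n_hypotheses: int = 3) -> list[dict]:
    numbered = "\n\n".join(f"[{i + 1}] {a}" for i, a in enumerate(abstracts))
    return [
        {"role": "system", "content": (
            "You are a scientific research assistant. Read the abstracts and "
            f"propose {n_hypotheses} testable hypotheses that connect findings "
            "across papers. For each, cite the abstracts it draws on and "
            "suggest an experiment that could falsify it."
        )},
        {"role": "user", "content": numbered},
    ]

messages = build_hypothesis_prompt([
    "Drug A inhibits kinase X in vitro.",
    "Kinase X overexpression correlates with tumor growth in mice.",
])
# `messages` can then be passed to client.chat.completions.create(...)
```

Asking for falsifying experiments, not just hypotheses, keeps the human researcher in the validation loop that the next paragraph describes.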
The common thread across robotics, web automation, and scientific discovery is the LLM's role as a reasoning and planning layer that sits above domain-specific execution systems. In robotics, the LLM plans while robot controllers execute. In web automation, the LLM decides while browser APIs act. In science, the LLM hypothesizes while experiments validate. This separation of reasoning (LLM strength) from execution (domain-specific strength) is a powerful architectural pattern that enables LLMs to extend into virtually any domain with appropriate grounding.
When LLMs control physical systems (robots, industrial equipment) or have broad computer access (OS agents), the consequences of errors become physical and potentially dangerous. A misplanned robot action can break objects or injure people. An OS agent with unrestricted access can delete files or send unauthorized messages. Robust safety measures are essential: action validation before execution, restricted action spaces, human approval for irreversible actions, sandbox environments for testing, and continuous monitoring of agent behavior.
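Two of these measures, restricted action spaces and human approval for irreversible actions, compose naturally into a single gate in front of the executor. This is a minimal sketch; the action names and the choice of which actions count as irreversible are invented for the example:

```python
# Hedged sketch of a safety gate: a whitelist restricts the action space,
# and irreversible actions are held until a human approver signs off.

ALLOWED = {"read_file", "list_dir", "write_file", "delete_file", "send_email"}
IRREVERSIBLE = {"delete_file", "send_email"}

def gate_action(action: str, approve) -> str:
    """Return 'execute', 'blocked', or 'pending_approval' for an agent action."""
    if action not in ALLOWED:
        return "blocked"
    if action in IRREVERSIBLE and not approve(action):
        return "pending_approval"
    return "execute"

# With an approver that denies everything, destructive actions are held:
print(gate_action("read_file", lambda a: False))    # → execute
print(gate_action("delete_file", lambda a: False))  # → pending_approval
print(gate_action("format_disk", lambda a: False))  # → blocked
```

In production the `approve` callback would surface a confirmation UI and log the decision for the continuous monitoring mentioned above.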
Key Takeaways
- SayCan grounds LLM plans in robot capabilities by combining semantic understanding with physical affordance scoring.
- RT-2 achieves end-to-end vision-language-action reasoning, generalizing to concepts not seen during robot training.
- Web automation agents use observe-think-act loops with DOM and screenshot observations to navigate and interact with websites.
- OS-level agents (Computer Use) extend automation beyond browsers to the full desktop, enabling cross-application workflows.
- AI for mathematics combines LLM informal reasoning with formal verification, achieving breakthrough results on competition-level problems.
- Scientific discovery benefits from LLM literature mining, hypothesis generation, and experiment design, with human researchers validating and testing AI-generated ideas.
- Safety is paramount when LLMs control physical systems or have broad computer access; action validation, sandboxing, and human oversight are essential.