Section 24.3: Healthcare & Biomedical AI

★ Big Picture

Healthcare represents both the highest-stakes and highest-potential domain for LLM applications. Medical LLMs can assist with clinical documentation, diagnostic reasoning, patient communication, drug discovery, and literature synthesis. However, the consequences of errors are severe: incorrect medical information can directly harm patients. This creates a unique tension between the transformative potential of AI in healthcare and the stringent safety, privacy, and regulatory requirements that govern medical practice.

1. Medical LLMs

General-purpose LLMs perform surprisingly well on medical benchmarks. GPT-4 was reported to exceed the passing threshold on United States Medical Licensing Examination (USMLE) practice questions by more than 20 percentage points. Even so, medical LLMs fine-tuned on clinical data offer advantages in understanding medical terminology, following clinical reasoning patterns, and generating responses appropriate for healthcare contexts.

Model	Base	Training Focus	Notable Result
Med-PaLM 2	PaLM 2	Medical QA, clinical reasoning	86.5% on MedQA (expert level)
PMC-LLaMA	LLaMA	PubMed Central papers	Open-source biomedical LLM
BioMistral	Mistral	Biomedical literature	Strong on clinical NLP tasks
Meditron	LLaMA 2	Medical guidelines, PubMed	Clinical guideline adherence

2. Clinical NLP Applications

Clinical NLP processes the vast amount of unstructured text in electronic health records (EHRs). Progress notes, discharge summaries, radiology reports, and pathology findings contain critical clinical information that is difficult to query or analyze in text form. LLMs can extract structured data from these notes, identify patients matching clinical trial criteria, detect adverse drug events, and summarize patient histories.

from transformers import pipeline

# Clinical NER using a biomedical model
clinical_ner = pipeline(
    "token-classification",
    model="d4data/biomedical-ner-all",
    aggregation_strategy="simple",
)

clinical_note = """Patient presents with persistent cough and shortness of breath
for 2 weeks. History of Type 2 diabetes managed with metformin 500mg.
Chest X-ray shows bilateral infiltrates. Started on azithromycin
and referred for pulmonary function testing."""

entities = clinical_ner(clinical_note)
for ent in entities:
    print(f"  {ent['entity_group']:>20}: {ent['word']} ({ent['score']:.3f})")

Figure 24.5: Clinical NLP pipeline. EHR text is processed by medical LLMs for structured extraction, clinical trial matching, and adverse drug event detection.
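The flat entity list printed by the NER example can be turned into a structured record by grouping spans by label. The following is a minimal sketch, assuming the output format of the `token-classification` pipeline with `aggregation_strategy="simple"` (dictionaries with `entity_group`, `word`, and `score` keys). The helper name `group_entities`, the `min_score` threshold, and the hand-written sample labels are illustrative assumptions, not real model output.

```python
from collections import defaultdict


def group_entities(entities, min_score=0.5):
    """Group NER spans by label, dropping low-confidence spans.

    `entities` is assumed to be a list of dicts with the keys
    `entity_group`, `word`, and `score`.
    """
    grouped = defaultdict(list)
    for ent in entities:
        if ent["score"] < min_score:
            continue  # skip spans the model is unsure about
        word = ent["word"].strip()
        if word not in grouped[ent["entity_group"]]:
            grouped[ent["entity_group"]].append(word)  # keep first-seen order
    return dict(grouped)


# Hand-written sample in the pipeline's output format (hypothetical values).
sample_entities = [
    {"entity_group": "Sign_symptom", "word": "cough", "score": 0.98},
    {"entity_group": "Sign_symptom", "word": "shortness of breath", "score": 0.95},
    {"entity_group": "Medication", "word": "metformin", "score": 0.99},
    {"entity_group": "Medication", "word": "azithromycin", "score": 0.97},
    {"entity_group": "Dosage", "word": "500mg", "score": 0.31},
]

record = group_entities(sample_entities)
print(record)
```

A record in this shape is easier to query than free text, for example when checking a patient's medications against trial exclusion criteria. The confidence threshold trades recall for precision and would need tuning on annotated notes.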

3. Medical Question Answering

from openai import OpenAI

client = OpenAI()

# Medical QA with safety guardrails
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": """You are a medical information assistant for clinicians.
Provide evidence-based answers citing relevant guidelines and studies.
Always note the level of evidence. Flag when a question requires
specialist consultation. Never provide direct patient treatment
recommendations without specifying they need clinical validation."""},
        {"role": "user", "content": """What are the current first-line treatments for
newly diagnosed Type 2 diabetes in adults with HbA1c between 7-8%?"""},
    ],
)

print(response.choices[0].message.content)
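A system prompt alone does not guarantee that the model follows its instructions, so a deployment might also check each answer before showing it to a clinician. The sketch below is one simple way to do that, assuming keyword heuristics are an acceptable first filter. The pattern list and the function name `missing_safety_elements` are hypothetical, and a production system would need validated checks rather than regular expressions.

```python
import re

# Hypothetical heuristics for the two requirements stated in the system prompt:
# a level-of-evidence statement and a clinical-validation caveat.
REQUIRED_PATTERNS = {
    "evidence_level": re.compile(
        r"level of evidence|evidence level|grade [A-D]\b", re.IGNORECASE
    ),
    "clinical_validation": re.compile(
        r"clinical validation|clinical judgment|specialist consultation",
        re.IGNORECASE,
    ),
}


def missing_safety_elements(answer: str) -> list[str]:
    """Return the names of required elements that the answer does not contain."""
    return [
        name
        for name, pattern in REQUIRED_PATTERNS.items()
        if not pattern.search(answer)
    ]


# Hand-written example answers (not real model output).
compliant = (
    "Metformin is generally first-line (level of evidence: A). "
    "Any treatment decision requires clinical validation."
)
non_compliant = "Start metformin 500 mg twice daily."

print(missing_safety_elements(compliant))
print(missing_safety_elements(non_compliant))
```

An answer with a non-empty result could be regenerated, routed to human review, or shown with an explicit warning, depending on the application.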

4. Drug Discovery and Molecular Generation

LLMs trained on chemical and molecular data can generate novel drug candidates, predict molecular properties, and optimize lead compounds. These models treat molecules as sequences (SMILES notation) and apply the same autoregressive generation techniques used for text. More specialized approaches use graph neural networks or 3D molecular representations, but LLM-based methods benefit from the ability to incorporate textual descriptions of desired properties alongside molecular structures.

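# Molecular property prediction with a chemistry LLM
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# ChemBERTa checkpoint pretrained with masked language modeling on SMILES strings.
# Loading it for sequence classification adds a randomly initialized head, so
# the scores printed below are meaningless until the model has been fine-tuned
# on labeled property data (for example, a toxicity or solubility dataset).
model_id = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# SMILES representation of aspirin
smiles = "CC(=O)Oc1ccccc1C(=O)O"

inputs = tokenizer(smiles, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
    # Softmax converts the logits into class probabilities
    probabilities = outputs.logits.softmax(dim=-1)
print(f"Class probabilities (untrained head): {probabilities}")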
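To make the "molecules as sequences" idea concrete, the sketch below splits a SMILES string into chemically meaningful tokens before any model sees it. It uses a simplified regular expression of the kind commonly used for SMILES tokenization. This is an illustrative assumption and not the tokenizer that ChemBERTa ships with. The function name `tokenize_smiles` is hypothetical.

```python
import re

# Simplified SMILES token pattern: bracket atoms first, then two-letter
# halogens, then single-letter atoms, bond/branch symbols, and ring digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFIbcnosp]|[()=#\-+\\/:~@?>*$.]|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens; raise if any character is unmatched."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"Unrecognized characters in SMILES: {smiles!r}")
    return tokens


aspirin_tokens = tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O")
print(aspirin_tokens)

# Map each distinct token to an integer id, as a language model vocabulary would.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(aspirin_tokens)))}
aspirin_ids = [vocab[tok] for tok in aspirin_tokens]
print(aspirin_ids)
```

Treating `Cl`, `Br`, and bracket atoms such as `[NH4+]` as single tokens keeps chemically atomic units intact. A character-level split would break them apart.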

5. Protein Structure and Genomics

Protein language models like ESM-2 (Evolutionary Scale Modeling) treat amino acid sequences as "text" and learn representations that capture protein structure and function. AlphaFold 3 uses a diffusion-based architecture to predict 3D structures of proteins, nucleic acids, and their complexes. These tools are transforming structural biology by enabling rapid structure prediction that previously required months of experimental work.
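The "amino acids as text" framing can be illustrated with the data preparation step of masked language modeling. The sketch below is a simplified illustration, not ESM-2's actual preprocessing. It assumes a fixed masking rate of 15% and always substitutes a mask token, whereas BERT-style training also sometimes keeps or randomly replaces the selected positions. The peptide sequence and the names `mask_sequence` and `MASK` are arbitrary illustrative choices.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues
MASK = "<mask>"


def mask_sequence(sequence, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) for one masked-language-modeling example.

    `labels` maps each masked position to the residue the model must predict.
    """
    tokens = list(sequence)
    unknown = set(tokens) - set(AMINO_ACIDS)
    if unknown:
        raise ValueError(f"Non-standard residues: {sorted(unknown)}")
    rng = random.Random(seed)  # seeded so the example is reproducible
    n_mask = max(1, round(mask_prob * len(tokens)))
    positions = rng.sample(range(len(tokens)), n_mask)
    labels = {i: tokens[i] for i in positions}
    masked = [MASK if i in labels else tok for i, tok in enumerate(tokens)]
    return masked, labels


sequence = "MKTAYIAKQRQISFVKSHFSRQ"  # arbitrary illustrative peptide
masked, labels = mask_sequence(sequence)
print(" ".join(masked))
print(labels)
```

During training, the model sees `masked` and is scored on how well it recovers the residues in `labels`. Doing this across millions of sequences is what forces the embeddings to encode structural and evolutionary constraints.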

⚠ HIPAA and FDA Compliance

Healthcare LLM applications must comply with HIPAA (Health Insurance Portability and Accountability Act), which governs the use of Protected Health Information (PHI). In practice this means: no PHI in prompts sent to cloud APIs without a Business Associate Agreement (BAA); data encrypted in transit and at rest; access logged and auditable; and only the minimum necessary data used. For clinical decision support, FDA clearance may be required depending on the intended use. Software that provides diagnostic recommendations can be regulated as a medical device (Software as a Medical Device, SaMD) under the Federal Food, Drug, and Cosmetic Act; 21 CFR Part 820 then sets the quality system requirements that manufacturers of such devices must meet.
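The "minimum necessary" principle can be partly supported in code by scrubbing obvious identifiers before a note leaves the local environment. The sketch below is a deliberately simple illustration using regular expressions. The patterns, the placeholder labels, and the function name `redact` are all assumptions. Pattern matching of this kind does not amount to HIPAA de-identification, which covers many more identifier types (names, addresses, and others) and typically requires dedicated tooling or expert determination.

```python
import re

# Illustrative patterns for a few structured identifiers (US formats assumed).
PHI_PATTERNS = [
    ("SSN", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b")),
    ("DATE", re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b")),
    ("MRN", re.compile(r"\bMRN[:#]?\s*\d+\b", re.IGNORECASE)),
]


def redact(text):
    """Replace each matched identifier with a bracketed placeholder label."""
    for label, pattern in PHI_PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


# Synthetic note with made-up identifiers.
note = "MRN: 448201, seen 03/14/2024, callback 555-867-5309, SSN 123-45-6789."
print(redact(note))
```

Even with scrubbing in place, the BAA, encryption, and audit logging requirements still apply whenever residual PHI could reach an external service.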

🔍 Key Insight

The regulatory pathway for medical AI is becoming clearer but remains complex. The FDA's "predetermined change control plan" lets a manufacturer describe planned modifications in the original marketing submission, so that updates made within that plan do not require a new submission. This is critical for LLM-based systems that benefit from continuous improvement. The key distinction is between "AI as tool" (the clinician uses AI output as one input to their decision) and "AI as autonomous decision-maker" (the AI directly determines treatment). Current regulations strongly favor the former, where the human clinician retains decision authority.

Knowledge Check

1. Why do medical LLMs need different safety considerations than general-purpose LLMs?

Medical LLMs can directly impact patient health if they provide incorrect information. A hallucinated drug interaction, incorrect dosage, or missed contraindication could lead to patient harm. Medical LLMs need: stronger factual grounding (citations to medical literature), explicit uncertainty communication, clear disclaimers about clinical validation, and guardrails that prevent direct treatment recommendations without appropriate caveats.

2. How do LLMs assist with clinical trial matching?

LLMs process unstructured EHR text to extract patient characteristics (diagnoses, lab values, medications, demographics) and match them against clinical trial eligibility criteria. Traditional approaches require manual chart review or rigid rule-based systems. LLMs can understand nuanced inclusion/exclusion criteria expressed in natural language and identify eligible patients at scale, accelerating trial enrollment.

3. What HIPAA requirements apply to using cloud LLM APIs for clinical data?

HIPAA requires: a signed Business Associate Agreement (BAA) with the cloud provider before sending any PHI, encryption of PHI in transit and at rest, logging of all access to PHI for audit purposes, use of the minimum necessary PHI for the task, and assurance that the provider's data handling practices meet HIPAA security standards. Several cloud providers (for example OpenAI, Google Cloud, and Microsoft Azure) offer HIPAA-eligible configurations with BAAs.

4. How do protein language models like ESM-2 represent proteins?

Protein language models treat amino acid sequences as text, with each of the 20 standard amino acids as a token. They are trained on millions of protein sequences using masked language modeling (similar to BERT), learning to predict masked amino acids from context. The resulting embeddings capture evolutionary relationships, structural properties, and functional information, enabling zero-shot prediction of protein properties from sequence alone.

5. What is the FDA's distinction between "AI as tool" and "AI as autonomous decision-maker"?

"AI as tool" means the clinician uses AI output as one input to their own clinical decision, retaining final authority. This faces lighter regulatory scrutiny. "AI as autonomous decision-maker" means the AI directly determines treatment or diagnosis without human review, facing stringent FDA medical device regulations. Current regulations strongly favor the tool paradigm, requiring human clinicians to remain in the decision loop for patient care.

Key Takeaways

Medical LLMs such as Med-PaLM 2 reach expert-level scores on medical QA benchmarks, and open models such as BioMistral and Meditron target biomedical and clinical text, but all require careful deployment with safety guardrails.
Clinical NLP extracts structured data from EHR text, enabling clinical trial matching, adverse event detection, and patient history summarization.
Drug discovery LLMs treat molecules as sequences (SMILES), enabling generation of novel candidates and property prediction.
Protein language models (ESM-2) learn structural and functional properties from amino acid sequences, transforming structural biology.
HIPAA compliance requires BAAs for cloud APIs, PHI encryption, access logging, and minimum necessary data principles.
FDA regulation distinguishes between AI as a clinical tool (lighter oversight) and AI as an autonomous decision-maker (medical device regulation).