Module 26 · Section 26.10

LLM Licensing, IP & Privacy

Model license taxonomy, commercial use, IP ownership, training data copyright, anonymization, and differential privacy
★ Big Picture

The legal landscape for LLMs is complex and unsettled. Model licenses range from fully open (Apache 2.0) to restrictive "open weight" releases with acceptable use policies. Who owns the intellectual property in LLM outputs remains legally uncertain. Training data copyright is actively litigated. And privacy requirements demand technical solutions like anonymization and differential privacy. Engineers must understand these issues to make defensible deployment decisions.

1. Model License Taxonomy

| License Type | Commercial Use | Modification | Examples |
|---|---|---|---|
| Apache 2.0 | Yes, unrestricted | Yes | Mistral 7B, Phi-3 |
| MIT | Yes, unrestricted | Yes | Some small models |
| Llama Community | Yes (under 700M MAU) | Yes | Llama 3, Llama 3.1 |
| Gemma Terms | Yes (with restrictions) | Yes | Gemma, Gemma 2 |
| CC-BY-NC | No | Yes (non-commercial) | Some research models |
| Proprietary API | Per ToS | No access to weights | GPT-4o, Claude, Gemini |
[Figure: Model License Openness Spectrum — Apache 2.0 (most open) → Llama License (open weights + AUP) → CC-BY-NC (non-commercial only) → Proprietary API (no weights access)]
Figure 26.10.1: Model licenses range from fully open (Apache 2.0) to proprietary API-only access with no weight availability.
def check_license_compatibility(model_license: str, use_case: dict) -> dict:
    """Check if a model license permits the intended use case."""
    rules = {
        "apache-2.0": {"commercial": True, "modification": True, "distribution": True},
        "llama-community": {"commercial": True, "modification": True,
                            "distribution": True, "mau_limit": 700_000_000},
        "cc-by-nc-4.0": {"commercial": False, "modification": True, "distribution": True},
    }

    license_rules = rules.get(model_license)
    if license_rules is None:
        # Fail closed: an unrecognized license should not pass as compatible
        return {"compatible": False, "issues": [f"Unknown license: {model_license}"]}
    issues = []

    if use_case.get("commercial") and not license_rules.get("commercial"):
        issues.append("Commercial use not permitted")
    if use_case.get("mau", 0) > license_rules.get("mau_limit", float("inf")):
        issues.append(f"MAU exceeds license limit of {license_rules['mau_limit']:,}")

    return {"compatible": len(issues) == 0, "issues": issues}

result = check_license_compatibility(
    "llama-community",
    {"commercial": True, "mau": 500_000}
)
print(result)
{'compatible': True, 'issues': []}

2. Differential Privacy for LLM Training

import numpy as np

def dp_sgd_step(gradients: list, clip_norm: float, noise_scale: float, lr: float):
    """Simulate a DP-SGD gradient step with clipping and noise."""
    clipped = []
    for grad in gradients:
        grad_norm = np.linalg.norm(grad)
        clip_factor = min(1.0, clip_norm / (grad_norm + 1e-8))
        clipped.append(grad * clip_factor)

    # Average clipped gradients
    avg_grad = np.mean(clipped, axis=0)

    # Add calibrated Gaussian noise
    noise = np.random.normal(0, noise_scale * clip_norm / len(gradients), avg_grad.shape)
    noisy_grad = avg_grad + noise

    return {
        "update": -lr * noisy_grad,
        "avg_clip_factor": np.mean([min(1, clip_norm / (np.linalg.norm(g) + 1e-8)) for g in gradients]),
        "noise_magnitude": np.linalg.norm(noise),
    }

# Simulate with 4 per-sample gradients
grads = [np.random.randn(10) * s for s in [0.5, 2.0, 0.3, 1.5]]
result = dp_sgd_step(grads, clip_norm=1.0, noise_scale=0.5, lr=0.01)
print(f"Avg clip factor: {result['avg_clip_factor']:.3f}")
print(f"Noise magnitude: {result['noise_magnitude']:.4f}")
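To make the privacy cost of a single step concrete, here is a rough sketch using the classical Gaussian-mechanism bound, σ = √(2 ln(1.25/δ)) · Δ/ε (valid only for ε ≤ 1). The function name is illustrative, and real training runs compose many steps with tighter accountants (Rényi DP / moments accountant), so this is an upper-bound intuition, not production accounting.

```python
import math

def single_step_epsilon(noise_scale: float, delta: float) -> float:
    """Invert the classical Gaussian-mechanism bound for one DP-SGD step.

    With gradients clipped to clip_norm and noise std noise_scale * clip_norm,
    the sensitivity-normalized noise multiplier is just noise_scale, so
    epsilon = sqrt(2 * ln(1.25 / delta)) / noise_scale. Only valid for
    epsilon <= 1; a crude upper bound, not a tight accountant.
    """
    return math.sqrt(2 * math.log(1.25 / delta)) / noise_scale

# Higher noise_scale buys a smaller (stronger) per-step epsilon
eps = single_step_epsilon(noise_scale=5.0, delta=1e-5)
print(f"Single-step epsilon: {eps:.2f}")  # ~0.97
```

Note the direct tradeoff this exposes: halving ε requires doubling the noise scale, which is exactly why aggressive privacy budgets degrade model quality.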

3. IP Ownership of LLM Outputs

[Figure: LLM IP Ownership — Open Questions. Training Data: fair use for training? opt-out mechanisms? compensation models? (active litigation). Model Outputs: who owns AI text? copyrightable? user vs. provider? (jurisdiction-dependent). Fine-Tuned Models: derivative work? license inheritance? trade secret status? (license-dependent).]
Figure 26.10.2: IP ownership questions span training data rights, output copyrightability, and fine-tuned model status.
def anonymize_text(text: str, entities: dict) -> str:
    """Replace identified entities with consistent pseudonyms."""
    pseudonym_map = {}
    counter = {}

    for entity_type, values in entities.items():
        counter[entity_type] = 0
        for value in values:
            if value not in pseudonym_map:
                counter[entity_type] += 1
                pseudonym_map[value] = f"[{entity_type}_{counter[entity_type]}]"

    result = text
    for original, pseudonym in pseudonym_map.items():
        result = result.replace(original, pseudonym)

    return result

text = "John Smith from Acme Corp called about order 12345."
anon = anonymize_text(text, {
    "PERSON": ["John Smith"],
    "ORG": ["Acme Corp"],
    "ID": ["12345"],
})
print(anon)
[PERSON_1] from [ORG_1] called about order [ID_1].
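Because anonymize_text builds a pseudonym map, retaining that map makes the transformation reversible — which is precisely what makes this pseudonymization rather than true anonymization. A minimal sketch of the reverse step (the function name is illustrative, not from any particular library):

```python
def deanonymize_text(text: str, pseudonym_map: dict) -> str:
    """Reverse pseudonymization using the retained original -> pseudonym map.

    Keeping this mapping table means the data can be re-identified, so under
    GDPR it is still personal data; true anonymization would discard the map
    irreversibly.
    """
    result = text
    for original, pseudonym in pseudonym_map.items():
        result = result.replace(pseudonym, original)
    return result

mapping = {"John Smith": "[PERSON_1]", "Acme Corp": "[ORG_1]", "12345": "[ID_1]"}
restored = deanonymize_text("[PERSON_1] from [ORG_1] called about order [ID_1].", mapping)
print(restored)  # John Smith from Acme Corp called about order 12345.
```

If reversibility is not required, discard (or never persist) the mapping table; whoever holds it holds the re-identification key.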
⚠ Warning

The Llama Community License requires that applications with more than 700 million monthly active users must request a separate license from Meta. If your product approaches this scale, you need a commercial agreement. Always read the full license terms, not just the summary, before deploying any model commercially.

📝 Note

Differential privacy provides a mathematical guarantee that any individual training example has limited influence on the trained model. The privacy budget (epsilon) controls the tradeoff: lower epsilon means stronger privacy but noisier gradients, typically reducing model quality. Current DP-SGD for LLMs remains an active research area, as the privacy-utility tradeoff is still steep for large models.

★ Key Insight

In most jurisdictions, AI-generated content cannot be copyrighted because copyright requires human authorship. However, if a human provides substantial creative direction in the prompt and edits the output, the resulting work may qualify for copyright protection. The boundary is unclear and varies by jurisdiction.

Knowledge Check

1. What is the key restriction in the Llama Community License compared to Apache 2.0?

Show Answer
The Llama Community License includes a monthly active user (MAU) threshold of 700 million, above which a separate commercial license from Meta is required. It also includes an acceptable use policy that prohibits certain harmful applications. Apache 2.0 has no such usage restrictions.

2. How does differential privacy protect individual training examples?

Show Answer
DP-SGD clips per-sample gradients to a maximum norm and adds calibrated Gaussian noise to the averaged gradient. This ensures that any single training example has limited influence on the final model parameters (bounded by the privacy budget epsilon). An attacker cannot determine with confidence whether a specific example was in the training set.

3. Why is the copyrightability of LLM outputs legally uncertain?

Show Answer
Copyright law requires human authorship. Purely AI-generated content without substantial human creative contribution does not qualify for copyright protection in most jurisdictions. However, the degree of human involvement (through prompting, editing, and selection) that crosses the threshold of "substantial creative contribution" is not well defined and varies by jurisdiction.

4. What is the difference between anonymization and pseudonymization?

Show Answer
Anonymization irreversibly removes identifying information so that the data can never be linked back to an individual. Pseudonymization replaces identifiers with consistent pseudonyms (tokens) that can be reversed with a mapping table. Under GDPR, pseudonymized data is still personal data (subject to GDPR), while properly anonymized data falls outside GDPR's scope.

5. Can fine-tuning a model under a restrictive license create a new, unrestricted model?

Show Answer
No. Fine-tuned models inherit the license restrictions of the base model. A model fine-tuned from Llama 3 is still subject to the Llama Community License, including the MAU threshold and acceptable use policy. The fine-tuned weights are a derivative work, and the original license terms propagate. Always check whether the base model's license permits your intended use before investing in fine-tuning.

Key Takeaways