The legal landscape for LLMs is complex and unsettled. Model licenses range from fully open (Apache 2.0) to restrictive "open weight" releases with acceptable use policies. Who owns the intellectual property in LLM outputs remains legally uncertain. Training data copyright is actively litigated. And privacy requirements demand technical solutions like anonymization and differential privacy. Engineers must understand these issues to make defensible deployment decisions.
## 1. Model License Taxonomy
| License Type | Commercial Use | Modification | Examples |
|---|---|---|---|
| Apache 2.0 | Yes, unrestricted | Yes | Mistral 7B, OLMo |
| MIT | Yes, unrestricted | Yes | Phi-3, some small models |
| Llama Community | Yes (under 700M MAU) | Yes | Llama 3, Llama 3.1 |
| Gemma Terms | Yes (with restrictions) | Yes | Gemma, Gemma 2 |
| CC-BY-NC | No | Yes (non-commercial) | Some research models |
| Proprietary API | Per ToS | No access to weights | GPT-4o, Claude, Gemini |
```python
def check_license_compatibility(model_license: str, use_case: dict) -> dict:
    """Check if a model license permits the intended use case."""
    rules = {
        "apache-2.0": {"commercial": True, "modification": True, "distribution": True},
        "llama-community": {"commercial": True, "modification": True,
                            "distribution": True, "mau_limit": 700_000_000},
        "cc-by-nc-4.0": {"commercial": False, "modification": True, "distribution": True},
    }
    license_rules = rules.get(model_license, {})
    issues = []
    if use_case.get("commercial") and not license_rules.get("commercial"):
        issues.append("Commercial use not permitted")
    if use_case.get("mau", 0) > license_rules.get("mau_limit", float("inf")):
        issues.append(f"MAU exceeds license limit of {license_rules['mau_limit']:,}")
    return {"compatible": len(issues) == 0, "issues": issues}

result = check_license_compatibility(
    "llama-community", {"commercial": True, "mau": 500_000}
)
print(result)
```
## 2. Differential Privacy for LLM Training
```python
import numpy as np

def dp_sgd_step(gradients: list, clip_norm: float, noise_scale: float, lr: float):
    """Simulate a DP-SGD gradient step with per-sample clipping and Gaussian noise."""
    clipped, clip_factors = [], []
    for grad in gradients:
        grad_norm = np.linalg.norm(grad)
        clip_factor = min(1.0, clip_norm / (grad_norm + 1e-8))
        clip_factors.append(clip_factor)
        clipped.append(grad * clip_factor)
    # Average the clipped per-sample gradients
    avg_grad = np.mean(clipped, axis=0)
    # Add Gaussian noise calibrated to the clipping norm and batch size
    noise = np.random.normal(0, noise_scale * clip_norm / len(gradients), avg_grad.shape)
    noisy_grad = avg_grad + noise
    return {
        "update": -lr * noisy_grad,
        "avg_clip_factor": np.mean(clip_factors),
        "noise_magnitude": np.linalg.norm(noise),
    }

# Simulate with 4 per-sample gradients of varying magnitude
grads = [np.random.randn(10) * s for s in [0.5, 2.0, 0.3, 1.5]]
result = dp_sgd_step(grads, clip_norm=1.0, noise_scale=0.5, lr=0.01)
print(f"Avg clip factor: {result['avg_clip_factor']:.3f}")
print(f"Noise magnitude: {result['noise_magnitude']:.4f}")
```
## 3. IP Ownership and Data Anonymization
```python
def anonymize_text(text: str, entities: dict) -> str:
    """Replace identified entities with consistent pseudonyms."""
    pseudonym_map = {}
    counter = {}
    for entity_type, values in entities.items():
        counter[entity_type] = 0
        for value in values:
            if value not in pseudonym_map:
                counter[entity_type] += 1
                pseudonym_map[value] = f"[{entity_type}_{counter[entity_type]}]"
    result = text
    for original, pseudonym in pseudonym_map.items():
        result = result.replace(original, pseudonym)
    return result

text = "John Smith from Acme Corp called about order 12345."
anon = anonymize_text(text, {
    "PERSON": ["John Smith"],
    "ORG": ["Acme Corp"],
    "ID": ["12345"],
})
print(anon)  # [PERSON_1] from [ORG_1] called about order [ID_1].
```
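Note that `anonymize_text` keeps a `pseudonym_map`, so anyone holding that map can reverse the substitution: this is pseudonymization, not anonymization. A minimal sketch of an irreversible variant (the `redact_text` helper is hypothetical, not from any library) simply discards any mapping:

```python
def redact_text(text: str, entities: dict) -> str:
    """Irreversibly replace entities with generic type tags; no mapping is kept."""
    result = text
    for entity_type, values in entities.items():
        for value in values:
            result = result.replace(value, f"[{entity_type}]")
    return result

print(redact_text(
    "John Smith from Acme Corp called about order 12345.",
    {"PERSON": ["John Smith"], "ORG": ["Acme Corp"], "ID": ["12345"]},
))  # [PERSON] from [ORG] called about order [ID].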
The Llama Community License grants commercial rights automatically only to licensees whose products and affiliates had fewer than 700 million monthly active users; above that threshold, you must request a separate license from Meta, which Meta may grant at its discretion. If your product approaches this scale, you need a commercial agreement. Always read the full license terms, not just the summary, before deploying any model commercially.
Differential privacy provides a mathematical guarantee that any individual training example has limited influence on the trained model. The privacy budget (epsilon) controls the tradeoff: lower epsilon means stronger privacy but noisier gradients, typically reducing model quality. Current DP-SGD for LLMs remains an active research area, as the privacy-utility tradeoff is still steep for large models.
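The epsilon-noise tradeoff can be made concrete with the classical Gaussian mechanism bound: for a single release with L2 sensitivity Δ, a noise standard deviation of σ = Δ·√(2·ln(1.25/δ))/ε satisfies (ε, δ)-differential privacy (valid for ε ≤ 1). A sketch of that calibration, ignoring the tighter composition accounting a real DP-SGD run would need:

```python
import math

def gaussian_noise_scale(sensitivity: float, epsilon: float, delta: float) -> float:
    """Noise std for (epsilon, delta)-DP under the classical Gaussian mechanism."""
    return sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon

# Lower epsilon (stronger privacy) demands proportionally more noise
for eps in [0.1, 0.5, 1.0]:
    sigma = gaussian_noise_scale(sensitivity=1.0, epsilon=eps, delta=1e-5)
    print(f"epsilon={eps}: sigma={sigma:.2f}")
```

In practice, DP-SGD libraries use tighter accountants (e.g., Rényi DP or moments accountant) that track privacy spend across thousands of steps far more efficiently than this single-release bound suggests.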
In most jurisdictions, AI-generated content cannot be copyrighted because copyright requires human authorship. However, if a human provides substantial creative direction in the prompt and edits the output, the resulting work may qualify for copyright protection. The boundary is unclear and varies by jurisdiction.
## Knowledge Check
1. What is the key restriction in the Llama Community License compared to Apache 2.0?
2. How does differential privacy protect individual training examples?
3. Why is the copyrightability of LLM outputs legally uncertain?
4. What is the difference between anonymization and pseudonymization?
5. Can fine-tuning a model under a restrictive license create a new, unrestricted model?
## Key Takeaways
- Model licenses range from fully open (Apache 2.0) to restrictive "open weight" releases; always read the full terms before commercial deployment.
- Fine-tuned models inherit the base model's license restrictions; there is no way to "fine-tune away" license obligations.
- AI-generated content generally cannot be copyrighted, but substantial human creative contribution may change this analysis.
- Differential privacy (DP-SGD) provides mathematical guarantees about individual training data influence, at the cost of model quality.
- Pseudonymization replaces identifiers with reversible tokens; anonymization is irreversible. Only anonymized data falls outside GDPR scope.
- Training data copyright is actively litigated; the fair use defense for AI training remains legally unsettled.