Module 13 · Section 13.4

Fine-Tuning via Provider APIs

Leveraging OpenAI, Google Vertex AI, and other provider APIs for managed fine-tuning, with analysis of trade-offs between ease, control, and cost
★ Big Picture

Not every team needs to manage their own GPU cluster. Provider APIs offer a managed fine-tuning experience where you upload your data, configure a few parameters, and receive a fine-tuned model endpoint. This approach trades control and flexibility for simplicity and speed. This section covers the two most widely used provider APIs (OpenAI and Google Vertex AI), walks through complete workflows for each, and provides a framework for deciding when managed fine-tuning is the right choice versus self-hosted training.

1. OpenAI Fine-Tuning API

OpenAI's fine-tuning API is the most accessible entry point for teams new to fine-tuning. It supports GPT-4o, GPT-4o-mini, and GPT-3.5-turbo models with a straightforward workflow: prepare data in JSONL format, upload the file, create a fine-tuning job, and use the resulting model through the standard chat completions API.

1.1 Data Preparation for OpenAI

OpenAI requires training data in JSONL format with the ChatML messages structure. Each line contains a JSON object with a messages array. The system message is optional but recommended for consistent behavior.

import json
from typing import List, Dict

def prepare_openai_training_file(
    examples: List[Dict],
    output_path: str,
    validate: bool = True
) -> dict:
    """Prepare and validate a JSONL file for OpenAI fine-tuning."""
    stats = {"total": 0, "valid": 0, "errors": [], "token_estimates": []}

    with open(output_path, "w") as f:
        for i, example in enumerate(examples):
            stats["total"] += 1
            messages = example.get("messages", [])

            if validate:
                # Validate message structure
                if not messages:
                    stats["errors"].append(f"Example {i}: empty messages")
                    continue

                roles = [m["role"] for m in messages]

                # Must end with assistant message
                if roles[-1] != "assistant":
                    stats["errors"].append(
                        f"Example {i}: last message must be 'assistant'"
                    )
                    continue

                # Check for valid roles
                valid_roles = {"system", "user", "assistant"}
                invalid = set(roles) - valid_roles
                if invalid:
                    stats["errors"].append(
                        f"Example {i}: invalid roles {invalid}"
                    )
                    continue

                # Rough token estimate (4 chars per token)
                total_chars = sum(len(m["content"]) for m in messages)
                estimated_tokens = total_chars // 4
                stats["token_estimates"].append(estimated_tokens)

            f.write(json.dumps(example) + "\n")
            stats["valid"] += 1

    if stats["token_estimates"]:
        import numpy as np
        tokens = np.array(stats["token_estimates"])
        stats["token_summary"] = {
            "mean": int(tokens.mean()),
            "median": int(np.median(tokens)),
            "p95": int(np.percentile(tokens, 95)),
            "total_training_tokens": int(tokens.sum()),
        }

    print(f"Prepared {stats['valid']}/{stats['total']} examples")
    if stats["errors"]:
        print(f"Errors: {len(stats['errors'])}")
        for err in stats["errors"][:5]:
            print(f"  {err}")

    return stats

# Example usage
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "To reset your password, go to Settings, "
                "then Security, and click 'Reset Password'. You will receive an email "
                "with a reset link within 5 minutes."}
        ]
    },
    # ... more examples
]

stats = prepare_openai_training_file(examples, "train.jsonl")

1.2 Creating and Monitoring a Fine-Tuning Job

from openai import OpenAI
import time

client = OpenAI()  # Uses OPENAI_API_KEY env var

# Step 1: Upload training file
with open("train.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")
print(f"Uploaded file: {training_file.id}")

# Step 2: (Optional) Upload validation file
with open("val.jsonl", "rb") as f:
    validation_file = client.files.create(file=f, purpose="fine-tune")

# Step 3: Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    validation_file=validation_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,                    # Number of epochs
        "learning_rate_multiplier": 1.8,  # Relative to default
        "batch_size": 4,                  # Auto if not specified
    },
    suffix="customer-support-v1",  # Custom model name suffix
)
print(f"Job created: {job.id}")
print(f"Status: {job.status}")

# Step 4: Monitor training progress
def monitor_fine_tuning(job_id: str, poll_interval: int = 60):
    """Poll a fine-tuning job until completion."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        print(f"Status: {job.status}")

        # Check for events (training metrics)
        events = client.fine_tuning.jobs.list_events(
            fine_tuning_job_id=job_id, limit=5
        )
        for event in events.data:
            print(f"  [{event.created_at}] {event.message}")

        if job.status in ("succeeded", "failed", "cancelled"):
            break

        time.sleep(poll_interval)

    if job.status == "succeeded":
        print(f"\nFine-tuned model: {job.fine_tuned_model}")
        return job.fine_tuned_model
    else:
        print(f"\nJob {job.status}: {job.error}")
        return None

model_name = monitor_fine_tuning(job.id)
Uploaded file: file-abc123
Job created: ftjob-xyz789
Status: validating_files
Status: running
  [1711234567] Step 100/300: training loss=1.234
  [1711234600] Step 200/300: training loss=0.876
Status: running
  [1711234633] Step 300/300: training loss=0.654
Status: succeeded

Fine-tuned model: ft:gpt-4o-mini-2024-07-18:my-org:customer-support-v1:9abc123

1.3 Using the Fine-Tuned Model

# Step 5: Use the fine-tuned model (identical to standard API calls)
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:my-org:customer-support-v1:9abc123",
    messages=[
        {"role": "system", "content": "You are a helpful customer support agent."},
        {"role": "user", "content": "I can't find my order confirmation email."},
    ],
    temperature=0.7,
    max_tokens=200,
)

print(response.choices[0].message.content)
📝 Note

OpenAI fine-tuning pricing. You pay for training tokens (the number of tokens in your dataset multiplied by the number of epochs) and for inference on the fine-tuned model (which is more expensive per token than the base model). For GPT-4o-mini, training costs approximately $3.00 per million tokens and inference costs $0.30/$1.20 per million input/output tokens. Always estimate total cost before starting a job, especially with large datasets.
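The note above says to estimate total cost before starting a job. Here is a minimal sketch of that back-of-the-envelope calculation; the $3.00 per million training tokens is the GPT-4o-mini rate quoted above, and you should check current pricing before relying on it:

```python
def estimate_training_cost(
    num_examples: int,
    avg_tokens_per_example: int,
    n_epochs: int,
    price_per_million: float = 3.00,  # GPT-4o-mini training rate (assumed)
) -> float:
    """Rough training cost in USD: dataset tokens x epochs x price."""
    total_tokens = num_examples * avg_tokens_per_example * n_epochs
    return (total_tokens / 1_000_000) * price_per_million

# 10,000 examples averaging 500 tokens, trained for 3 epochs:
print(f"${estimate_training_cost(10_000, 500, 3):.2f}")  # $45.00
```

Pair this with the token estimates returned by `prepare_openai_training_file` above to get `avg_tokens_per_example` from your actual data rather than a guess.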

2. Google Vertex AI Fine-Tuning

Google Vertex AI provides fine-tuning for Gemini models with a similar managed experience. The workflow uses the Google Cloud SDK and supports both supervised fine-tuning and RLHF-style tuning. Vertex AI gives you slightly more control over hyperparameters compared to OpenAI.

2.1 Vertex AI Workflow

import time

import vertexai
from vertexai.tuning import sft as vertex_sft
from google.cloud import storage

# Initialize Vertex AI
vertexai.init(project="my-project-id", location="us-central1")

# Step 1: Upload training data to GCS
# Vertex AI expects data in GCS (Google Cloud Storage)
# Format: JSONL; for Gemini models each line uses the contents/parts schema
# (see the Vertex AI tuning docs), not OpenAI's messages array

def upload_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> str:
    """Upload training data to Google Cloud Storage."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    gcs_uri = f"gs://{bucket_name}/{blob_name}"
    print(f"Uploaded to {gcs_uri}")
    return gcs_uri

train_uri = upload_to_gcs(
    "train.jsonl",
    "my-training-bucket",
    "fine-tuning/medical-qa/train.jsonl"
)
val_uri = upload_to_gcs(
    "val.jsonl",
    "my-training-bucket",
    "fine-tuning/medical-qa/val.jsonl"
)

# Step 2: Create supervised fine-tuning job
sft_tuning_job = vertex_sft.train(
    source_model="gemini-1.5-flash-002",
    train_dataset=train_uri,
    validation_dataset=val_uri,
    epochs=3,
    adapter_size=4,        # LoRA rank (1, 4, 8, or 16)
    learning_rate_multiplier=1.0,
    tuned_model_display_name="medical-qa-gemini-v1",
)

# Step 3: Monitor the job (poll until it finishes)
print(f"Job resource: {sft_tuning_job.resource_name}")

# Poll for completion
while not sft_tuning_job.has_ended:
    time.sleep(60)
    sft_tuning_job.refresh()
    print(f"State: {sft_tuning_job.state}")

# Step 4: Get the tuned model endpoint
tuned_model = sft_tuning_job.tuned_model_endpoint_name
print(f"Tuned model endpoint: {tuned_model}")

2.2 Using the Vertex AI Fine-Tuned Model

from vertexai.generative_models import GenerativeModel

# Load the fine-tuned model
model = GenerativeModel(
    model_name=tuned_model,  # Endpoint from training
)

# Generate responses
response = model.generate_content(
    "Patient presents with recurring headaches and blurred vision. "
    "Suggest differential diagnoses.",
    generation_config={
        "temperature": 0.3,
        "max_output_tokens": 500,
    }
)

print(response.text)

3. Provider Comparison

API Fine-Tuning vs. Self-Hosted: Decision Factors

Provider API Fine-Tuning
✔ No GPU infrastructure needed
✔ Minutes to start training
✔ Automatic scaling and serving
✔ Built-in monitoring dashboard
✘ Limited hyperparameter control
✘ Data sent to third party
✘ Vendor lock-in risk
✘ Higher per-token inference cost

Self-Hosted Fine-Tuning
✔ Full hyperparameter control
✔ Data stays on your infrastructure
✔ Any model (open-weight)
✔ Lower cost at scale
✘ GPU procurement and management
✘ ML engineering expertise required
✘ Serving infrastructure needed
✘ Days to weeks for setup
Figure 13.9: Provider API fine-tuning vs. self-hosted: each approach has distinct advantages and limitations
| Aspect | OpenAI | Google Vertex AI | Self-Hosted (TRL) |
| --- | --- | --- | --- |
| Available models | GPT-4o, GPT-4o-mini, GPT-3.5 | Gemini 1.5 Flash, Gemini 1.5 Pro | Any open-weight model |
| Data format | JSONL (ChatML) | JSONL (Gemini contents schema) | Any (ChatML, Alpaca, ShareGPT) |
| Training data limit | 50 million tokens | 10,000 examples | Unlimited |
| Hyperparameter control | Epochs, LR multiplier, batch size | Epochs, LR multiplier, adapter size | Full control over all parameters |
| Training cost (10K examples) | ~$15 to $50 (GPT-4o-mini) | ~$10 to $40 (Gemini Flash) | $5 to $20 (cloud GPU rental) |
| Time to first result | 30 min to 2 hours | 1 to 3 hours | Hours to days (setup + training) |
| Data privacy | Data processed by OpenAI | Data processed by Google | Data stays on your servers |
| Model weights access | No (API only) | No (API only) | Full access |
| Serving | Included (pay per token) | Included (pay per token) | Self-managed (vLLM, TGI) |

4. Cost Analysis Framework

The true cost of API fine-tuning depends on your training data size, the number of epochs, and your expected inference volume. The following calculator helps you estimate and compare costs across providers and approaches.

from dataclasses import dataclass

@dataclass
class FineTuningCostEstimate:
    """Compare fine-tuning costs across providers."""

    # Dataset parameters
    num_examples: int = 10_000
    avg_tokens_per_example: int = 500
    num_epochs: int = 3

    # Inference parameters (monthly)
    monthly_requests: int = 100_000
    avg_input_tokens: int = 300
    avg_output_tokens: int = 150

    def openai_cost(self, model: str = "gpt-4o-mini") -> dict:
        """Estimate OpenAI fine-tuning + inference costs."""
        # Training pricing (per 1M tokens)
        training_prices = {
            "gpt-4o-mini": {"train": 3.00},
            "gpt-4o": {"train": 25.00},
        }
        # Inference pricing (per 1M tokens, fine-tuned models)
        inference_prices = {
            "gpt-4o-mini": {"input": 0.30, "output": 1.20},
            "gpt-4o": {"input": 3.75, "output": 15.00},
        }

        train_price = training_prices[model]
        infer_price = inference_prices[model]

        # Training cost
        total_training_tokens = (
            self.num_examples * self.avg_tokens_per_example * self.num_epochs
        )
        training_cost = (total_training_tokens / 1_000_000) * train_price["train"]

        # Monthly inference cost
        monthly_input_tokens = self.monthly_requests * self.avg_input_tokens
        monthly_output_tokens = self.monthly_requests * self.avg_output_tokens
        monthly_inference = (
            (monthly_input_tokens / 1_000_000) * infer_price["input"] +
            (monthly_output_tokens / 1_000_000) * infer_price["output"]
        )

        return {
            "provider": f"OpenAI ({model})",
            "training_cost": f"${training_cost:.2f}",
            "monthly_inference": f"${monthly_inference:.2f}",
            "annual_total": f"${training_cost + monthly_inference * 12:.2f}",
            "training_tokens": f"{total_training_tokens:,}",
        }

    def self_hosted_cost(self, gpu_hourly: float = 2.50) -> dict:
        """Estimate self-hosted fine-tuning costs."""
        # Rough estimate: ~10K tokens/second on A100
        total_training_tokens = (
            self.num_examples * self.avg_tokens_per_example * self.num_epochs
        )
        training_hours = total_training_tokens / (10_000 * 3600)
        training_cost = training_hours * gpu_hourly

        # Serving: dedicated GPU instance
        serving_monthly = gpu_hourly * 24 * 30  # Always-on

        return {
            "provider": "Self-hosted (A100)",
            "training_cost": f"${training_cost:.2f}",
            "monthly_inference": f"${serving_monthly:.2f}",
            "annual_total": f"${training_cost + serving_monthly * 12:.2f}",
            "training_tokens": f"{total_training_tokens:,}",
        }

# Compare costs
estimator = FineTuningCostEstimate(
    num_examples=10_000,
    monthly_requests=100_000,
)

for result in [
    estimator.openai_cost("gpt-4o-mini"),
    estimator.openai_cost("gpt-4o"),
    estimator.self_hosted_cost(),
]:
    print(f"\n{result['provider']}:")
    for k, v in result.items():
        if k != "provider":
            print(f"  {k}: {v}")
OpenAI (gpt-4o-mini):
  training_cost: $45.00
  monthly_inference: $27.00
  annual_total: $369.00
  training_tokens: 15,000,000

OpenAI (gpt-4o):
  training_cost: $375.00
  monthly_inference: $337.50
  annual_total: $4,425.00
  training_tokens: 15,000,000

Self-hosted (A100):
  training_cost: $1.04
  monthly_inference: $1,800.00
  annual_total: $21,601.04
  training_tokens: 15,000,000
🔑 Key Insight

The breakeven point is about volume. API fine-tuning is cheaper at low to moderate inference volumes; self-hosted becomes cheaper at high volumes because you pay a fixed infrastructure cost regardless of how many requests you serve. Under the assumptions in the cost model above, an always-on A100 at ~$1,800/month does not beat fine-tuned GPT-4o-mini (about $0.00027 per request) until monthly volume reaches several million requests, though against the pricier GPT-4o the breakeven falls to roughly 500K requests per month. For most startups and early-stage projects, API fine-tuning is the right starting point. Transition to self-hosted when your monthly API bill consistently exceeds the cost of a dedicated GPU instance.
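The breakeven volume can be computed directly from this section's assumed numbers (fine-tuned GPT-4o-mini at $0.30/$1.20 per 1M input/output tokens, an always-on A100 at ~$1,800/month, 300 input and 150 output tokens per request); treat the result as an order-of-magnitude guide, not a plan:

```python
def breakeven_monthly_requests(
    gpu_monthly: float = 1800.0,   # always-on A100 at $2.50/hr (assumed)
    input_price: float = 0.30,     # $/1M input tokens, fine-tuned 4o-mini
    output_price: float = 1.20,    # $/1M output tokens, fine-tuned 4o-mini
    avg_input_tokens: int = 300,
    avg_output_tokens: int = 150,
) -> int:
    """Requests/month at which API inference cost equals a dedicated GPU."""
    cost_per_request = (
        avg_input_tokens / 1_000_000 * input_price
        + avg_output_tokens / 1_000_000 * output_price
    )
    return int(gpu_monthly / cost_per_request)

print(f"{breakeven_monthly_requests():,}")  # roughly 6.7M requests/month
```

Plugging in the fine-tuned GPT-4o rates ($3.75/$15.00) instead drops the breakeven to around 500K requests per month, which is why the model you choose matters as much as the volume.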

⚠ Warning

Data privacy is non-negotiable for some industries. If you work in healthcare (HIPAA), finance (SOC 2), or government (FedRAMP), sending training data to a third-party API may violate compliance requirements. Always verify that your provider's data handling policies meet your regulatory obligations before uploading any data. When in doubt, use self-hosted fine-tuning to keep data within your controlled environment.

5. Best Practices for API Fine-Tuning

5.1 Iterative Refinement Workflow

Iterative API Fine-Tuning Workflow:
1. Start Small: 100 to 500 examples
2. Train & Eval: 1 to 2 epochs first
3. Error Analysis: Review failures
4. Improve Data: Fix + add examples
Repeat until quality target met.
Figure 13.10: Start with a small dataset, evaluate, identify failure patterns, and iteratively improve
📝 Note

Start with 100 to 500 examples. Many teams over-invest in data collection before validating that fine-tuning will work for their use case. Begin with a small, high-quality dataset and run a quick fine-tuning job. If the results are promising, scale up the data. If the model does not improve, the problem may be with your task framing, data quality, or prompt design rather than data quantity.
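The error-analysis step in this loop benefits from structure: rather than eyeballing failures, bucket them by category so you know which examples to add next. A minimal sketch, where `generate` and `grade` are hypothetical placeholders (in practice, `generate` calls your fine-tuned model and `grade` is human review or an LLM-as-judge rubric):

```python
from collections import Counter

def error_analysis(eval_set, generate, grade):
    """Run eval prompts through `generate` and bucket failures by category.

    eval_set: list of {"prompt": ..., "reference": ...} dicts.
    grade: returns "pass" or a failure label like "wrong_tone".
    """
    failures = Counter()
    for item in eval_set:
        output = generate(item["prompt"])
        verdict = grade(item["prompt"], output, item["reference"])
        if verdict != "pass":
            failures[verdict] += 1
    return failures

# After each training round: inspect the top failure buckets, add 20-50
# targeted examples addressing them, and retrain.
```

The failure labels become a roadmap for the "Improve Data" step: if `wrong_tone` dominates, add examples demonstrating the desired tone; if `factual_error` dominates, the fix may be retrieval or task framing rather than more fine-tuning data.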

Section 13.4 Quiz

1. What data format does OpenAI's fine-tuning API require?
Show Answer
OpenAI requires JSONL (JSON Lines) format where each line is a JSON object containing a messages array with role/content pairs. The roles must be "system" (optional), "user", and "assistant". The last message in each example must have the "assistant" role, as this is the response the model will learn to generate.
2. At what monthly request volume does self-hosted fine-tuning typically become cheaper than API fine-tuning with GPT-4o-mini?
Show Answer
Under this section's cost model, fine-tuned GPT-4o-mini costs only about $0.00027 per request (300 input + 150 output tokens), so a dedicated A100 at roughly $1,800/month does not break even until about 6 to 7 million requests per month. Against the pricier fine-tuned GPT-4o (about $0.0034 per request), breakeven falls to roughly 500,000 requests per month. The exact threshold depends on average sequence length and how well the GPU is utilized.
3. Why might a team choose Google Vertex AI fine-tuning over OpenAI for the same task?
Show Answer
Key reasons include: (1) The team already uses Google Cloud infrastructure and wants to keep data within their GCP environment. (2) Vertex AI supports adapter size selection (LoRA rank), giving more control over the efficiency/quality tradeoff. (3) Gemini models may perform better on certain multilingual or multimodal tasks. (4) Regulatory requirements mandate using a specific cloud provider. (5) Pricing may be more favorable for their specific usage pattern.
4. What is the recommended approach when starting API fine-tuning for a new use case?
Show Answer
Start with a small dataset of 100 to 500 high-quality examples and run a quick 1 to 2 epoch training job. Evaluate the results against your quality targets. If the results are promising, perform error analysis on the failures, add targeted examples to address the failure patterns, and retrain. Repeat this iterative cycle, gradually scaling up the dataset. This approach avoids investing weeks in data collection before validating that fine-tuning is viable for the task.
5. What are two key limitations of API-based fine-tuning compared to self-hosted training?
Show Answer
(1) Limited hyperparameter control: API providers expose only a few hyperparameters (epochs, learning rate multiplier, batch size), while self-hosted training gives you full control over the optimizer, scheduler, gradient accumulation, PEFT configuration, data collation, and more. (2) No access to model weights: with API fine-tuning, you can only use the model through the provider's API. You cannot export the weights, run the model locally, apply further techniques like quantization, or switch serving infrastructure.

Key Takeaways