Not every team needs to manage their own GPU cluster. Provider APIs offer a managed fine-tuning experience where you upload your data, configure a few parameters, and receive a fine-tuned model endpoint. This approach trades control and flexibility for simplicity and speed. This section covers the two most widely used provider APIs (OpenAI and Google Vertex AI), walks through complete workflows for each, and provides a framework for deciding when managed fine-tuning is the right choice versus self-hosted training.
1. OpenAI Fine-Tuning API
OpenAI's fine-tuning API is the most accessible entry point for teams new to fine-tuning. It supports GPT-4o, GPT-4o-mini, and GPT-3.5-turbo models with a straightforward workflow: prepare data in JSONL format, upload the file, create a fine-tuning job, and use the resulting model through the standard chat completions API.
1.1 Data Preparation for OpenAI
OpenAI requires training data in JSONL format with the ChatML messages structure. Each line contains a JSON object with a messages array. The system message is optional but recommended for consistent behavior.
```python
import json
from typing import Dict, List

import numpy as np


def prepare_openai_training_file(
    examples: List[Dict],
    output_path: str,
    validate: bool = True,
) -> dict:
    """Prepare and validate a JSONL file for OpenAI fine-tuning."""
    stats = {"total": 0, "valid": 0, "errors": [], "token_estimates": []}
    valid_roles = {"system", "user", "assistant"}

    with open(output_path, "w") as f:
        for i, example in enumerate(examples):
            stats["total"] += 1
            messages = example.get("messages", [])
            if validate:
                # Validate message structure
                if not messages:
                    stats["errors"].append(f"Example {i}: empty messages")
                    continue
                roles = [m["role"] for m in messages]
                # Must end with an assistant message (the target the model learns)
                if roles[-1] != "assistant":
                    stats["errors"].append(
                        f"Example {i}: last message must be 'assistant'"
                    )
                    continue
                # Check for valid roles
                invalid = set(roles) - valid_roles
                if invalid:
                    stats["errors"].append(
                        f"Example {i}: invalid roles {invalid}"
                    )
                    continue
            # Rough token estimate (~4 characters per token)
            total_chars = sum(len(m["content"]) for m in messages)
            stats["token_estimates"].append(total_chars // 4)
            f.write(json.dumps(example) + "\n")
            stats["valid"] += 1

    if stats["token_estimates"]:
        tokens = np.array(stats["token_estimates"])
        stats["token_summary"] = {
            "mean": int(tokens.mean()),
            "median": int(np.median(tokens)),
            "p95": int(np.percentile(tokens, 95)),
            "total_training_tokens": int(tokens.sum()),
        }

    print(f"Prepared {stats['valid']}/{stats['total']} examples")
    if stats["errors"]:
        print(f"Errors: {len(stats['errors'])}")
        for err in stats["errors"][:5]:
            print(f"  {err}")
    return stats


# Example usage
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful customer support agent."},
            {"role": "user", "content": "How do I reset my password?"},
            {"role": "assistant", "content": "To reset your password, go to Settings, "
                "then Security, and click 'Reset Password'. You will receive an email "
                "with a reset link within 5 minutes."},
        ]
    },
    # ... more examples
]

stats = prepare_openai_training_file(examples, "train.jsonl")
```
1.2 Creating and Monitoring a Fine-Tuning Job
```python
import time

from openai import OpenAI

client = OpenAI()  # Uses the OPENAI_API_KEY environment variable

# Step 1: Upload training file
with open("train.jsonl", "rb") as f:
    training_file = client.files.create(file=f, purpose="fine-tune")
print(f"Uploaded file: {training_file.id}")

# Step 2: (Optional) Upload validation file
with open("val.jsonl", "rb") as f:
    validation_file = client.files.create(file=f, purpose="fine-tune")

# Step 3: Create fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    validation_file=validation_file.id,
    model="gpt-4o-mini-2024-07-18",
    hyperparameters={
        "n_epochs": 3,                    # Number of passes over the data
        "learning_rate_multiplier": 1.8,  # Relative to the provider default
        "batch_size": 4,                  # Chosen automatically if omitted
    },
    suffix="customer-support-v1",  # Custom model name suffix
)
print(f"Job created: {job.id}")
print(f"Status: {job.status}")


# Step 4: Monitor training progress
def monitor_fine_tuning(job_id: str, poll_interval: int = 60):
    """Poll a fine-tuning job until it reaches a terminal state."""
    while True:
        job = client.fine_tuning.jobs.retrieve(job_id)
        print(f"Status: {job.status}")
        # Show the most recent events (training metrics, state changes)
        events = client.fine_tuning.jobs.list_events(
            fine_tuning_job_id=job_id, limit=5
        )
        for event in events.data:
            print(f"  [{event.created_at}] {event.message}")
        if job.status in ("succeeded", "failed", "cancelled"):
            break
        time.sleep(poll_interval)

    if job.status == "succeeded":
        print(f"\nFine-tuned model: {job.fine_tuned_model}")
        return job.fine_tuned_model
    print(f"\nJob {job.status}: {job.error}")
    return None


model_name = monitor_fine_tuning(job.id)
```
1.3 Using the Fine-Tuned Model
```python
# Step 5: Use the fine-tuned model (identical to standard API calls)
response = client.chat.completions.create(
    model="ft:gpt-4o-mini-2024-07-18:my-org:customer-support-v1:9abc123",
    messages=[
        {"role": "system", "content": "You are a helpful customer support agent."},
        {"role": "user", "content": "I can't find my order confirmation email."},
    ],
    temperature=0.7,
    max_tokens=200,
)
print(response.choices[0].message.content)
```
OpenAI fine-tuning pricing. You pay for training tokens (the number of tokens in your dataset multiplied by the number of epochs) and for inference on the fine-tuned model (which is more expensive per token than the base model). For GPT-4o-mini, training costs approximately $3.00 per million tokens and inference costs $0.30/$1.20 per million input/output tokens. Always estimate total cost before starting a job, especially with large datasets.
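As a quick sanity check before submitting a job, the billed training tokens are simply your dataset's token count multiplied by the number of epochs. The following back-of-envelope helper uses the illustrative GPT-4o-mini rate quoted above ($3.00 per million training tokens); always verify against OpenAI's current pricing page, since rates change.

```python
def estimate_training_cost(dataset_tokens: int, n_epochs: int,
                           price_per_million: float = 3.00) -> float:
    """Billed training tokens = dataset tokens x epochs; cost scales linearly."""
    return dataset_tokens * n_epochs / 1_000_000 * price_per_million


# 10K examples averaging 500 tokens, trained for 3 epochs
# -> 15M billed tokens -> $45.00 at the illustrative rate
cost = estimate_training_cost(10_000 * 500, n_epochs=3)
print(f"Estimated training cost: ${cost:.2f}")
```

Running this check first is especially important with large datasets, where an extra epoch can add tens of dollars per run.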
2. Google Vertex AI Fine-Tuning
Google Vertex AI provides fine-tuning for Gemini models with a similar managed experience. The workflow uses the Google Cloud SDK and supports both supervised fine-tuning and RLHF-style tuning. Vertex AI gives you slightly more control over hyperparameters compared to OpenAI.
2.1 Vertex AI Workflow
```python
import time

import vertexai
from google.cloud import storage
from vertexai.tuning import sft as vertex_sft

# Initialize Vertex AI
vertexai.init(project="my-project-id", location="us-central1")


# Step 1: Upload training data to GCS (Google Cloud Storage).
# Vertex AI reads training data from GCS; the format is JSONL with the
# same messages structure as OpenAI.
def upload_to_gcs(local_path: str, bucket_name: str, blob_name: str) -> str:
    """Upload training data to Google Cloud Storage."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    blob = bucket.blob(blob_name)
    blob.upload_from_filename(local_path)
    gcs_uri = f"gs://{bucket_name}/{blob_name}"
    print(f"Uploaded to {gcs_uri}")
    return gcs_uri


train_uri = upload_to_gcs(
    "train.jsonl",
    "my-training-bucket",
    "fine-tuning/medical-qa/train.jsonl",
)
val_uri = upload_to_gcs(
    "val.jsonl",
    "my-training-bucket",
    "fine-tuning/medical-qa/val.jsonl",
)

# Step 2: Create supervised fine-tuning job
sft_tuning_job = vertex_sft.train(
    source_model="gemini-1.5-flash-002",
    train_dataset=train_uri,
    validation_dataset=val_uri,
    epochs=3,
    adapter_size=4,  # LoRA rank (1, 4, 8, or 16)
    learning_rate_multiplier=1.0,
    tuned_model_display_name="medical-qa-gemini-v1",
)

# Step 3: Monitor until the job reaches a terminal state
print(f"Job resource: {sft_tuning_job.resource_name}")
while not sft_tuning_job.has_ended:
    time.sleep(60)
    sft_tuning_job.refresh()
    print(f"State: {sft_tuning_job.state}")

# Step 4: Get the tuned model endpoint
tuned_model = sft_tuning_job.tuned_model_endpoint_name
print(f"Tuned model endpoint: {tuned_model}")
```
2.2 Using the Vertex AI Fine-Tuned Model
```python
from vertexai.generative_models import GenerativeModel

# Load the fine-tuned model from its endpoint
model = GenerativeModel(model_name=tuned_model)

# Generate responses
response = model.generate_content(
    "Patient presents with recurring headaches and blurred vision. "
    "Suggest differential diagnoses.",
    generation_config={
        "temperature": 0.3,
        "max_output_tokens": 500,
    },
)
print(response.text)
```
3. Provider Comparison
| Aspect | OpenAI | Google Vertex AI | Self-Hosted (TRL) |
|---|---|---|---|
| Available models | GPT-4o, GPT-4o-mini, GPT-3.5 | Gemini 1.5 Flash, Gemini 1.5 Pro | Any open-weight model |
| Data format | JSONL (ChatML) | JSONL (ChatML) | Any (ChatML, Alpaca, ShareGPT) |
| Max training data | 50 million tokens | 10,000 examples | Unlimited |
| Hyperparameter control | Epochs, LR multiplier, batch size | Epochs, LR multiplier, adapter size | Full control over all parameters |
| Training cost (10K examples) | ~$15 to $50 (GPT-4o-mini) | ~$10 to $40 (Gemini Flash) | $5 to $20 (cloud GPU rental) |
| Time to first result | 30 min to 2 hours | 1 to 3 hours | Hours to days (setup + training) |
| Data privacy | Data processed by OpenAI | Data processed by Google | Data stays on your servers |
| Model weights access | No (API only) | No (API only) | Full access |
| Serving | Included (pay per token) | Included (pay per token) | Self-managed (vLLM, TGI) |
4. Cost Analysis Framework
The true cost of API fine-tuning depends on your training data size, the number of epochs, and your expected inference volume. The following calculator helps you estimate and compare costs across providers and approaches.
```python
from dataclasses import dataclass


@dataclass
class FineTuningCostEstimate:
    """Compare fine-tuning costs across providers."""

    # Dataset parameters
    num_examples: int = 10_000
    avg_tokens_per_example: int = 500
    num_epochs: int = 3
    # Inference parameters (monthly)
    monthly_requests: int = 100_000
    avg_input_tokens: int = 300
    avg_output_tokens: int = 150

    def openai_cost(self, model: str = "gpt-4o-mini") -> dict:
        """Estimate OpenAI fine-tuning + inference costs."""
        # Training pricing (per 1M tokens)
        training_prices = {
            "gpt-4o-mini": {"train": 3.00},
            "gpt-4o": {"train": 25.00},
        }
        # Inference pricing (per 1M tokens, fine-tuned models)
        inference_prices = {
            "gpt-4o-mini": {"input": 0.30, "output": 1.20},
            "gpt-4o": {"input": 3.75, "output": 15.00},
        }
        train_price = training_prices[model]
        infer_price = inference_prices[model]

        # Training cost: dataset tokens x epochs
        total_training_tokens = (
            self.num_examples * self.avg_tokens_per_example * self.num_epochs
        )
        training_cost = (total_training_tokens / 1_000_000) * train_price["train"]

        # Monthly inference cost
        monthly_input_tokens = self.monthly_requests * self.avg_input_tokens
        monthly_output_tokens = self.monthly_requests * self.avg_output_tokens
        monthly_inference = (
            (monthly_input_tokens / 1_000_000) * infer_price["input"]
            + (monthly_output_tokens / 1_000_000) * infer_price["output"]
        )

        return {
            "provider": f"OpenAI ({model})",
            "training_cost": f"${training_cost:.2f}",
            "monthly_inference": f"${monthly_inference:.2f}",
            "annual_total": f"${training_cost + monthly_inference * 12:.2f}",
            "training_tokens": f"{total_training_tokens:,}",
        }

    def self_hosted_cost(self, gpu_hourly: float = 2.50) -> dict:
        """Estimate self-hosted fine-tuning costs."""
        # Rough throughput estimate: ~10K tokens/second on an A100
        total_training_tokens = (
            self.num_examples * self.avg_tokens_per_example * self.num_epochs
        )
        training_hours = total_training_tokens / (10_000 * 3600)
        training_cost = training_hours * gpu_hourly

        # Serving: one always-on dedicated GPU instance
        serving_monthly = gpu_hourly * 24 * 30

        return {
            "provider": "Self-hosted (A100)",
            "training_cost": f"${training_cost:.2f}",
            "monthly_inference": f"${serving_monthly:.2f}",
            "annual_total": f"${training_cost + serving_monthly * 12:.2f}",
            "training_tokens": f"{total_training_tokens:,}",
        }


# Compare costs
estimator = FineTuningCostEstimate(
    num_examples=10_000,
    monthly_requests=100_000,
)
for result in [
    estimator.openai_cost("gpt-4o-mini"),
    estimator.openai_cost("gpt-4o"),
    estimator.self_hosted_cost(),
]:
    print(f"\n{result['provider']}:")
    for k, v in result.items():
        if k != "provider":
            print(f"  {k}: {v}")
```
The breakeven point is about volume. API fine-tuning is cheaper at low to moderate inference volumes, while self-hosting wins at high volumes because you pay a fixed infrastructure cost regardless of how many requests you serve. The exact crossover depends on your tokens per request and GPU pricing; with the illustrative rates used in the calculator above, an always-on A100 instance only pays for itself once GPT-4o-mini traffic reaches well into the millions of requests per month. For most startups and early-stage projects, API fine-tuning is the right starting point. Transition to self-hosted when your monthly API bill consistently exceeds the cost of a dedicated GPU instance.
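The crossover volume can be computed directly: find the monthly request count at which per-token API spend equals a fixed GPU cost. The rates below are illustrative (fine-tuned GPT-4o-mini pricing and a $2.50/hour A100, as in the calculator above), not current quotes.

```python
def breakeven_monthly_requests(gpu_monthly: float,
                               input_tokens: int, output_tokens: int,
                               input_price: float, output_price: float) -> int:
    """Monthly request count at which a dedicated GPU matches per-token API cost.

    Prices are per 1M tokens; gpu_monthly is the fixed serving cost per month.
    """
    cost_per_request = (input_tokens * input_price
                        + output_tokens * output_price) / 1_000_000
    return int(gpu_monthly / cost_per_request)


# Illustrative: $2.50/hr always-on A100 vs fine-tuned GPT-4o-mini rates,
# 300 input / 150 output tokens per request at $0.30 / $1.20 per 1M tokens
requests = breakeven_monthly_requests(2.50 * 24 * 30, 300, 150, 0.30, 1.20)
print(f"Breakeven: ~{requests:,} requests/month")
```

Rerun this with your own token counts and GPU quotes; heavier prompts or cheaper GPUs shift the crossover substantially.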
Data privacy is non-negotiable for some industries. If you work in healthcare (HIPAA), finance (SOC 2), or government (FedRAMP), sending training data to a third-party API may violate compliance requirements. Always verify that your provider's data handling policies meet your regulatory obligations before uploading any data. When in doubt, use self-hosted fine-tuning to keep data within your controlled environment.
5. Best Practices for API Fine-Tuning
5.1 Iterative Refinement Workflow
Start with 100 to 500 examples. Many teams over-invest in data collection before validating that fine-tuning will work for their use case. Begin with a small, high-quality dataset and run a quick fine-tuning job. If the results are promising, scale up the data. If the model does not improve, the problem may be with your task framing, data quality, or prompt design rather than data quantity.
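One lightweight way to run this pilot is to sample a small subset from your full data pool before committing to a complete labeling effort. The helper below is a hypothetical sketch (the file names and `sample_pilot_dataset` function are illustrative, not part of any provider SDK); it assumes your pool is already in the JSONL messages format used above.

```python
import json
import random


def sample_pilot_dataset(source_path: str, pilot_path: str,
                         n: int = 300, seed: int = 42) -> int:
    """Sample a small pilot set from a larger JSONL pool for a first fine-tuning run."""
    with open(source_path) as f:
        examples = [json.loads(line) for line in f]
    random.seed(seed)  # Fixed seed so the pilot set is reproducible
    pilot = random.sample(examples, min(n, len(examples)))
    with open(pilot_path, "w") as f:
        for ex in pilot:
            f.write(json.dumps(ex) + "\n")
    return len(pilot)
```

Fine-tune on the pilot file first; only scale up data collection if the pilot model clearly outperforms a prompted baseline on your evaluation set.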
Section 13.4 Quiz
Key Takeaways
- API fine-tuning is the fastest path from data to a deployed fine-tuned model, requiring no GPU infrastructure or ML engineering expertise beyond data preparation.
- OpenAI and Vertex AI both use the ChatML/messages JSONL format, making it easy to prepare data that works across providers.
- Start small (100 to 500 examples) and iterate. Do not invest weeks in data collection before validating that fine-tuning improves your specific task.
- API fine-tuning is cost-effective at low to moderate request volumes; self-hosted becomes cheaper at high volumes, where a fixed GPU cost beats per-token pricing.
- Data privacy requirements may mandate self-hosted fine-tuning in regulated industries (healthcare, finance, government).
- You trade control for convenience: API fine-tuning limits hyperparameter access and locks you into the provider's serving infrastructure.