The capstone is not a collection of independent components; it is a single integrated system where every piece must work together. The requirements below define the minimum bar for each component. Strong projects will go beyond the minimum in areas where the chosen use case demands it. The scoring rubric weights integration and evaluation more heavily than any individual component, because the ability to make an entire system work end-to-end is the most valuable skill this course develops.
C.1: Technical Requirements
Your capstone system must include all of the following components. Each requirement maps to one or more course modules, demonstrating that you have synthesized the material into a working system.
Requirement 1: Synthetic Dataset
Modules: 05, 12, 13
Create or curate a dataset of at least 1,000 examples suitable for fine-tuning or evaluation. The dataset must include train/validation/test splits with no data leakage between splits. If using synthetic generation, document the generation pipeline including the seed prompts, filtering criteria, and quality checks applied. Publish the dataset to Hugging Face Hub with a complete dataset card (description, intended use, limitations, licensing).
- Minimum 1,000 examples with train/val/test splits
- Documented generation or curation pipeline
- Quality filtering with rejection rate reported
- Published on Hugging Face Hub with dataset card
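One common source of leakage is near-duplicate examples landing in different splits. A minimal sketch of a leak-resistant split (the `split_dataset` helper and the `text` field name are illustrative, not required):

```python
import hashlib
import random

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Deduplicate by content hash, then shuffle and split.

    Hashing the text before splitting is one simple guard against leakage:
    an example that appears twice can never land in two different splits.
    """
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(ex["text"].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    rng = random.Random(seed)
    rng.shuffle(unique)
    n = len(unique)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return {
        "test": unique[:n_test],
        "validation": unique[n_test:n_test + n_val],
        "train": unique[n_test + n_val:],
    }

# Toy run: 11 raw examples with one duplicate -> 10 unique, three disjoint splits.
data = [{"text": f"example {i}"} for i in range(10)] + [{"text": "example 0"}]
splits = split_dataset(data)
```

A real pipeline would add fuzzy deduplication (e.g. MinHash) on top of exact hashing, but the principle is the same: deduplicate first, split second.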
Requirement 2: Fine-Tuned Model
Modules: 12, 13, 14, 15
Fine-tune or adapt a language model using at least one technique from the course: full fine-tuning, LoRA/QLoRA, prompt tuning, or prefix tuning. Report training hyperparameters, loss curves, and evaluation metrics on the held-out test set. Compare the fine-tuned model against the base model on at least 3 evaluation dimensions. Publish the model (or adapter weights) to Hugging Face Hub with a model card.
- At least one adaptation technique applied
- Training loss curves and hyperparameter documentation
- Comparison against base model on 3+ evaluation dimensions
- Published on Hugging Face Hub with model card
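The base-vs-fine-tuned comparison is easiest to present as per-dimension deltas. A small sketch with hypothetical metric names and scores (your dimensions will depend on your use case):

```python
def compare_models(base_scores, tuned_scores):
    """Report absolute and relative deltas per evaluation dimension."""
    report = {}
    for dim in base_scores:
        base, tuned = base_scores[dim], tuned_scores[dim]
        report[dim] = {
            "base": base,
            "fine_tuned": tuned,
            "delta": round(tuned - base, 4),
            "relative_pct": round(100 * (tuned - base) / base, 1),
        }
    return report

# Hypothetical scores on three dimensions, as the requirement asks for.
base = {"rouge_l": 0.31, "faithfulness": 0.72, "format_adherence": 0.58}
tuned = {"rouge_l": 0.38, "faithfulness": 0.79, "format_adherence": 0.91}
report = compare_models(base, tuned)
```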
Requirement 3: RAG System
Modules: 18, 19, 20
Implement a retrieval-augmented generation pipeline that includes document ingestion, chunking, embedding, vector storage, retrieval with at least one reranking step, and answer generation with inline citations. The system must handle at least 100 documents and demonstrate that retrieval improves answer quality compared to the model alone.
- Document ingestion and chunking pipeline
- Vector store with at least 100 documents indexed
- Retrieval with reranking (cross-encoder or similar)
- Citation generation linking claims to source documents
- Ablation: model-only vs. RAG quality comparison
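The pipeline stages above can be sketched end to end. This toy version uses bag-of-words "embeddings" and term overlap as a stand-in for a cross-encoder reranker; a real system would swap in a trained embedding model and reranker, and all function names here are illustrative:

```python
import math
from collections import Counter

def chunk(text, size=40, overlap=10):
    """Fixed-size word chunks with overlap -- the simplest chunking policy."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy bag-of-words 'embedding'; a real system uses a trained model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=3):
    """First-stage retrieval: top-k chunks by embedding similarity."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def rerank(query, candidates):
    """Stand-in for a cross-encoder: score candidates by exact term overlap."""
    q_terms = set(query.lower().split())
    return sorted(candidates,
                  key=lambda c: len(q_terms & set(c.lower().split())),
                  reverse=True)

docs = [
    "The warranty covers battery defects for two years from purchase.",
    "Shipping takes five business days within the EU.",
    "Returns are accepted within thirty days with the original receipt.",
]
chunks = [c for d in docs for c in chunk(d, size=12, overlap=2)]
query = "how long does the warranty cover the battery"
top = rerank(query, retrieve(query, chunks))
```

Keeping chunk-to-document provenance alongside each chunk is what makes the citation requirement tractable later.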
Requirement 4: Agent with Tools
Modules: 21, 22
Build an agent that can use at least 3 external tools (for example: search, calculator, database query, API call, code execution). The agent must demonstrate multi-step reasoning where the output of one tool informs the next action. Include error handling for tool failures and a maximum iteration limit to prevent infinite loops.
- At least 3 distinct tools integrated
- Multi-step reasoning with tool chaining
- Error handling and graceful degradation
- Maximum iteration safety limit
- Logging of agent trajectory (thought, action, observation)
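The loop structure these bullets describe can be sketched as follows. The planner here is a scripted stub standing in for an LLM, and the tool set is hypothetical; the points to copy are the iteration cap, the try/except around tool calls, and the thought/action/observation log:

```python
def run_agent(task, tools, plan, max_iters=5):
    """Minimal ReAct-style loop: the planner picks a tool, the result is
    logged, and tool failures or the iteration cap end the run safely."""
    trajectory = []
    observation = task
    for _ in range(max_iters):
        thought, tool_name, tool_input = plan(observation, trajectory)
        if tool_name == "finish":
            trajectory.append({"thought": thought, "action": "finish",
                               "observation": tool_input})
            return tool_input, trajectory
        try:
            observation = tools[tool_name](tool_input)
        except Exception as exc:  # graceful degradation on tool failure
            observation = f"tool_error: {exc}"
        trajectory.append({"thought": thought, "action": tool_name,
                           "observation": observation})
    return None, trajectory  # iteration cap hit without an answer

# Hypothetical tools and a scripted planner standing in for an LLM.
tools = {
    "search": lambda q: "ACME was founded in 1999",
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def plan(observation, trajectory):
    if not trajectory:
        return "Find the founding year", "search", "ACME founding year"
    if len(trajectory) == 1:
        # Output of the first tool informs the second call (tool chaining).
        year = observation.split()[-1]
        return "Compute company age in 2024", "calculator", f"2024 - {year}"
    return "Done", "finish", f"ACME is {observation} years old"

answer, trajectory = run_agent("How old is ACME in 2024?", tools, plan)
```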
Requirement 5: Deep Research Capability
Modules: 20, 21, 22
Implement a deep research or multi-hop retrieval feature where the system can answer complex questions that require synthesizing information from multiple sources. This could be a research agent that plans queries, retrieves from multiple knowledge bases, cross-references findings, and produces a structured synthesis with citations.
- Multi-source information synthesis
- Query planning and decomposition
- Cross-referencing across retrieved documents
- Structured output with citations from multiple sources
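The skeleton of the plan-retrieve-synthesize flow is small; the hard parts are the LLM planner and the retrievers, both stubbed out here (all names are illustrative). The key design point shown is carrying the source label with every finding so citations survive the synthesis step:

```python
def decompose(question):
    """Stub planner: a real system would ask an LLM to produce sub-queries."""
    return [
        "What revenue did the company report in 2023?",
        "What revenue did the company report in 2022?",
    ]

def research(question, retrievers):
    """Plan sub-queries, retrieve each from every source, and keep
    provenance with every finding for later cross-referencing."""
    findings = []
    for sub in decompose(question):
        for source, search in retrievers.items():
            for passage in search(sub):
                findings.append({"sub_query": sub, "source": source,
                                 "passage": passage})
    return findings

# Two hypothetical knowledge sources with toy keyword-based search.
retrievers = {
    "filings_db": lambda q: (["Revenue 2023: $12M"] if "2023" in q
                             else ["Revenue 2022: $9M"]),
    "news_index": lambda q: (["Press release confirms 2023 revenue of $12M"]
                             if "2023" in q else []),
}
findings = research("How did revenue change from 2022 to 2023?", retrievers)
sources_used = {f["source"] for f in findings}
```

Cross-referencing then becomes a comparison over findings that share a sub-query but come from different sources.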
Requirement 6: Production Deployment
Modules: 26, 27
Deploy the system as a running service (cloud VM, container, serverless, or equivalent). The deployment must include a health check endpoint, structured logging, and the ability to handle at least 5 concurrent requests. Provide deployment instructions that allow a reviewer to reproduce the deployment.
- Running service accessible via API or web interface
- Health check endpoint
- Structured logging (JSON format recommended)
- Support for concurrent requests
- Reproducible deployment instructions (Docker preferred)
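The health check and API surface depend on your framework of choice, but structured logging can be shown framework-free. A minimal JSON formatter using only the standard library (the `capstone` logger name is illustrative):

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, as the requirement recommends."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

# Wire the formatter to an in-memory stream so the output is inspectable.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("capstone")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("request served")
entry = json.loads(stream.getvalue())
```

In production you would also include a timestamp and a request ID in each record so log lines can be correlated across concurrent requests.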
Requirement 7: Security and Safety
Module: 26
Implement at least 3 safety mechanisms, including input validation (prompt injection defense), output filtering (toxicity detection and PII redaction), and rate limiting. Document the threat model for your specific use case, explaining which risks are mitigated and which are accepted, with justification for each accepted risk.
- Input validation with prompt injection defense
- Output filtering (toxicity detection, PII redaction)
- Rate limiting
- Documented threat model
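Of the three mechanisms, rate limiting is the most self-contained to illustrate. A token-bucket sketch with an injectable clock so the behavior is deterministic (class and parameter names are illustrative):

```python
import time

class TokenBucket:
    """Per-client rate limiter: refills `rate` tokens/second, bursts up to
    `capacity`. One bucket per API key or client IP is typical."""
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Fake clock makes the behavior testable without real waiting.
t = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: t[0])
first, second, third = bucket.allow(), bucket.allow(), bucket.allow()
t[0] = 1.0  # one second later, one token has refilled
fourth = bucket.allow()
```

Input validation and output filtering, by contrast, usually combine pattern-based checks with model-based classifiers, and belong in your threat model discussion.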
Requirement 8: Evaluation Suite
Module: 25
Design and run a comprehensive evaluation suite that covers at least 4 of the following: automated metrics (BLEU, ROUGE, BERTScore), LLM-as-Judge evaluation, human evaluation (at least 50 examples), RAG-specific metrics (context relevance, faithfulness, answer relevance), agent trajectory evaluation, latency and throughput benchmarks, and adversarial testing (red teaming).
- At least 4 evaluation methods from the list above
- Statistical reporting (confidence intervals or significance tests)
- Comparison across at least 2 system configurations
- Results presented in tables and visualizations
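For the statistical-reporting bullet, a percentile bootstrap is one simple way to attach a confidence interval to a mean score without distributional assumptions. The scores below are hypothetical per-example correctness judgments:

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean score."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return statistics.fmean(scores), (lo, hi)

# Hypothetical per-example scores from an LLM-as-Judge run (1 = correct).
scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
mean, (lo, hi) = bootstrap_ci(scores)
```

Reporting "0.75 (95% CI 0.55-0.90)" instead of a bare 0.75 is exactly the kind of honest reporting evaluators look for, especially when comparing two system configurations whose intervals may overlap.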
Requirement 9: Hybrid Architecture
Modules: 09, 10, 14, 19
The system must use at least 2 different models or model configurations in a meaningful way. Examples: a small model for classification/routing plus a large model for generation; a fine-tuned model for domain tasks plus a general model for open-ended questions; an embedding model plus a generation model with a reranker in between.
- At least 2 distinct models used for different purposes
- Clear justification for why each model is used
- Cost and latency comparison across the hybrid configuration
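The small-model-routes, large-model-generates pattern reduces to a routing function. This sketch uses a crude length-and-keyword heuristic where a trained classifier would normally sit; the stub models and marker words are all hypothetical:

```python
def route(query, cheap_model, strong_model):
    """Send short factual lookups to the cheap model, open-ended or
    multi-step questions to the strong one."""
    complex_markers = ("why", "compare", "explain", "step")
    if len(query.split()) > 15 or any(m in query.lower() for m in complex_markers):
        return strong_model(query), "strong"
    return cheap_model(query), "cheap"

# Stubs standing in for, e.g., a small local model and a large API model.
cheap = lambda q: f"[small-model] {q}"
strong = lambda q: f"[large-model] {q}"

_, tier_a = route("capital of France", cheap, strong)
_, tier_b = route("Compare the two retrieval strategies and explain the tradeoffs",
                  cheap, strong)
```

Logging which tier served each request gives you the per-tier cost and latency data the comparison bullet asks for.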
Requirement 10: ROI Analysis
Module: 27
Provide a business case for the system including cost breakdown (compute, API, development, maintenance), estimated value (labor savings, quality improvement, revenue impact), ROI calculation, and payback period. Use realistic assumptions and document all sources for cost and value estimates.
- Itemized cost breakdown (development, infrastructure, API, maintenance)
- Value estimation with documented assumptions
- 12-month ROI calculation
- Payback period analysis
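The core arithmetic is straightforward once the estimates are pinned down. A sketch with purely illustrative numbers (every figure in your report must come from documented sources):

```python
def roi_analysis(monthly_cost, monthly_value, upfront_cost, horizon_months=12):
    """12-month ROI percentage and payback period in months."""
    total_cost = upfront_cost + monthly_cost * horizon_months
    total_value = monthly_value * horizon_months
    roi_pct = 100 * (total_value - total_cost) / total_cost
    net_monthly = monthly_value - monthly_cost
    payback_months = (upfront_cost / net_monthly
                      if net_monthly > 0 else float("inf"))
    return round(roi_pct, 1), round(payback_months, 1)

# Illustrative: $30k development, $2k/month to run, $6k/month in labor savings.
roi, payback = roi_analysis(monthly_cost=2_000, monthly_value=6_000,
                            upfront_cost=30_000)
```

Running the same function over optimistic, expected, and pessimistic estimates is a cheap way to show your assumptions' sensitivity.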
Requirement 11: Risk and Governance
Module: 26
Document the risks specific to your system (hallucination, bias, privacy, security, operational) and the governance framework you would recommend for production operation. Include an incident response plan for the most likely failure modes and a model update policy.
- Risk register with likelihood and impact ratings
- Governance framework recommendation
- Incident response plan for top 3 failure modes
- Model update and version management policy
Not every requirement needs to be implemented to the same depth. Choose 3 to 4 requirements that align most closely with your use case and implement them deeply; the remaining requirements must still work at a basic, demonstrable level. The technical report should explain your prioritization decisions and why certain components received more investment than others.
C.2: Deliverables
Deliverable 1: GitHub Repository
The repository is the primary artifact of the capstone. It must contain all source code, configuration files, and documentation needed to understand, reproduce, and evaluate the system.
| Component | Required Contents |
|---|---|
| README.md | Project overview, architecture diagram, setup instructions, usage examples |
| src/ | Clean, documented source code with type hints and docstrings |
| tests/ | Unit tests and integration tests with at least 60% code coverage |
| evaluation/ | Evaluation scripts, metrics collection, and result analysis notebooks |
| deployment/ | Dockerfile, docker-compose.yml, or equivalent deployment configs |
| docs/ | Architecture decision records, API documentation |
| data/ | Sample data (not full dataset; link to HF Hub for full version) |
| configs/ | Model configs, prompt templates, environment variables template |
Deliverable 2: Hugging Face Hub Artifacts
Publish two artifacts to Hugging Face Hub: the fine-tuned model (or adapter weights) and the curated/synthetic dataset. Both must include complete cards following Hugging Face conventions.
- Model card: model description, intended use, training procedure, evaluation results, limitations, bias analysis, compute resources used
- Dataset card: dataset description, creation process, column descriptions, data splits, intended use, known limitations, licensing information
Publishing to Hugging Face Hub is not just a deliverable checkbox. It demonstrates that your model and dataset are reproducible, documented, and usable by others. Incomplete model cards or datasets without proper documentation will receive reduced scores even if the underlying artifacts are technically sound.
Deliverable 3: Technical Report
The technical report is an 8 to 12 page document (excluding appendices) that covers the full lifecycle of the project. It is structured as follows:
- Introduction (1 page): Problem statement, use case description, target users, and success criteria
- Related Work (0.5 page): Brief survey of similar systems or approaches
- Architecture (2 pages): System design, component interactions, technology choices with justification
- Data and Model (1.5 pages): Dataset creation, model selection, training procedure, adaptation technique
- Evaluation (2 pages): Methodology, results tables, visualizations, statistical analysis
- Deployment and Operations (1 page): Infrastructure, monitoring, security measures
- Business Case (1 page): ROI analysis, cost breakdown, risk governance
- Limitations and Future Work (1 page): Honest assessment of weaknesses, failure modes, and planned improvements
The Limitations section is one of the most important parts of the report. Projects that claim everything works perfectly are less credible than those that honestly identify what does not work and explain why. Evaluators specifically look for depth of understanding in the limitations analysis. Describe at least 3 concrete failure modes you discovered during development and how you would address them with more time.
Deliverable 4: Interpretability Analysis
Provide an interpretability analysis of your model's behavior. This can take one or more of the following forms:
- Attention analysis: Visualize attention patterns for representative inputs and explain what the model focuses on
- Token attribution: Use integrated gradients, SHAP, or similar methods to identify which input tokens most influence the output
- Probing experiments: Test whether intermediate representations encode specific features (entity types, sentiment, factual knowledge)
- Behavioral testing: Systematically test model behavior across controlled input variations (CheckList-style testing)
- Error analysis: Categorize failure modes and identify patterns in inputs that cause errors
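As a sense of the reporting format, here is a leave-one-out attribution sketch: drop each input token and measure how much the model's score falls. This is far cruder than integrated gradients or SHAP, and the toy scorer stands in for a real model, but the output shape (a per-token influence map) is the same:

```python
def token_attribution(tokens, score_fn):
    """Leave-one-out attribution: base score minus the score with the
    token removed. Positive values mean the token supports the output."""
    base = score_fn(tokens)
    return {
        tok: round(base - score_fn(tokens[:i] + tokens[i + 1:]), 3)
        for i, tok in enumerate(tokens)
    }

# Toy scorer standing in for a real model: fraction of sentiment-bearing words.
positive = {"great", "excellent"}
score = lambda toks: sum(t in positive for t in toks) / max(len(toks), 1)

attributions = token_attribution(["the", "food", "was", "great"], score)
```

For a real model, `score_fn` would be the probability the model assigns to its own prediction, and you would aggregate attributions over many representative inputs rather than a single example.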
Deliverable 5: Live Demo
Provide a live, accessible demo of the system, or a recorded screencast (5 to 10 minutes) walking through the key features. The demo must show:
- A simple query that demonstrates basic functionality
- A complex, multi-step query that exercises the agent and RAG components
- A failure case showing how the system handles errors gracefully
- The monitoring or observability dashboard (if applicable)
Deliverable 6: Presentation
Prepare and deliver a 15-minute presentation (12 minutes of content plus 3 minutes for questions). The presentation should be accessible to a mixed audience of technical engineers and business stakeholders.
| Section | Time | Content |
|---|---|---|
| Motivation | 2 min | Problem, use case, why it matters |
| Architecture | 3 min | System design, key technology choices |
| Demo | 3 min | Live or recorded walkthrough |
| Evaluation | 2 min | Key results, comparison metrics |
| Lessons Learned | 2 min | What worked, what failed, what you would change |
| Q&A | 3 min | Questions from evaluators |
Scoring Rubric
| Category | Weight | What Evaluators Look For |
|---|---|---|
| System Integration | 25% | All components work together; clean interfaces; no manual glue |
| Evaluation Rigor | 20% | Multiple evaluation methods; statistical analysis; honest reporting |
| Technical Depth | 20% | At least 3 requirements implemented deeply; strong code quality |
| Production Readiness | 15% | Deployment, monitoring, safety, error handling |
| Documentation | 10% | Technical report quality; model/dataset cards; code documentation |
| Presentation | 10% | Clear communication; appropriate level for mixed audience |
System Integration carries the highest weight (25%) because it is the hardest skill to develop and the most valuable in practice. An excellent RAG system that cannot connect to the agent, or a well-trained model that has no evaluation suite, demonstrates component-level skill but not the systems thinking that production work requires.
Example Project Ideas
The following are starting points for inspiration. Strong projects adapt these ideas to a specific domain or add unique twists that demonstrate personal technical depth.
| Project | Domain | Key Technical Challenges |
|---|---|---|
| Legal Document Analyst | Legal Tech | Long-context RAG, citation accuracy, hallucination prevention |
| Code Review Assistant | Developer Tools | Multi-file context, code-specific evaluation, tool use (linters, tests) |
| Medical Triage Chatbot | Healthcare | Safety-critical outputs, evidence-based citations, regulatory compliance |
| Financial Research Agent | Finance | Multi-source research, numerical reasoning, real-time data tools |
| Educational Tutor | EdTech | Adaptive difficulty, misconception detection, Socratic dialogue |
| Customer Support System | SaaS | Ticket routing, knowledge base RAG, escalation logic, CSAT measurement |
Choose a project where you have genuine interest or domain expertise. The capstone requires sustained effort over 4 to 6 weeks, and projects driven by curiosity consistently produce better results than projects chosen purely for technical impressiveness. Your domain knowledge will also make the evaluation more meaningful because you can assess whether the system's outputs are actually correct and useful.