Capstone · C.1 & C.2

Requirements & Deliverables

Detailed technical requirements for the capstone system and specifications for each deliverable
★ Big Picture

The capstone is not a collection of independent components; it is a single integrated system where every piece must work together. The requirements below define the minimum bar for each component. Strong projects will go beyond the minimum in areas where the chosen use case demands it. The scoring rubric weights integration and evaluation more heavily than any individual component, because the ability to make an entire system work end-to-end is the most valuable skill this course develops.

C.1: Technical Requirements

Your capstone system must include all of the following components. Each requirement maps to one or more course modules, demonstrating that you have synthesized the material into a working system.

Requirement 1: Synthetic Dataset

Modules: 05, 12, 13

Create or curate a dataset of at least 1,000 examples suitable for fine-tuning or evaluation. The dataset must include train/validation/test splits with no data leakage between splits. If using synthetic generation, document the generation pipeline including the seed prompts, filtering criteria, and quality checks applied. Publish the dataset to Hugging Face Hub with a complete dataset card (description, intended use, limitations, licensing).
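A simple way to guarantee leakage-free splits is to deduplicate by content hash before shuffling and partitioning, so an identical example can never land in two splits. A minimal sketch (the `dedupe_and_split` helper and the `prompt`/`response` field names are illustrative, not part of the requirement):

```python
import hashlib
import json
import random

def dedupe_and_split(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Deduplicate examples by content hash, then split.

    Hashing before splitting guarantees the same example can never
    appear in two splits (no train/validation/test leakage).
    """
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(json.dumps(ex, sort_keys=True).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    rng = random.Random(seed)  # fixed seed so splits are reproducible
    rng.shuffle(unique)
    n_test = int(len(unique) * test_frac)
    n_val = int(len(unique) * val_frac)
    return {
        "test": unique[:n_test],
        "validation": unique[n_test:n_test + n_val],
        "train": unique[n_test + n_val:],
    }

data = [{"prompt": f"q{i}", "response": f"a{i}"} for i in range(1000)]
data += data[:50]  # simulate accidental duplicates from synthetic generation
splits = dedupe_and_split(data)
```

For synthetic data this matters doubly: generation pipelines frequently emit near-identical examples, and a duplicate straddling the train/test boundary silently inflates evaluation scores.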

Requirement 2: Fine-Tuned Model

Modules: 12, 13, 14, 15

Fine-tune or adapt a language model using at least one technique from the course: full fine-tuning, LoRA/QLoRA, prompt tuning, or prefix tuning. Report training hyperparameters, loss curves, and evaluation metrics on the held-out test set. Compare the fine-tuned model against the base model on at least 3 evaluation dimensions. Publish the model (or adapter weights) to Hugging Face Hub with a model card.
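To see why adapter methods like LoRA are attractive here, it helps to count trainable parameters. A back-of-envelope sketch, assuming a 7B-parameter model with 32 layers, hidden size 4096, and rank-16 adapters on the four attention projections (all illustrative numbers, not prescribed values):

```python
def lora_trainable_params(d_model, n_layers, matrices_per_layer, rank):
    """Parameters added by LoRA adapters.

    Each adapted d x d weight W gains two low-rank factors,
    A (d x r) and B (r x d), so 2 * r * d trainable parameters.
    """
    return 2 * rank * d_model * matrices_per_layer * n_layers

base_params = 7_000_000_000
adapter_params = lora_trainable_params(4096, 32, 4, 16)  # q, k, v, o projections
fraction = adapter_params / base_params  # roughly 0.24% of the base model
```

The adapter trains about 16.8M parameters, under 0.3% of the base model, which is why reporting adapter size alongside hyperparameters gives reviewers a quick sense of the compute budget.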

Requirement 3: RAG System

Modules: 18, 19, 20

Implement a retrieval-augmented generation pipeline that includes document ingestion, chunking, embedding, vector storage, retrieval with at least one reranking step, and answer generation with inline citations. The system must handle at least 100 documents and demonstrate that retrieval improves answer quality compared to the model alone.
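The two-stage shape of the pipeline (broad vector retrieval, then a narrower rerank) can be sketched with toy components. Here the bag-of-words embedding and the term-overlap reranker are deliberate stand-ins; a real system would use an embedding model, a vector database, and a cross-encoder reranker:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; swap in a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=4, rerank_k=2):
    q = embed(query)
    # Stage 1: vector similarity over all chunks (the "recall" stage).
    candidates = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]
    # Stage 2: rerank the candidates (term overlap as a stand-in for a
    # cross-encoder, which scores query-chunk pairs jointly).
    def overlap(c):
        return len(set(query.lower().split()) & set(c.lower().split()))
    return sorted(candidates, key=overlap, reverse=True)[:rerank_k]

chunks = [
    "paris is the capital of france",
    "the moon orbits the earth",
    "france borders spain",
]
top = retrieve("capital of france", chunks, k=3, rerank_k=1)
```

Keeping retrieval behind a single function boundary like this also makes the "retrieval vs. model-alone" comparison required above easy to run: call generation with and without the retrieved context.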

Requirement 4: Agent with Tools

Modules: 21, 22

Build an agent that can use at least 3 external tools (for example: search, calculator, database query, API call, code execution). The agent must demonstrate multi-step reasoning where the output of one tool informs the next action. Include error handling for tool failures and a maximum iteration limit to prevent infinite loops.
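The control loop behind these constraints is small: a hard iteration cap, tool errors fed back as observations rather than crashes, and state threaded between steps so one tool's output can inform the next call. A minimal sketch where `plan` stands in for the LLM's next-action decision (the names here are illustrative):

```python
def run_agent(plan, tools, max_iters=5):
    """Run a tool-use loop.

    `plan` maps the current state to (tool_name, arg) or None (done);
    `tools` maps tool names to callables.
    """
    state = {"history": [], "result": None}
    for _ in range(max_iters):          # hard cap prevents infinite loops
        action = plan(state)
        if action is None:              # the agent decides it is finished
            break
        name, arg = action
        try:
            out = tools[name](arg)
        except Exception as exc:        # tool failure becomes an observation
            out = f"tool_error: {exc}"
        state["history"].append((name, arg, out))
        state["result"] = out
    return state

# Example: two chained calls where step 2 consumes step 1's output.
def _plan(state):
    if not state["history"]:
        return ("add", [1, 2])
    if len(state["history"]) == 1:
        return ("add", [state["result"], 10])
    return None

result = run_agent(_plan, {"add": sum})
```

In a real agent, `plan` is an LLM call that reads the history; the loop structure and the failure handling stay the same.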

Requirement 5: Deep Research Capability

Modules: 20, 21, 22

Implement a deep research or multi-hop retrieval feature where the system can answer complex questions that require synthesizing information from multiple sources. This could be a research agent that plans queries, retrieves from multiple knowledge bases, cross-references findings, and produces a structured synthesis with citations.
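The multi-hop pattern reduces to a loop in which each hop's findings shape the next sub-query. In this sketch, `next_query`, `retrieve`, and `synthesize` are stand-ins for the planner LLM, the retriever, and the synthesis step respectively; the keyword-matching retriever exists only to make the example self-contained:

```python
def multi_hop(question, next_query, retrieve, synthesize, max_hops=3):
    """Chain retrievals: each hop's findings inform the next sub-query."""
    findings, query = [], question
    for _ in range(max_hops):           # bounded, like the agent loop
        findings.extend(retrieve(query))
        query = next_query(question, findings)
        if query is None:               # planner decides enough is known
            break
    return synthesize(question, findings)

def _retrieve(q):
    if "capital of France" in q:
        return ["The capital of France is Paris."]
    if "population of Paris" in q:
        return ["Paris has about 2.1 million residents."]
    return []

def _next_query(question, findings):
    # After the first hop we know the capital; ask the follow-up.
    return "population of Paris" if len(findings) == 1 else None

answer = multi_hop(
    "How many people live in the capital of France?",
    _next_query, _retrieve,
    lambda q, facts: " ".join(facts),
)
```

The key property to demonstrate is the dependency between hops: the second query cannot be formed until the first retrieval has resolved "the capital of France" to Paris.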

Requirement 6: Production Deployment

Modules: 26, 27

Deploy the system as a running service (cloud VM, container, serverless, or equivalent). The deployment must include a health check endpoint, structured logging, and the ability to handle at least 5 concurrent requests. Provide deployment instructions that allow a reviewer to reproduce the deployment.
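Two of these pieces are framework-agnostic and small enough to sketch: a health payload and JSON-structured logging. A minimal sketch using the standard library (`JsonFormatter` and `health_check` are illustrative names; any web framework can wrap `health_check` in a `/health` route):

```python
import json
import logging
import time

START = time.time()

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so logs are machine-parseable."""
    def format(self, record):
        return json.dumps({
            "ts": round(time.time(), 3),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        })

def health_check():
    """Payload for a /health endpoint, framework-agnostic."""
    return {"status": "ok", "uptime_s": round(time.time() - START, 1)}

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
log = logging.getLogger("capstone")
log.addHandler(handler)
log.setLevel(logging.INFO)
```

Structured logs pay off at review time: a reviewer can grep or query them, which is much harder with free-form print statements.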

Requirement 7: Security and Safety

Module: 26

Implement at least 3 safety mechanisms: input validation (prompt injection defense), output filtering (toxicity, PII redaction), and rate limiting. Document the threat model for your specific use case and explain which risks are mitigated and which are accepted with justification.
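Two of the three mechanisms have compact reference shapes: regex-based PII redaction and a token-bucket rate limiter. A minimal sketch (the patterns below catch only simple email and US-style phone formats and are a starting point, not a complete PII defense):

```python
import re
import time

EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def redact_pii(text):
    """Replace obvious PII spans before text is logged or returned."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

class TokenBucket:
    """Allow `rate` requests per second with bursts up to `capacity`."""
    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Prompt-injection defense is harder to reduce to a snippet; document it in the threat model instead, including which injection classes your input validation does and does not catch.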

Requirement 8: Evaluation Suite

Module: 25

Design and run a comprehensive evaluation suite that covers at least 4 of the following: automated metrics (BLEU, ROUGE, BERTScore), LLM-as-Judge evaluation, human evaluation (at least 50 examples), RAG-specific metrics (context relevance, faithfulness, answer relevance), agent trajectory evaluation, latency and throughput benchmarks, and adversarial testing (red teaming).
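For the automated-metric option, even a hand-rolled baseline clarifies what the metric measures before you reach for a library. A minimal ROUGE-1 F1 sketch over unigram overlap (production evaluations should use a maintained implementation such as the `rouge-score` package, since official ROUGE also defines stemming and ROUGE-2/L variants):

```python
from collections import Counter

def rouge1_f(candidate, reference):
    """Unigram-overlap ROUGE-1 F1 between candidate and reference text."""
    c = Counter(candidate.lower().split())
    r = Counter(reference.lower().split())
    overlap = sum((c & r).values())      # clipped unigram matches
    if not overlap:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

Seeing that "the cat" scores 0.8 against "the cat sat" (perfect precision, two-thirds recall) makes the metric's blind spots concrete, which is exactly the kind of observation the evaluation section should discuss.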

Requirement 9: Hybrid Architecture

Modules: 09, 10, 14, 19

The system must use at least 2 different models or model configurations in a meaningful way. Examples: a small model for classification/routing plus a large model for generation; a fine-tuned model for domain tasks plus a general model for open-ended questions; an embedding model plus a generation model with a reranker in between.

Requirement 10: ROI Analysis

Module: 27

Provide a business case for the system including cost breakdown (compute, API, development, maintenance), estimated value (labor savings, quality improvement, revenue impact), ROI calculation, and payback period. Use realistic assumptions and document all sources for cost and value estimates.
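The core arithmetic of the ROI section fits in a few lines. A sketch with purely illustrative figures; substitute your own documented cost and value estimates:

```python
def roi_analysis(upfront_cost, monthly_cost, monthly_value, horizon_months=12):
    """ROI over a fixed horizon, plus payback period on the upfront spend."""
    total_cost = upfront_cost + monthly_cost * horizon_months
    total_value = monthly_value * horizon_months
    roi = (total_value - total_cost) / total_cost
    net_monthly = monthly_value - monthly_cost
    payback_months = (upfront_cost / net_monthly
                      if net_monthly > 0 else float("inf"))
    return {"roi": roi, "payback_months": payback_months}

# Illustrative: $20k development cost, $1k/month to run, $6k/month of value.
case = roi_analysis(upfront_cost=20_000, monthly_cost=1_000, monthly_value=6_000)
```

With these inputs the system returns 125% over the first year and pays back its development cost in four months; the report should show the same calculation with sourced numbers and a sensitivity range on the value estimate.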

Requirement 11: Risk and Governance

Module: 26

Document the risks specific to your system (hallucination, bias, privacy, security, operational) and the governance framework you would recommend for production operation. Include an incident response plan for the most likely failure modes and a model update policy.

📝 Note

Not every requirement needs to be implemented to the same depth. Choose 3 to 4 requirements that align most closely with your use case and implement them deeply. The remaining requirements should be addressed at least at a demonstrable level. The technical report should explain your prioritization decisions and why certain components received more investment than others.

[Figure: reference architecture diagram. A user or API client reaches an API gateway with rate limiting (R7), which forwards to an agent router (R4, R5, R9). The router dispatches to the RAG pipeline (R3) backed by a vector DB and reranker, to the fine-tuned model (R2, R9) for generation with citations, and to the tool executor (R4) for search, calculator, database, and API tools. The evaluation suite (R8) and monitoring/logging (R6) wrap the runtime path, with the dataset (R1), ROI analysis (R10), and governance (R11) as supporting artifacts.]
Figure C.1: Reference architecture for the capstone system showing requirement coverage

C.2: Deliverables

Deliverable 1: GitHub Repository

The repository is the primary artifact of the capstone. It must contain all source code, configuration files, and documentation needed to understand, reproduce, and evaluate the system.

| Component | Required Contents |
| --- | --- |
| README.md | Project overview, architecture diagram, setup instructions, usage examples |
| src/ | Clean, documented source code with type hints and docstrings |
| tests/ | Unit tests and integration tests with at least 60% code coverage |
| evaluation/ | Evaluation scripts, metrics collection, and result analysis notebooks |
| deployment/ | Dockerfile, docker-compose.yml, or equivalent deployment configs |
| docs/ | Architecture decision records, API documentation |
| data/ | Sample data (not the full dataset; link to HF Hub for the full version) |
| configs/ | Model configs, prompt templates, environment variables template |

Deliverable 2: Hugging Face Hub Artifacts

Publish two artifacts to Hugging Face Hub: the fine-tuned model (or adapter weights) and the curated/synthetic dataset. Both must include complete cards following Hugging Face conventions.

⚡ Key Insight

Publishing to Hugging Face Hub is not just a deliverable checkbox. It demonstrates that your model and dataset are reproducible, documented, and usable by others. Incomplete model cards or datasets without proper documentation will receive reduced scores even if the underlying artifacts are technically sound.

Deliverable 3: Technical Report

The technical report is an 8 to 12 page document (excluding appendices) that covers the full lifecycle of the project. It is structured as follows:

  1. Introduction (1 page): Problem statement, use case description, target users, and success criteria
  2. Related Work (0.5 page): Brief survey of similar systems or approaches
  3. Architecture (2 pages): System design, component interactions, technology choices with justification
  4. Data and Model (1.5 pages): Dataset creation, model selection, training procedure, adaptation technique
  5. Evaluation (2 pages): Methodology, results tables, visualizations, statistical analysis
  6. Deployment and Operations (1 page): Infrastructure, monitoring, security measures
  7. Business Case (1 page): ROI analysis, cost breakdown, risk governance
  8. Limitations and Future Work (1 page): Honest assessment of weaknesses, failure modes, and planned improvements
⚠ Warning

The Limitations section is one of the most important parts of the report. Projects that claim everything works perfectly are less credible than those that honestly identify what does not work and explain why. Evaluators specifically look for depth of understanding in the limitations analysis. Describe at least 3 concrete failure modes you discovered during development and how you would address them with more time.

Deliverable 4: Interpretability Analysis

Provide an interpretability analysis of your model's behavior. One or more forms of analysis are acceptable; document the method you choose and what it reveals about how the model succeeds and fails on your use case.

Deliverable 5: Live Demo

Provide a live, accessible demo of the system OR a recorded screencast (5 to 10 minutes) walking through the key features. Either format must show the real system running end-to-end on your primary use case, not mock-ups or slides.

Deliverable 6: Presentation

Prepare and deliver a 15-minute presentation (12 minutes of content plus 3 minutes for questions). The presentation should be accessible to a mixed audience of technical engineers and business stakeholders.

| Section | Time | Content |
| --- | --- | --- |
| Motivation | 2 min | Problem, use case, why it matters |
| Architecture | 3 min | System design, key technology choices |
| Demo | 3 min | Live or recorded walkthrough |
| Evaluation | 2 min | Key results, comparison metrics |
| Lessons Learned | 2 min | What worked, what failed, what you would change |
| Q&A | 3 min | Questions from evaluators |

Scoring Rubric

| Category | Weight | What Evaluators Look For |
| --- | --- | --- |
| System Integration | 25% | All components work together; clean interfaces; no manual glue |
| Evaluation Rigor | 20% | Multiple evaluation methods; statistical analysis; honest reporting |
| Technical Depth | 20% | At least 3 requirements implemented deeply; strong code quality |
| Production Readiness | 15% | Deployment, monitoring, safety, error handling |
| Documentation | 10% | Technical report quality; model/dataset cards; code documentation |
| Presentation | 10% | Clear communication; appropriate level for mixed audience |
📝 Note

System Integration carries the highest weight (25%) because it is the hardest skill to develop and the most valuable in practice. An excellent RAG system that cannot connect to the agent, or a well-trained model that has no evaluation suite, demonstrates component-level skill but not the systems thinking that production work requires.

Example Project Ideas

The following are starting points for inspiration. Strong projects adapt these ideas to a specific domain or add unique twists that demonstrate personal technical depth.

| Project | Domain | Key Technical Challenges |
| --- | --- | --- |
| Legal Document Analyst | Legal Tech | Long-context RAG, citation accuracy, hallucination prevention |
| Code Review Assistant | Developer Tools | Multi-file context, code-specific evaluation, tool use (linters, tests) |
| Medical Triage Chatbot | Healthcare | Safety-critical outputs, evidence-based citations, regulatory compliance |
| Financial Research Agent | Finance | Multi-source research, numerical reasoning, real-time data tools |
| Educational Tutor | EdTech | Adaptive difficulty, misconception detection, Socratic dialogue |
| Customer Support System | SaaS | Ticket routing, knowledge base RAG, escalation logic, CSAT measurement |
⚡ Key Insight

Choose a project where you have genuine interest or domain expertise. The capstone requires sustained effort over 4 to 6 weeks, and projects driven by curiosity consistently produce better results than projects chosen purely for technical impressiveness. Your domain knowledge will also make the evaluation more meaningful because you can assess whether the system's outputs are actually correct and useful.