The capstone is not a collection of independent components; it is a single integrated system where every piece must work together. The requirements below define the minimum bar for each component. Strong projects will go beyond the minimum in areas where the chosen use case demands it. The scoring rubric weights integration and evaluation more heavily than any individual component, because the ability to make an entire system work end-to-end is the most valuable skill this course develops.
C.1: Technical Requirements
Your capstone system must include all of the following components. Each requirement maps to one or more course modules, demonstrating that you have synthesized the material into a working system.
Requirement 1: Synthetic Dataset
Modules: 05, 12, 13
Create or curate a dataset of at least 1,000 examples suitable for fine-tuning or evaluation. The dataset must include train/validation/test splits with no data leakage between splits. If using synthetic generation, document the generation pipeline including the seed prompts, filtering criteria, and quality checks applied. Publish the dataset to Hugging Face Hub with a complete dataset card (description, intended use, limitations, licensing).
- Minimum 1,000 examples with train/val/test splits
- Documented generation or curation pipeline
- Quality filtering with rejection rate reported
- Published on Hugging Face Hub with dataset card
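One common source of leakage is near-duplicate examples landing in different splits. A minimal sketch of a leak-resistant split (the `split_dataset` helper and the `text` field name are illustrative, not required):

```python
import hashlib
import random

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Deduplicate by content hash, then shuffle and split.

    Hashing the text before splitting is one simple guard against leakage:
    an example that appears twice can never land in two different splits.
    """
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(ex["text"].encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    rng = random.Random(seed)
    rng.shuffle(unique)
    n = len(unique)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return {
        "test": unique[:n_test],
        "validation": unique[n_test:n_test + n_val],
        "train": unique[n_test + n_val:],
    }

# Toy run: 11 raw examples with one duplicate -> 10 unique, three disjoint splits.
data = [{"text": f"example {i}"} for i in range(10)] + [{"text": "example 0"}]
splits = split_dataset(data)
```

A real pipeline would add fuzzy deduplication (e.g. MinHash) on top of exact hashing, but the principle is the same: deduplicate first, split second.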
Requirement 2: Fine-Tuned Model
Modules: 12, 13, 14, 15
Fine-tune or adapt a language model using at least one technique from the course: full fine-tuning, LoRA/QLoRA, prompt tuning, or prefix tuning. Report training hyperparameters, loss curves, and evaluation metrics on the held-out test set. Compare the fine-tuned model against the base model on at least 3 evaluation dimensions. Publish the model (or adapter weights) to Hugging Face Hub with a model card.
- At least one adaptation technique applied
- Training loss curves and hyperparameter documentation
- Comparison against base model on 3+ evaluation dimensions
- Published on Hugging Face Hub with model card
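The base-vs-fine-tuned comparison is easiest to present as per-dimension deltas. A small sketch with hypothetical metric names and scores (your dimensions will depend on your use case):

```python
def compare_models(base_scores, tuned_scores):
    """Report absolute and relative deltas per evaluation dimension."""
    report = {}
    for dim in base_scores:
        base, tuned = base_scores[dim], tuned_scores[dim]
        report[dim] = {
            "base": base,
            "fine_tuned": tuned,
            "delta": round(tuned - base, 4),
            "relative_pct": round(100 * (tuned - base) / base, 1),
        }
    return report

# Hypothetical scores on three dimensions, as the requirement asks for.
base = {"rouge_l": 0.31, "faithfulness": 0.72, "format_adherence": 0.58}
tuned = {"rouge_l": 0.38, "faithfulness": 0.79, "format_adherence": 0.91}
report = compare_models(base, tuned)
```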
Requirement 3: RAG System
Modules: 18, 19, 20
Implement a retrieval-augmented generation pipeline that includes document ingestion, chunking, embedding, vector storage, retrieval with at least one reranking step, and answer generation with inline citations. The system must handle at least 100 documents and demonstrate that retrieval improves answer quality compared to the model alone.
- Document ingestion and chunking pipeline
- Vector store with at least 100 documents indexed
- Retrieval with reranking (cross-encoder or similar)
- Citation generation linking claims to source documents
- Ablation: model-only vs. RAG quality comparison
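The pipeline stages above can be sketched end to end. This toy version uses bag-of-words "embeddings" and term overlap as a stand-in for a cross-encoder reranker; a real system would swap in a trained embedding model and reranker, and all function names here are illustrative:

```python
import math
from collections import Counter

def chunk(text, size=40, overlap=10):
    """Fixed-size word chunks with overlap -- the simplest chunking policy."""
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, max(len(words) - overlap, 1), step)]

def embed(text):
    """Toy bag-of-words 'embedding'; a real system uses a trained model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=3):
    """First-stage retrieval: top-k chunks by embedding similarity."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def rerank(query, candidates):
    """Stand-in for a cross-encoder: score candidates by exact term overlap."""
    q_terms = set(query.lower().split())
    return sorted(candidates,
                  key=lambda c: len(q_terms & set(c.lower().split())),
                  reverse=True)

docs = [
    "The warranty covers battery defects for two years from purchase.",
    "Shipping takes five business days within the EU.",
    "Returns are accepted within thirty days with the original receipt.",
]
chunks = [c for d in docs for c in chunk(d, size=12, overlap=2)]
query = "how long does the warranty cover the battery"
top = rerank(query, retrieve(query, chunks))
```

Keeping chunk-to-document provenance alongside each chunk is what makes the citation requirement tractable later.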
Requirement 4: Agent with Tools
Modules: 21, 22
Build an agent that can use at least 3 external tools (for example: search, calculator, database query, API call, code execution). The agent must demonstrate multi-step reasoning where the output of one tool informs the next action. Include error handling for tool failures and a maximum iteration limit to prevent infinite loops.
- At least 3 distinct tools integrated
- Multi-step reasoning with tool chaining
- Error handling and graceful degradation
- Maximum iteration safety limit
- Logging of agent trajectory (thought, action, observation)
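The loop structure these bullets describe can be sketched as follows. The planner here is a scripted stub standing in for an LLM, and the tool set is hypothetical; the points to copy are the iteration cap, the try/except around tool calls, and the thought/action/observation log:

```python
def run_agent(task, tools, plan, max_iters=5):
    """Minimal ReAct-style loop: the planner picks a tool, the result is
    logged, and tool failures or the iteration cap end the run safely."""
    trajectory = []
    observation = task
    for _ in range(max_iters):
        thought, tool_name, tool_input = plan(observation, trajectory)
        if tool_name == "finish":
            trajectory.append({"thought": thought, "action": "finish",
                               "observation": tool_input})
            return tool_input, trajectory
        try:
            observation = tools[tool_name](tool_input)
        except Exception as exc:  # graceful degradation on tool failure
            observation = f"tool_error: {exc}"
        trajectory.append({"thought": thought, "action": tool_name,
                           "observation": observation})
    return None, trajectory  # iteration cap hit without an answer

# Hypothetical tools and a scripted planner standing in for an LLM.
tools = {
    "search": lambda q: "ACME was founded in 1999",
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
}

def plan(observation, trajectory):
    if not trajectory:
        return "Find the founding year", "search", "ACME founding year"
    if len(trajectory) == 1:
        # Output of the first tool informs the second call (tool chaining).
        year = observation.split()[-1]
        return "Compute company age in 2024", "calculator", f"2024 - {year}"
    return "Done", "finish", f"ACME is {observation} years old"

answer, trajectory = run_agent("How old is ACME in 2024?", tools, plan)
```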
Requirement 5: Deep Research Capability
Modules: 20, 21, 22
Implement a deep research or multi-hop retrieval feature where the system can answer complex questions that require synthesizing information from multiple sources. This could be a research agent that plans queries, retrieves from multiple knowledge bases, cross-references findings, and produces a structured synthesis with citations.
- Multi-source information synthesis
- Query planning and decomposition
- Cross-referencing across retrieved documents
- Structured output with citations from multiple sources
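The skeleton of the plan-retrieve-synthesize flow is small; the hard parts are the LLM planner and the retrievers, both stubbed out here (all names are illustrative). The key design point shown is carrying the source label with every finding so citations survive the synthesis step:

```python
def decompose(question):
    """Stub planner: a real system would ask an LLM to produce sub-queries."""
    return [
        "What revenue did the company report in 2023?",
        "What revenue did the company report in 2022?",
    ]

def research(question, retrievers):
    """Plan sub-queries, retrieve each from every source, and keep
    provenance with every finding for later cross-referencing."""
    findings = []
    for sub in decompose(question):
        for source, search in retrievers.items():
            for passage in search(sub):
                findings.append({"sub_query": sub, "source": source,
                                 "passage": passage})
    return findings

# Two hypothetical knowledge sources with toy keyword-based search.
retrievers = {
    "filings_db": lambda q: (["Revenue 2023: $12M"] if "2023" in q
                             else ["Revenue 2022: $9M"]),
    "news_index": lambda q: (["Press release confirms 2023 revenue of $12M"]
                             if "2023" in q else []),
}
findings = research("How did revenue change from 2022 to 2023?", retrievers)
sources_used = {f["source"] for f in findings}
```

Cross-referencing then becomes a comparison over findings that share a sub-query but come from different sources.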
Requirement 6: Production Deployment
Modules: 26, 27
Deploy the system as a running service (cloud VM, container, serverless, or equivalent). The deployment must include a health check endpoint, structured logging, and the ability to handle at least 5 concurrent requests. Provide deployment instructions that allow a reviewer to reproduce the deployment.
- Running service accessible via API or web interface
- Health check endpoint
- Structured logging (JSON format recommended)
- Support for concurrent requests
- Reproducible deployment instructions (Docker preferred)
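The health check and API surface depend on your framework of choice, but structured logging can be shown framework-free. A minimal JSON formatter using only the standard library (the `capstone` logger name is illustrative):

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, as the requirement recommends."""
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

# Wire the formatter to an in-memory stream so the output is inspectable.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("capstone")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

logger.info("request served")
entry = json.loads(stream.getvalue())
```

In production you would also include a timestamp and a request ID in each record so log lines can be correlated across concurrent requests.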
Requirement 7: Security and Safety
Module: 26
Implement at least 3 safety mechanisms, including input validation (prompt injection defense), output filtering (toxicity detection and PII redaction), and rate limiting. Document the threat model for your specific use case, explaining which risks are mitigated and which are accepted, with justification for each accepted risk.
- Input validation with prompt injection defense
- Output filtering (toxicity detection, PII redaction)
- Rate limiting
- Documented threat model
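Of the three mechanisms, rate limiting is the most self-contained to illustrate. A token-bucket sketch with an injectable clock so the behavior is deterministic (class and parameter names are illustrative):

```python
import time

class TokenBucket:
    """Per-client rate limiter: refills `rate` tokens/second, bursts up to
    `capacity`. One bucket per API key or client IP is typical."""
    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.tokens = capacity
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Fake clock makes the behavior testable without real waiting.
t = [0.0]
bucket = TokenBucket(rate=1, capacity=2, clock=lambda: t[0])
first, second, third = bucket.allow(), bucket.allow(), bucket.allow()
t[0] = 1.0  # one second later, one token has refilled
fourth = bucket.allow()
```

Input validation and output filtering, by contrast, usually combine pattern-based checks with model-based classifiers, and belong in your threat model discussion.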
Requirement 8: Evaluation Suite
Module: 25
Design and run a comprehensive evaluation suite that covers at least 4 of the following: automated metrics (BLEU, ROUGE, BERTScore), LLM-as-Judge evaluation, human evaluation (at least 50 examples), RAG-specific metrics (context relevance, faithfulness, answer relevance), agent trajectory evaluation, latency and throughput benchmarks, and adversarial testing (red teaming).
- At least 4 evaluation methods from the list above
- Statistical reporting (confidence intervals or significance tests)
- Comparison across at least 2 system configurations
- Results presented in tables and visualizations
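For the statistical-reporting bullet, a percentile bootstrap is one simple way to attach a confidence interval to a mean score without distributional assumptions. The scores below are hypothetical per-example correctness judgments:

```python
import random
import statistics

def bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean score."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return statistics.fmean(scores), (lo, hi)

# Hypothetical per-example scores from an LLM-as-Judge run (1 = correct).
scores = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
mean, (lo, hi) = bootstrap_ci(scores)
```

Reporting "0.75 (95% CI 0.55-0.90)" instead of a bare 0.75 is exactly the kind of honest reporting evaluators look for, especially when comparing two system configurations whose intervals may overlap.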
Requirement 9: Hybrid Architecture
Modules: 09, 10, 14, 19
The system must use at least 2 different models or model configurations in a meaningful way. Examples: a small model for classification/routing plus a large model for generation; a fine-tuned model for domain tasks plus a general model for open-ended questions; an embedding model plus a generation model with a reranker in between.
- At least 2 distinct models used for different purposes
- Clear justification for why each model is used
- Cost and latency comparison across the hybrid configuration
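The small-model-routes, large-model-generates pattern reduces to a routing function. This sketch uses a crude length-and-keyword heuristic where a trained classifier would normally sit; the stub models and marker words are all hypothetical:

```python
def route(query, cheap_model, strong_model):
    """Send short factual lookups to the cheap model, open-ended or
    multi-step questions to the strong one."""
    complex_markers = ("why", "compare", "explain", "step")
    if len(query.split()) > 15 or any(m in query.lower() for m in complex_markers):
        return strong_model(query), "strong"
    return cheap_model(query), "cheap"

# Stubs standing in for, e.g., a small local model and a large API model.
cheap = lambda q: f"[small-model] {q}"
strong = lambda q: f"[large-model] {q}"

_, tier_a = route("capital of France", cheap, strong)
_, tier_b = route("Compare the two retrieval strategies and explain the tradeoffs",
                  cheap, strong)
```

Logging which tier served each request gives you the per-tier cost and latency data the comparison bullet asks for.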
Requirement 10: ROI Analysis
Module: 27
Provide a business case for the system including cost breakdown (compute, API, development, maintenance), estimated value (labor savings, quality improvement, revenue impact), ROI calculation, and payback period. Use realistic assumptions and document all sources for cost and value estimates.
- Itemized cost breakdown (development, infrastructure, API, maintenance)
- Value estimation with documented assumptions
- 12-month ROI calculation
- Payback period analysis
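The core arithmetic is straightforward once the estimates are pinned down. A sketch with purely illustrative numbers (every figure in your report must come from documented sources):

```python
def roi_analysis(monthly_cost, monthly_value, upfront_cost, horizon_months=12):
    """12-month ROI percentage and payback period in months."""
    total_cost = upfront_cost + monthly_cost * horizon_months
    total_value = monthly_value * horizon_months
    roi_pct = 100 * (total_value - total_cost) / total_cost
    net_monthly = monthly_value - monthly_cost
    payback_months = (upfront_cost / net_monthly
                      if net_monthly > 0 else float("inf"))
    return round(roi_pct, 1), round(payback_months, 1)

# Illustrative: $30k development, $2k/month to run, $6k/month in labor savings.
roi, payback = roi_analysis(monthly_cost=2_000, monthly_value=6_000,
                            upfront_cost=30_000)
```

Running the same function over optimistic, expected, and pessimistic estimates is a cheap way to show your assumptions' sensitivity.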
Requirement 11: Risk and Governance
Module: 26
Document the risks specific to your system (hallucination, bias, privacy, security, operational) and the governance framework you would recommend for production operation. Include an incident response plan for the most likely failure modes and a model update policy.
- Risk register with likelihood and impact ratings
- Governance framework recommendation
- Incident response plan for top 3 failure modes
- Model update and version management policy
Not every requirement needs to be implemented to the same depth. Choose 3 to 4 requirements that align most closely with your use case and implement them deeply; the remaining requirements must still work at a basic, demonstrable level. The technical report should explain your prioritization decisions and why certain components received more investment than others.
C.2: Deliverables
Deliverable 1: GitHub Repository
The repository is the primary artifact of the capstone. It must contain all source code, configuration files, and documentation needed to understand, reproduce, and evaluate the system.
| Component | Required Contents |
|---|---|
| README.md | Project overview, architecture diagram, setup instructions, usage examples |
| src/ | Clean, documented source code with type hints and docstrings |
| tests/ | Unit tests and integration tests with at least 60% code coverage |
| evaluation/ | Evaluation scripts, metrics collection, and result analysis notebooks |
| deployment/ | Dockerfile, docker-compose.yml, or equivalent deployment configs |
| docs/ | Architecture decision records, API documentation |
| data/ | Sample data (not full dataset; link to HF Hub for full version) |
| configs/ | Model configs, prompt templates, environment variables template |
Deliverable 2: Hugging Face Hub Artifacts
Publish two artifacts to Hugging Face Hub: the fine-tuned model (or adapter weights) and the curated/synthetic dataset. Both must include complete cards following Hugging Face conventions.
- Model card: model description, intended use, training procedure, evaluation results, limitations, bias analysis, compute resources used
- Dataset card: dataset description, creation process, column descriptions, data splits, intended use, known limitations, licensing information
Publishing to Hugging Face Hub is not just a deliverable checkbox. It demonstrates that your model and dataset are reproducible, documented, and usable by others. Incomplete model cards or datasets without proper documentation will receive reduced scores even if the underlying artifacts are technically sound.
Deliverable 3: Technical Report
The technical report is an 8 to 12 page document (excluding appendices) that covers the full lifecycle of the project. It is structured as follows:
- Introduction (1 page): Problem statement, use case description, target users, and success criteria
- Related Work (0.5 page): Brief survey of similar systems or approaches
- Architecture (2 pages): System design, component interactions, technology choices with justification
- Data and Model (1.5 pages): Dataset creation, model selection, training procedure, adaptation technique
- Evaluation (2 pages): Methodology, results tables, visualizations, statistical analysis
- Deployment and Operations (1 page): Infrastructure, monitoring, security measures
- Business Case (1 page): ROI analysis, cost breakdown, risk governance
- Limitations and Future Work (1 page): Honest assessment of weaknesses, failure modes, and planned improvements
The Limitations section is one of the most important parts of the report. Projects that claim everything works perfectly are less credible than those that honestly identify what does not work and explain why. Evaluators specifically look for depth of understanding in the limitations analysis. Describe at least 3 concrete failure modes you discovered during development and how you would address them with more time.
Deliverable 4: Interpretability Analysis
Provide an interpretability analysis of your model's behavior. This can take one or more of the following forms:
- Attention analysis: Visualize attention patterns for representative inputs and explain what the model focuses on
- Token attribution: Use integrated gradients, SHAP, or similar methods to identify which input tokens most influence the output
- Probing experiments: Test whether intermediate representations encode specific features (entity types, sentiment, factual knowledge)
- Behavioral testing: Systematically test model behavior across controlled input variations (CheckList-style testing)
- Error analysis: Categorize failure modes and identify patterns in inputs that cause errors
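As a sense of the reporting format, here is a leave-one-out attribution sketch: drop each input token and measure how much the model's score falls. This is far cruder than integrated gradients or SHAP, and the toy scorer stands in for a real model, but the output shape (a per-token influence map) is the same:

```python
def token_attribution(tokens, score_fn):
    """Leave-one-out attribution: base score minus the score with the
    token removed. Positive values mean the token supports the output."""
    base = score_fn(tokens)
    return {
        tok: round(base - score_fn(tokens[:i] + tokens[i + 1:]), 3)
        for i, tok in enumerate(tokens)
    }

# Toy scorer standing in for a real model: fraction of sentiment-bearing words.
positive = {"great", "excellent"}
score = lambda toks: sum(t in positive for t in toks) / max(len(toks), 1)

attributions = token_attribution(["the", "food", "was", "great"], score)
```

For a real model, `score_fn` would be the probability the model assigns to its own prediction, and you would aggregate attributions over many representative inputs rather than a single example.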
Deliverable 5: Live Demo
Provide a live, accessible demo of the system, or a recorded screencast (5 to 10 minutes) walking through the key features. The demo must show:
- A simple query that demonstrates basic functionality
- A complex, multi-step query that exercises the agent and RAG components
- A failure case showing how the system handles errors gracefully
- The monitoring or observability dashboard (if applicable)
Deliverable 6: Presentation
Prepare and deliver a 15-minute presentation (12 minutes of content plus 3 minutes for questions). The presentation should be accessible to a mixed audience of technical engineers and business stakeholders.
| Section | Time | Content |
|---|---|---|
| Motivation | 2 min | Problem, use case, why it matters |
| Architecture | 3 min | System design, key technology choices |
| Demo | 3 min | Live or recorded walkthrough |
| Evaluation | 2 min | Key results, comparison metrics |
| Lessons Learned | 2 min | What worked, what failed, what you would change |
| Q&A | 3 min | Questions from evaluators |
Scoring Rubric
| Category | Weight | What Evaluators Look For |
|---|---|---|
| System Integration | 25% | All components work together; clean interfaces; no manual glue |
| Evaluation Rigor | 20% | Multiple evaluation methods; statistical analysis; honest reporting |
| Technical Depth | 20% | At least 3 requirements implemented deeply; strong code quality |
| Production Readiness | 15% | Deployment, monitoring, safety, error handling |
| Documentation | 10% | Technical report quality; model/dataset cards; code documentation |
| Presentation | 10% | Clear communication; appropriate level for mixed audience |
System Integration carries the highest weight (25%) because it is the hardest skill to develop and the most valuable in practice. An excellent RAG system that cannot connect to the agent, or a well-trained model that has no evaluation suite, demonstrates component-level skill but not the systems thinking that production work requires.
Example Project Ideas
The following are starting points for inspiration. Strong projects adapt these ideas to a specific domain or add unique twists that demonstrate personal technical depth.
| Project | Domain | Key Technical Challenges |
|---|---|---|
| Legal Document Analyst | Legal Tech | Long-context RAG, citation accuracy, hallucination prevention |
| Code Review Assistant | Developer Tools | Multi-file context, code-specific evaluation, tool use (linters, tests) |
| Medical Triage Chatbot | Healthcare | Safety-critical outputs, evidence-based citations, regulatory compliance |
| Financial Research Agent | Finance | Multi-source research, numerical reasoning, real-time data tools |
| Educational Tutor | EdTech | Adaptive difficulty, misconception detection, Socratic dialogue |
| Customer Support System | SaaS | Ticket routing, knowledge base RAG, escalation logic, CSAT measurement |
Choose a project where you have genuine interest or domain expertise. The capstone requires sustained effort over 4 to 6 weeks, and projects driven by curiosity consistently produce better results than projects chosen purely for technical impressiveness. Your domain knowledge will also make the evaluation more meaningful because you can assess whether the system's outputs are actually correct and useful.