Benchmark suite for evaluating how AI coding agents leverage external context tools on software engineering tasks across the SDLC. Developed as the reproducibility artifact for the paper "CodeContextBench: A Systematic Evaluation Framework for Assessing the Impact of Enhanced Code Intelligence on AI Coding Agent Performance."
This repository contains benchmark task definitions, evaluation configs, and a metrics-extraction pipeline. Tasks are executed via the Harbor runner with the Claude Code agent harness. It is intended for:
- Researchers evaluating coding agents on realistic software engineering tasks
- Practitioners comparing baseline vs MCP-enabled agent configurations
- Contributors authoring new benchmark tasks or extending evaluation tooling
You can inspect task definitions, run validation and analysis scripts, and use the metrics/report pipeline on existing Harbor run outputs.
git clone https://github.com/sourcegraph/CodeContextBench.git
cd CodeContextBench
# Fast repo sanity check (docs/config refs)
python3 scripts/repo_health.py --quick
# Explore task-based docs navigation
sed -n '1,120p' docs/START_HERE_BY_TASK.md
# Inspect available benchmark suites
ls benchmarks

Running benchmark tasks requires:
- Harbor installed and configured
- Docker
- Valid agent/runtime credentials used by your Harbor setup
- A Claude Max subscription (for the default harness path documented in this repo)
Recommended pre-run checks:
python3 scripts/check_infra.py
python3 scripts/validate_tasks_preflight.py --all

Then start with a dry run:
bash configs/run_selected_tasks.sh --dry-run

Key docs:
- docs/START_HERE_BY_TASK.md for task-oriented navigation
- docs/reference/CONFIGS.md for the 2-config evaluation matrix
- docs/EVALUATION_PIPELINE.md for scoring and reporting outputs
- docs/REPO_HEALTH.md for the pre-push health gate
Eight suites organized by software development lifecycle phase:
| Suite | SDLC Phase | Tasks | Description |
|---|---|---|---|
| ccb_understand | Requirements & Discovery | 20 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
| ccb_design | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
| ccb_fix | Bug Repair | 25 | Diagnosing and fixing real bugs across production codebases |
| ccb_build | Feature & Refactoring | 25 | New features, refactoring, dependency management |
| ccb_test | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
| ccb_document | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
| ccb_secure | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
| ccb_debug | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
| Total | | 170 | |
Eleven additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.
| Suite | Category | Tasks | Description |
|---|---|---|---|
| ccb_mcp_crossrepo_tracing | A: Dependency Tracing | 9 | Cross-repo dependency chains, blast radius, symbol resolution |
| ccb_mcp_security | B: Vulnerability Remediation | 10 | CVE mapping, missing auth middleware across repos |
| ccb_mcp_migration | C: Framework Migration | 7 | API migrations, breaking changes across repos |
| ccb_mcp_incident | D: Incident Debugging | 11 | Error-to-code-path tracing across microservices |
| ccb_mcp_onboarding | E: Onboarding & Comprehension | 11 | API consumption mapping, end-to-end flow, architecture maps |
| ccb_mcp_compliance | F: Compliance | 7 | Standards adherence, audit, and provenance workflows |
| ccb_mcp_crossorg | G: Cross-Org Discovery | 5 | Interface implementations and authoritative repo identification across orgs |
| ccb_mcp_domain | H: Domain Lineage | 10 | Config propagation, architecture patterns, domain analysis |
| ccb_mcp_org | I: Organizational Context | 5 | Agentic discovery, org-wide coding correctness |
| ccb_mcp_platform | J: Platform Knowledge | 5 | Service template discovery and tribal knowledge |
| ccb_mcp_crossrepo | Legacy | 1 | Cross-repo discovery (compatibility) |
| Total | | 81 | |
Combined catalog total: 251 tasks (170 SDLC + 81 MCP-unique). Of these, 212 are fully paired (baseline + MCP results) in official runs; the remaining 39 MCP-unique tasks have MCP results but are missing baselines.
Both baseline and MCP-Full agents have access to all repos in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
See docs/MCP_UNIQUE_TASKS.md for the full task system, authoring guide, and oracle evaluation framework. See docs/MCP_UNIQUE_CALIBRATION.md for oracle coverage analysis.
All benchmarks are evaluated across two paper-level configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:
- SDLC suites (ccb_build, ccb_fix, etc.): baseline-local-direct + mcp-remote-direct
- MCP-unique suites (ccb_mcp_*): baseline-local-artifact + mcp-remote-artifact
Legacy run directory names (baseline, sourcegraph_full, artifact_full) may still appear in historical outputs and are handled by analysis scripts.
At the paper level, the distinction is still:
| Paper Config Name | Internal MCP mode | MCP Tools Available |
|---|---|---|
| Baseline | none | None (agent uses only built-in tools) |
| MCP-Full | sourcegraph_full / artifact_full (task-dependent) | All 13 Sourcegraph MCP tools, including sg_deepsearch and sg_deepsearch_read |
See docs/reference/CONFIGS.md for the canonical configuration matrix and tool-by-tool breakdown. (docs/CONFIGS.md is a compatibility stub.)
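As a concrete illustration, the label collapsing described above fits in a few lines. This is a hypothetical sketch mirroring what the analysis scripts are said to do when they encounter legacy run-directory names; it is not the repo's actual code.

```python
# Hypothetical mapping from concrete run-config and legacy run-directory
# names to the two paper-level configurations. The name lists come from
# this README; the helper itself is illustrative only.
PAPER_CONFIG = {
    # SDLC suites
    "baseline-local-direct": "Baseline",
    "mcp-remote-direct": "MCP-Full",
    # MCP-unique suites
    "baseline-local-artifact": "Baseline",
    "mcp-remote-artifact": "MCP-Full",
    # Legacy names seen in historical outputs
    "baseline": "Baseline",
    "sourcegraph_full": "MCP-Full",
    "artifact_full": "MCP-Full",
}

def paper_config(run_config: str) -> str:
    """Collapse a concrete run-config name to its paper-level config."""
    try:
        return PAPER_CONFIG[run_config]
    except KeyError:
        raise ValueError(f"unknown run config: {run_config}") from None

print(paper_config("mcp-remote-direct"))  # MCP-Full
print(paper_config("sourcegraph_full"))   # MCP-Full
```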
benchmarks/ # Task definitions organized by SDLC phase + MCP-unique
ccb_build/ # Feature & Refactoring (25 tasks)
ccb_debug/ # Debugging & Investigation (20 tasks)
ccb_design/ # Architecture & Design (20 tasks)
ccb_document/ # Documentation (20 tasks)
ccb_fix/ # Bug Repair (25 tasks)
ccb_secure/ # Security & Compliance (20 tasks)
ccb_test/ # Testing & QA (20 tasks)
ccb_understand/ # Requirements & Discovery (20 tasks)
ccb_mcp_compliance/ # MCP-unique: compliance & audit (7 tasks)
ccb_mcp_crossorg/ # MCP-unique: cross-org discovery (5 tasks)
ccb_mcp_crossrepo/ # MCP-unique: legacy cross-repo (1 task)
ccb_mcp_crossrepo_tracing/ # MCP-unique: dependency tracing (9 tasks)
ccb_mcp_domain/ # MCP-unique: domain lineage (10 tasks)
ccb_mcp_incident/ # MCP-unique: incident debugging (11 tasks)
ccb_mcp_migration/ # MCP-unique: framework migration (7 tasks)
ccb_mcp_onboarding/ # MCP-unique: onboarding (11 tasks)
ccb_mcp_org/ # MCP-unique: org context (5 tasks)
ccb_mcp_platform/ # MCP-unique: platform knowledge (5 tasks)
ccb_mcp_security/ # MCP-unique: vulnerability remediation (10 tasks)
configs/ # Run configs and task selection
_common.sh # Shared infra: token refresh, parallel execution, multi-account
sdlc_suite_2config.sh # Generic SDLC runner (used by phase wrappers below)
build_2config.sh # Phase wrapper: Build (25 tasks)
debug_2config.sh # Phase wrapper: Debug (20 tasks)
design_2config.sh # Phase wrapper: Design (20 tasks)
document_2config.sh # Phase wrapper: Document (20 tasks)
fix_2config.sh # Phase wrapper: Fix (25 tasks)
secure_2config.sh # Phase wrapper: Secure (20 tasks)
test_2config.sh # Phase wrapper: Test (20 tasks)
understand_2config.sh # Phase wrapper: Understand (20 tasks)
run_selected_tasks.sh # Unified runner for all tasks
validate_one_per_benchmark.sh # Pre-flight smoke (1 task per suite)
selected_benchmark_tasks.json # Canonical SDLC task selection with metadata
selected_mcp_unique_tasks.json # MCP-unique task selection with metadata
use_case_registry.json # 100 GTM use cases (MCP-unique task source)
archive/ # Pre-SDLC migration scripts (preserved for history)
scripts/ # Metrics extraction, evaluation, and operational tooling
ccb_metrics/ # Python package: models, extractors, discovery, judge context
generate_eval_report.py # CLI: deterministic evaluation report generator
aggregate_status.py # Core run scanner (status, errors, watch mode)
status_fingerprints.py # Error classification (12 regex patterns)
validate_tasks_preflight.py # Pre-flight task validation
validate_task_run.py # Post-run validation
check_infra.py # Infrastructure readiness checker
compare_configs.py # Cross-config divergence analysis
cost_report.py # Token/cost aggregation
sync_task_metadata.py # task.toml vs selection registry reconciliation
generate_manifest.py # Rebuild MANIFEST from on-disk results
archive_run.py # Archive old runs to save disk
rerun_failed.py # Generate rerun commands for failed tasks
abc_audit.py # ABC benchmark quality audit framework
abc_score_task.py # Per-task quality scoring
abc_criteria.py # ABC criteria data model (32 criteria)
docs_consistency_check.py # Documentation drift guard
tests/ # Unit tests for scripts/
test_abc_audit.py # Tests for ABC audit framework
test_abc_criteria.py # Tests for ABC criteria data model
test_abc_score_task.py # Tests for task quality scorer
test_extract_task_metrics.py # Tests for metrics extraction
docs/ # Operational documentation
CONFIGS.md # Compatibility stub (canonical matrix: reference/CONFIGS.md)
ERROR_CATALOG.md # Known error fingerprints, causes, fixes
QA_PROCESS.md # Quality assurance and validation pipeline
EVALUATION_PIPELINE.md # Unified eval: verifier → judge → statistics → report
TASK_CATALOG.md # Detailed per-task reference
TASK_SELECTION.md # Selection criteria, difficulty calibration, MCP scoring
SCORING_SEMANTICS.md # Reward and pass interpretation per benchmark
MCP_UNIQUE_TASKS.md # MCP-unique task system, authoring, oracle evaluation
MCP_UNIQUE_CALIBRATION.md # Oracle coverage analysis and threshold calibration
WORKFLOW_METRICS.md # Timing/cost metric definitions
AGENT_INTERFACE.md # Runtime I/O contract for agents
EXTENSIBILITY.md # Safe suite/task/config extension guide
LEADERBOARD.md # Ranking policy
SUBMISSION.md # Submission format specification
skills/ # AI agent skill definitions (operational runbooks)
ccb/ # CCB-specific: pre-run, monitoring, triage, analysis, maintenance
general/ # Reusable: workflow tools, agent delegation, dev practices
schemas/ # JSON schemas for MANIFEST.json, task.toml, etc.
Each suite directory contains per-task subdirectories with instruction.md, task.toml, tests/, and ground truth (or solution/). MCP-unique tasks additionally include task_spec.json, oracle_answer.json, and Dockerfile variants for baseline/MCP-only execution.
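A minimal layout check based on this description might look like the following. The file names come from this README; the helper itself is a hypothetical sketch, not part of the repo's tooling.

```python
from pathlib import Path

# Required files per the task layout described above. MCP-unique suites
# (ccb_mcp_*) additionally carry task_spec.json and oracle_answer.json.
COMMON = ["instruction.md", "task.toml"]
MCP_UNIQUE_EXTRA = ["task_spec.json", "oracle_answer.json"]

def missing_task_files(task_dir: Path) -> list[str]:
    """Return required files/dirs missing from a per-task directory."""
    required = list(COMMON)
    # The parent directory is the suite, e.g. benchmarks/ccb_fix/task-001.
    if task_dir.parent.name.startswith("ccb_mcp_"):
        required += MCP_UNIQUE_EXTRA
    missing = [name for name in required if not (task_dir / name).exists()]
    if not (task_dir / "tests").is_dir():
        missing.append("tests/")
    return missing
```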
The scripts/ directory contains a stdlib-only Python 3.10+ pipeline for extracting deterministic metrics from Harbor run output:
# Generate evaluation report from Harbor runs
python3 scripts/generate_eval_report.py \
--runs-dir /path/to/runs/official/ \
--output-dir ./eval_reports/
# Generate LLM judge context files
python3 -m scripts.ccb_metrics.judge_context \
--runs-dir /path/to/runs/official/ \
--benchmarks-dir ./benchmarks/ \
--output-dir ./judge_contexts/

The report generator produces:
- eval_report.json -- full structured report
- REPORT.md -- markdown tables (performance, efficiency, tool utilization)
- harness_configs.json -- exact harness configuration per run
- CSV files per table for downstream analysis
See python3 scripts/generate_eval_report.py --help for all options.
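Because each table is also emitted as a CSV, downstream analysis can start from plain stdlib tooling. This sketch only assumes that the generator writes `*.csv` files into the output directory; the specific file names are whatever it produces.

```python
import csv
from pathlib import Path

def csv_row_counts(report_dir: Path) -> dict[str, int]:
    """Count data rows (excluding the header) in each per-table CSV."""
    counts: dict[str, int] = {}
    for path in sorted(report_dir.glob("*.csv")):
        with path.open(newline="") as fh:
            rows = list(csv.reader(fh))
        counts[path.name] = max(len(rows) - 1, 0)  # subtract header row
    return counts
```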
To export GitHub-friendly official results (valid scored tasks only) with parsed trace summaries and a local browsing UI:
python3 scripts/export_official_results.py \
--runs-dir ./runs/official/ \
--output-dir ./docs/official_results/

This writes:
- docs/official_results/README.md -- run/config score summary
- docs/official_results/runs/*.md -- per-run task tables
- docs/official_results/tasks/*.md -- per-task metrics + parsed tool/trace view
- docs/official_results/data/official_results.json -- machine-readable dataset
- docs/official_results/audits/*.json -- per-task audit artifacts (checksums + parsed trace events)
- docs/official_results/traces/*/trajectory.json -- bundled raw trajectory traces for GitHub audit
- docs/official_results/index.html -- interactive local browser
Suite summaries are deduplicated to the latest result per
suite + config + task_name; full historical rows remain in
official_results.json under all_tasks.
For SDLC suites, export normalizes legacy config labels:
baseline -> baseline-local-direct, mcp -> mcp-remote-direct.
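The dedup-plus-normalization rule can be sketched as follows. The suite/config/task_name key and the legacy label mapping come from this README; the finished_at ordering field is a hypothetical name for whatever timestamp the export actually orders by.

```python
# Legacy SDLC config labels normalized during export (per this README).
LEGACY = {"baseline": "baseline-local-direct", "mcp": "mcp-remote-direct"}

def dedupe_latest(rows: list[dict]) -> list[dict]:
    """Keep only the latest row per (suite, config, task_name)."""
    latest: dict[tuple, dict] = {}
    for row in rows:
        config = LEGACY.get(row["config"], row["config"])
        row = {**row, "config": config}
        key = (row["suite"], config, row["task_name"])
        prev = latest.get(key)
        # ISO-8601 timestamps compare correctly as strings.
        if prev is None or row["finished_at"] > prev["finished_at"]:
            latest[key] = row
    return list(latest.values())
```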
Serve locally:
python3 scripts/export_official_results.py --serve

For the full multi-layer evaluation pipeline (verifier, LLM judge, statistical analysis, dual-score reporting), see docs/EVALUATION_PIPELINE.md.
This section assumes Harbor is already installed and configured. If not, start with the Quickstart section above and python3 scripts/check_infra.py.
The unified runner executes all 170 SDLC tasks across the 2-config matrix:
# Run all 170 SDLC tasks across 2 configs
bash configs/run_selected_tasks.sh
# Run only the baseline config
bash configs/run_selected_tasks.sh --baseline-only
# Run a single SDLC phase
bash configs/run_selected_tasks.sh --benchmark ccb_fix
# Dry run to list tasks without executing
bash configs/run_selected_tasks.sh --dry-run

Per-phase runners are also available:
bash configs/fix_2config.sh # 25 Bug Repair tasks
bash configs/build_2config.sh # 25 Feature & Refactoring tasks
bash configs/understand_2config.sh # 20 Requirements & Discovery tasks
bash configs/design_2config.sh # 20 Architecture & Design tasks
bash configs/debug_2config.sh # 20 Debugging & Investigation tasks
bash configs/secure_2config.sh # 20 Security & Compliance tasks
bash configs/test_2config.sh # 20 Testing & QA tasks
bash configs/document_2config.sh # 20 Documentation tasks

MCP-unique tasks use a separate selection file:
# Run all MCP-unique tasks across 2 configs
bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json
# Filter by use-case category
bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json --benchmark ccb_mcp_security
# Dry run
bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json --dry-run

All runners support --baseline-only, --full-only, --task TASK_ID, and --parallel N flags.
CodeContextBench includes a multi-stage QA pipeline to ensure task integrity, reproducible runs, and accurate scoring.
| Phase | Script | Purpose |
|---|---|---|
| Pre-flight | scripts/validate_tasks_preflight.py | Catches truncated instructions, template placeholders, language/difficulty mismatches, missing test.sh |
| Infra check | scripts/check_infra.py | Verifies OAuth tokens (all accounts), Docker, disk space, Harbor CLI |
| Error fingerprinting | scripts/status_fingerprints.py | Classifies failures with 12 regex patterns; auto-retry guidance per pattern |
| Post-run | scripts/validate_task_run.py | Flags crashes, MCP tool usage anomalies, suspicious scoring |
| Metadata sync | scripts/sync_task_metadata.py | Keeps task.toml in sync with selected_benchmark_tasks.json; --fix to auto-update |
| Run analysis | scripts/aggregate_status.py | Scans run dirs, classifies per-task status, writes status.json, supports --watch mode |
The QA methodology uses a 6-dimension audit framework: instruction contamination, reproducibility, verifier correctness, ghost/false-positive detection, error misclassification, and tool effectiveness analysis.
See docs/QA_PROCESS.md for the full pipeline documentation and docs/ERROR_CATALOG.md for the known error catalog.
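In the spirit of scripts/status_fingerprints.py, fingerprint classification reduces to first-match regex scanning over failure logs. The two patterns and log lines below are made-up stand-ins for the real 12-pattern catalog.

```python
import re

# Ordered (fingerprint, pattern) pairs; the first match wins.
# These examples are illustrative, not the repo's actual patterns.
FINGERPRINTS = [
    ("timeout", re.compile(r"timed? ?out|deadline exceeded", re.I)),
    ("docker", re.compile(r"docker|container", re.I)),
]

def classify(log_text: str) -> str:
    """Return the first matching fingerprint name, else 'unclassified'."""
    for name, pattern in FINGERPRINTS:
        if pattern.search(log_text):
            return name
    return "unclassified"

print(classify("Task timed out after 3600s"))  # timeout
```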
Key scripts organized by workflow phase:
| Phase | Script | Usage |
|---|---|---|
| Pre-run | validate_tasks_preflight.py | python3 scripts/validate_tasks_preflight.py [--suite ccb_fix] [--task sgt-001] |
| Pre-run | check_infra.py | python3 scripts/check_infra.py |
| During run | aggregate_status.py --since 2h | python3 scripts/aggregate_status.py --since 2h |
| Post-run | aggregate_status.py | python3 scripts/aggregate_status.py [--watch] |
| Post-run | validate_task_run.py | python3 scripts/validate_task_run.py <run_dir> |
| Analysis | compare_configs.py | python3 scripts/compare_configs.py |
| Analysis | cost_report.py | python3 scripts/cost_report.py |
| Analysis | generate_manifest.py | python3 scripts/generate_manifest.py |
| Maintenance | sync_task_metadata.py | python3 scripts/sync_task_metadata.py [--fix] |
| Maintenance | archive_run.py | python3 scripts/archive_run.py <run_dir> [--compress] |
| Maintenance | rerun_failed.py | python3 scripts/rerun_failed.py [--fingerprint timeout] [--suite ccb_fix] |
The skills/ directory contains structured runbooks for AI coding agents operating on this repository. These encode operational workflows — infrastructure checks, task validation, failure triage, report generation — so any agent (Claude Code, Cursor, Copilot, etc.) can follow them autonomously.
| Category | Skills | Description |
|---|---|---|
| CCB Operations | 20 skills in 6 files | Pre-run checks, monitoring, triage, analysis, maintenance, task authoring |
| General Purpose | 11 skills in 4 files | Session management, agent delegation, search patterns, dev practices |
Skills are plain markdown and tool-agnostic. See skills/README.md for the full index and integration guides for Cursor, Claude Code, and other agents. See docs/SKILLS.md for background on the skills system.
See LICENSE.