Benchmark suite for evaluating how AI coding agents leverage external context tools on software engineering tasks across the SDLC. Developed as the reproducibility artifact for the paper "CodeContextBench: A Systematic Evaluation Framework for Assessing the Impact of Enhanced Code Intelligence on AI Coding Agent Performance."
This repository contains benchmark task definitions, evaluation configs, and a metrics-extraction pipeline. Tasks are executed via the Harbor runner with the Claude Code agent harness. It is intended for:
- Researchers evaluating coding agents on realistic software engineering tasks
- Practitioners comparing baseline vs MCP-enabled agent configurations
- Contributors authoring new benchmark tasks or extending evaluation tooling
You can inspect task definitions, run validation and analysis scripts, and use the metrics/report pipeline on existing Harbor run outputs.
git clone https://github.com/sourcegraph/CodeContextBench.git
cd CodeContextBench
# Fast repo sanity check (docs/config refs)
python3 scripts/repo_health.py --quick
# Explore task-based docs navigation
sed -n '1,120p' docs/START_HERE_BY_TASK.md
# Inspect available benchmark suites
ls benchmarks

Running benchmark tasks requires:
- Harbor installed and configured
- Docker
- Valid agent/runtime credentials used by your Harbor setup
- A Claude Max subscription (for the default harness path documented in this repo)
Recommended pre-run checks:
python3 scripts/check_infra.py
python3 scripts/validate_tasks_preflight.py --all

Then start with a dry run:
bash configs/run_selected_tasks.sh --dry-run

Key docs:
- docs/START_HERE_BY_TASK.md for task-oriented navigation
- docs/reference/CONFIGS.md for the 2-config evaluation matrix
- docs/EVALUATION_PIPELINE.md for scoring and reporting outputs
- docs/REPO_HEALTH.md for the pre-push health gate
Eight suites organized by software development lifecycle phase:
| Suite | SDLC Phase | Tasks | Description |
|---|---|---|---|
| ccb_understand | Requirements & Discovery | 20 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
| ccb_design | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
| ccb_fix | Bug Repair | 25 | Diagnosing and fixing real bugs across production codebases |
| ccb_build | Feature & Refactoring | 25 | New features, refactoring, dependency management |
| ccb_test | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
| ccb_document | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
| ccb_secure | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
| ccb_debug | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
| Total | | 170 | |
Eleven additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.
| Suite | Category | Tasks | Description |
|---|---|---|---|
| ccb_mcp_crossrepo_tracing | A: Dependency Tracing | 9 | Cross-repo dependency chains, blast radius, symbol resolution |
| ccb_mcp_security | B: Vulnerability Remediation | 10 | CVE mapping, missing auth middleware across repos |
| ccb_mcp_migration | C: Framework Migration | 7 | API migrations, breaking changes across repos |
| ccb_mcp_incident | D: Incident Debugging | 11 | Error-to-code-path tracing across microservices |
| ccb_mcp_onboarding | E: Onboarding & Comprehension | 11 | API consumption mapping, end-to-end flow, architecture maps |
| ccb_mcp_compliance | F: Compliance | 7 | Standards adherence, audit, and provenance workflows |
| ccb_mcp_crossorg | G: Cross-Org Discovery | 5 | Interface implementations and authoritative repo identification across orgs |
| ccb_mcp_domain | H: Domain Lineage | 10 | Config propagation, architecture patterns, domain analysis |
| ccb_mcp_org | I: Organizational Context | 5 | Agentic discovery, org-wide coding correctness |
| ccb_mcp_platform | J: Platform Knowledge | 5 | Service template discovery and tribal knowledge |
| ccb_mcp_crossrepo | Legacy | 1 | Cross-repo discovery (compatibility) |
| Total | | 81 | |
Combined catalog total: 251 tasks (170 SDLC + 81 MCP-unique). Of these, 212 are fully paired (baseline + MCP results) in official runs; the remaining 39 MCP-unique tasks have MCP results but are missing baselines.
Both baseline and MCP-Full agents have access to all repos in each task's fixture. The only difference is the method: baseline reads code locally, MCP-Full uses Sourcegraph MCP tools (local code is truncated). This ensures we measure whether MCP tools help agents work better — not whether MCP can access repos the baseline can't.
See docs/MCP_UNIQUE_TASKS.md for the full task system, authoring guide, and oracle evaluation framework. See docs/MCP_UNIQUE_CALIBRATION.md for oracle coverage analysis.
All benchmarks are evaluated across two paper-level configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:
- SDLC suites (ccb_build, ccb_fix, etc.): baseline-local-direct + mcp-remote-direct
- MCP-unique suites (ccb_mcp_*): baseline-local-artifact + mcp-remote-artifact
Legacy run directory names (baseline, sourcegraph_full, artifact_full) may still appear in historical outputs and are handled by analysis scripts.
At the paper level, the distinction is still:
| Paper Config Name | Internal MCP mode | MCP Tools Available |
|---|---|---|
| Baseline | none | None (agent uses only built-in tools) |
| MCP-Full | sourcegraph_full / artifact_full (task-dependent) | All 13 Sourcegraph MCP tools, including sg_deepsearch and sg_deepsearch_read |
See docs/reference/CONFIGS.md for the canonical configuration matrix and tool-by-tool breakdown. (docs/CONFIGS.md is a compatibility stub.)
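As a concrete illustration, the label collapsing described above fits in a few lines. This is a hypothetical sketch mirroring what the analysis scripts are said to do when they encounter legacy run-directory names; it is not the repo's actual code.

```python
# Hypothetical mapping from concrete run-config and legacy run-directory
# names to the two paper-level configurations. The name lists come from
# this README; the helper itself is illustrative only.
PAPER_CONFIG = {
    # SDLC suites
    "baseline-local-direct": "Baseline",
    "mcp-remote-direct": "MCP-Full",
    # MCP-unique suites
    "baseline-local-artifact": "Baseline",
    "mcp-remote-artifact": "MCP-Full",
    # Legacy names seen in historical outputs
    "baseline": "Baseline",
    "sourcegraph_full": "MCP-Full",
    "artifact_full": "MCP-Full",
}

def paper_config(run_config: str) -> str:
    """Collapse a concrete run-config name to its paper-level config."""
    try:
        return PAPER_CONFIG[run_config]
    except KeyError:
        raise ValueError(f"unknown run config: {run_config}") from None

print(paper_config("mcp-remote-direct"))  # MCP-Full
print(paper_config("sourcegraph_full"))   # MCP-Full
```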
benchmarks/ # Task definitions organized by SDLC phase + MCP-unique
ccb_build/ # Feature & Refactoring (25 tasks)
ccb_debug/ # Debugging & Investigation (20 tasks)
ccb_design/ # Architecture & Design (20 tasks)
ccb_document/ # Documentation (20 tasks)
ccb_fix/ # Bug Repair (25 tasks)
ccb_secure/ # Security & Compliance (20 tasks)
ccb_test/ # Testing & QA (20 tasks)
ccb_understand/ # Requirements & Discovery (20 tasks)
ccb_mcp_compliance/ # MCP-unique: compliance & audit (7 tasks)
ccb_mcp_crossorg/ # MCP-unique: cross-org discovery (5 tasks)
ccb_mcp_crossrepo/ # MCP-unique: legacy cross-repo (1 task)
ccb_mcp_crossrepo_tracing/ # MCP-unique: dependency tracing (9 tasks)
ccb_mcp_domain/ # MCP-unique: domain lineage (10 tasks)
ccb_mcp_incident/ # MCP-unique: incident debugging (11 tasks)
ccb_mcp_migration/ # MCP-unique: framework migration (7 tasks)
ccb_mcp_onboarding/ # MCP-unique: onboarding (11 tasks)
ccb_mcp_org/ # MCP-unique: org context (5 tasks)
ccb_mcp_platform/ # MCP-unique: platform knowledge (5 tasks)
ccb_mcp_security/ # MCP-unique: vulnerability remediation (10 tasks)
configs/ # Run configs and task selection
_common.sh # Shared infra: token refresh, parallel execution, multi-account
sdlc_suite_2config.sh # Generic SDLC runner (used by phase wrappers below)
build_2config.sh # Phase wrapper: Build (25 tasks)
debug_2config.sh # Phase wrapper: Debug (20 tasks)
design_2config.sh # Phase wrapper: Design (20 tasks)
document_2config.sh # Phase wrapper: Document (20 tasks)
fix_2config.sh # Phase wrapper: Fix (25 tasks)
secure_2config.sh # Phase wrapper: Secure (20 tasks)
test_2config.sh # Phase wrapper: Test (20 tasks)
understand_2config.sh # Phase wrapper: Understand (20 tasks)
run_selected_tasks.sh # Unified runner for all tasks
validate_one_per_benchmark.sh # Pre-flight smoke (1 task per suite)
selected_benchmark_tasks.json # Canonical SDLC task selection with metadata
selected_mcp_unique_tasks.json # MCP-unique task selection with metadata
use_case_registry.json # 100 GTM use cases (MCP-unique task source)
archive/ # Pre-SDLC migration scripts (preserved for history)
scripts/ # Metrics extraction, evaluation, and operational tooling
ccb_metrics/ # Python package: models, extractors, discovery, judge context
generate_eval_report.py # CLI: deterministic evaluation report generator
aggregate_status.py # Core run scanner (status, errors, watch mode)
status_fingerprints.py # Error classification (12 regex patterns)
validate_tasks_preflight.py # Pre-flight task validation
validate_task_run.py # Post-run validation
check_infra.py # Infrastructure readiness checker
compare_configs.py # Cross-config divergence analysis
cost_report.py # Token/cost aggregation
sync_task_metadata.py # task.toml vs selection registry reconciliation
generate_manifest.py # Rebuild MANIFEST from on-disk results
archive_run.py # Archive old runs to save disk
rerun_failed.py # Generate rerun commands for failed tasks
abc_audit.py # ABC benchmark quality audit framework
abc_score_task.py # Per-task quality scoring
abc_criteria.py # ABC criteria data model (32 criteria)
docs_consistency_check.py # Documentation drift guard
tests/ # Unit tests for scripts/
test_abc_audit.py # Tests for ABC audit framework
test_abc_criteria.py # Tests for ABC criteria data model
test_abc_score_task.py # Tests for task quality scorer
test_extract_task_metrics.py # Tests for metrics extraction
docs/ # Operational documentation
CONFIGS.md # Compatibility stub (canonical matrix: reference/CONFIGS.md)
ERROR_CATALOG.md # Known error fingerprints, causes, fixes
QA_PROCESS.md # Quality assurance and validation pipeline
EVALUATION_PIPELINE.md # Unified eval: verifier → judge → statistics → report
TASK_CATALOG.md # Detailed per-task reference
TASK_SELECTION.md # Selection criteria, difficulty calibration, MCP scoring
SCORING_SEMANTICS.md # Reward and pass interpretation per benchmark
MCP_UNIQUE_TASKS.md # MCP-unique task system, authoring, oracle evaluation
MCP_UNIQUE_CALIBRATION.md # Oracle coverage analysis and threshold calibration
WORKFLOW_METRICS.md # Timing/cost metric definitions
AGENT_INTERFACE.md # Runtime I/O contract for agents
EXTENSIBILITY.md # Safe suite/task/config extension guide
LEADERBOARD.md # Ranking policy
SUBMISSION.md # Submission format specification
skills/ # AI agent skill definitions (operational runbooks)
ccb/ # CCB-specific: pre-run, monitoring, triage, analysis, maintenance
general/ # Reusable: workflow tools, agent delegation, dev practices
schemas/ # JSON schemas for MANIFEST.json, task.toml, etc.
Each suite directory contains per-task subdirectories with instruction.md, task.toml, tests/, and ground truth (or solution/). MCP-unique tasks additionally include task_spec.json, oracle_answer.json, and Dockerfile variants for baseline/MCP-only execution.
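A minimal layout check based on this description might look like the following. The file names come from this README; the helper itself is a hypothetical sketch, not part of the repo's tooling.

```python
from pathlib import Path

# Required files per the task layout described above. MCP-unique suites
# (ccb_mcp_*) additionally carry task_spec.json and oracle_answer.json.
COMMON = ["instruction.md", "task.toml"]
MCP_UNIQUE_EXTRA = ["task_spec.json", "oracle_answer.json"]

def missing_task_files(task_dir: Path) -> list[str]:
    """Return required files/dirs missing from a per-task directory."""
    required = list(COMMON)
    # The parent directory is the suite, e.g. benchmarks/ccb_fix/task-001.
    if task_dir.parent.name.startswith("ccb_mcp_"):
        required += MCP_UNIQUE_EXTRA
    missing = [name for name in required if not (task_dir / name).exists()]
    if not (task_dir / "tests").is_dir():
        missing.append("tests/")
    return missing
```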
The scripts/ directory contains a stdlib-only Python 3.10+ pipeline for extracting deterministic metrics from Harbor run output:
# Generate evaluation report from Harbor runs
python3 scripts/generate_eval_report.py \
--runs-dir /path/to/runs/official/ \
--output-dir ./eval_reports/
# Generate LLM judge context files
python3 -m scripts.ccb_metrics.judge_context \
--runs-dir /path/to/runs/official/ \
--benchmarks-dir ./benchmarks/ \
--output-dir ./judge_contexts/

The report generator produces:
- eval_report.json -- full structured report
- REPORT.md -- markdown tables (performance, efficiency, tool utilization)
- harness_configs.json -- exact harness configuration per run
- CSV files per table for downstream analysis
See python3 scripts/generate_eval_report.py --help for all options.
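Because each table is also emitted as a CSV, downstream analysis can start from plain stdlib tooling. This sketch only assumes that the generator writes `*.csv` files into the output directory; the specific file names are whatever it produces.

```python
import csv
from pathlib import Path

def csv_row_counts(report_dir: Path) -> dict[str, int]:
    """Count data rows (excluding the header) in each per-table CSV."""
    counts: dict[str, int] = {}
    for path in sorted(report_dir.glob("*.csv")):
        with path.open(newline="") as fh:
            rows = list(csv.reader(fh))
        counts[path.name] = max(len(rows) - 1, 0)  # subtract header row
    return counts
```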
To export GitHub-friendly official results (valid scored tasks only) with parsed trace summaries and a local browsing UI:
python3 scripts/export_official_results.py \
--runs-dir ./runs/official/ \
--output-dir ./docs/official_results/

This writes:
- docs/official_results/README.md -- run/config score summary
- docs/official_results/runs/*.md -- per-run task tables
- docs/official_results/tasks/*.md -- per-task metrics + parsed tool/trace view
- docs/official_results/data/official_results.json -- machine-readable dataset
- docs/official_results/audits/*.json -- per-task audit artifacts (checksums + parsed trace events)
- docs/official_results/traces/*/trajectory.json -- bundled raw trajectory traces for GitHub audit
- docs/official_results/index.html -- interactive local browser
Suite summaries are deduplicated to the latest result per
suite + config + task_name; full historical rows remain in
official_results.json under all_tasks.
For SDLC suites, export normalizes legacy config labels:
baseline -> baseline-local-direct, mcp -> mcp-remote-direct.
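The dedup-plus-normalization rule can be sketched as follows. The suite/config/task_name key and the legacy label mapping come from this README; the finished_at ordering field is a hypothetical name for whatever timestamp the export actually orders by.

```python
# Legacy SDLC config labels normalized during export (per this README).
LEGACY = {"baseline": "baseline-local-direct", "mcp": "mcp-remote-direct"}

def dedupe_latest(rows: list[dict]) -> list[dict]:
    """Keep only the latest row per (suite, config, task_name)."""
    latest: dict[tuple, dict] = {}
    for row in rows:
        config = LEGACY.get(row["config"], row["config"])
        row = {**row, "config": config}
        key = (row["suite"], config, row["task_name"])
        prev = latest.get(key)
        # ISO-8601 timestamps compare correctly as strings.
        if prev is None or row["finished_at"] > prev["finished_at"]:
            latest[key] = row
    return list(latest.values())
```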
Serve locally:
python3 scripts/export_official_results.py --serve

For the full multi-layer evaluation pipeline (verifier, LLM judge, statistical analysis, dual-score reporting), see docs/EVALUATION_PIPELINE.md.
This section assumes Harbor is already installed and configured. If not, start with the Quickstart section above and python3 scripts/check_infra.py.
The unified runner executes all 170 SDLC tasks across the 2-config matrix:
# Run all 170 SDLC tasks across 2 configs
bash configs/run_selected_tasks.sh
# Run only the baseline config
bash configs/run_selected_tasks.sh --baseline-only
# Run a single SDLC phase
bash configs/run_selected_tasks.sh --benchmark ccb_fix
# Dry run to list tasks without executing
bash configs/run_selected_tasks.sh --dry-run

Per-phase runners are also available:
bash configs/fix_2config.sh # 25 Bug Repair tasks
bash configs/build_2config.sh # 25 Feature & Refactoring tasks
bash configs/understand_2config.sh # 20 Requirements & Discovery tasks
bash configs/design_2config.sh # 20 Architecture & Design tasks
bash configs/debug_2config.sh # 20 Debugging & Investigation tasks
bash configs/secure_2config.sh # 20 Security & Compliance tasks
bash configs/test_2config.sh # 20 Testing & QA tasks
bash configs/document_2config.sh # 20 Documentation tasks

MCP-unique tasks use a separate selection file:
# Run all MCP-unique tasks across 2 configs
bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json
# Filter by use-case category
bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json --benchmark ccb_mcp_security
# Dry run
bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json --dry-run

All runners support --baseline-only, --full-only, --task TASK_ID, and --parallel N flags.
CodeContextBench includes a multi-stage QA pipeline to ensure task integrity, reproducible runs, and accurate scoring.
| Phase | Script | Purpose |
|---|---|---|
| Pre-flight | scripts/validate_tasks_preflight.py | Catches truncated instructions, template placeholders, language/difficulty mismatches, missing test.sh |
| Infra check | scripts/check_infra.py | Verifies OAuth tokens (all accounts), Docker, disk space, Harbor CLI |
| Error fingerprinting | scripts/status_fingerprints.py | Classifies failures with 12 regex patterns; auto-retry guidance per pattern |
| Post-run | scripts/validate_task_run.py | Flags crashes, MCP tool usage anomalies, suspicious scoring |
| Metadata sync | scripts/sync_task_metadata.py | Keeps task.toml in sync with selected_benchmark_tasks.json; --fix to auto-update |
| Run analysis | scripts/aggregate_status.py | Scans run dirs, classifies per-task status, writes status.json, supports --watch mode |
The QA methodology uses a 6-dimension audit framework: instruction contamination, reproducibility, verifier correctness, ghost/false-positive detection, error misclassification, and tool effectiveness analysis.
See docs/QA_PROCESS.md for the full pipeline documentation and docs/ERROR_CATALOG.md for the known error catalog.
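In the spirit of scripts/status_fingerprints.py, fingerprint classification reduces to first-match regex scanning over failure logs. The two patterns and log lines below are made-up stand-ins for the real 12-pattern catalog.

```python
import re

# Ordered (fingerprint, pattern) pairs; the first match wins.
# These examples are illustrative, not the repo's actual patterns.
FINGERPRINTS = [
    ("timeout", re.compile(r"timed? ?out|deadline exceeded", re.I)),
    ("docker", re.compile(r"docker|container", re.I)),
]

def classify(log_text: str) -> str:
    """Return the first matching fingerprint name, else 'unclassified'."""
    for name, pattern in FINGERPRINTS:
        if pattern.search(log_text):
            return name
    return "unclassified"

print(classify("Task timed out after 3600s"))  # timeout
```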
Key scripts organized by workflow phase:
| Phase | Script | Usage |
|---|---|---|
| Pre-run | validate_tasks_preflight.py | python3 scripts/validate_tasks_preflight.py [--suite ccb_fix] [--task sgt-001] |
| Pre-run | check_infra.py | python3 scripts/check_infra.py |
| During run | aggregate_status.py --since 2h | python3 scripts/aggregate_status.py --since 2h |
| Post-run | aggregate_status.py | python3 scripts/aggregate_status.py [--watch] |
| Post-run | validate_task_run.py | python3 scripts/validate_task_run.py <run_dir> |
| Analysis | compare_configs.py | python3 scripts/compare_configs.py |
| Analysis | cost_report.py | python3 scripts/cost_report.py |
| Analysis | generate_manifest.py | python3 scripts/generate_manifest.py |
| Maintenance | sync_task_metadata.py | python3 scripts/sync_task_metadata.py [--fix] |
| Maintenance | archive_run.py | python3 scripts/archive_run.py <run_dir> [--compress] |
| Maintenance | rerun_failed.py | python3 scripts/rerun_failed.py [--fingerprint timeout] [--suite ccb_fix] |
The skills/ directory contains structured runbooks for AI coding agents operating on this repository. These encode operational workflows — infrastructure checks, task validation, failure triage, report generation — so any agent (Claude Code, Cursor, Copilot, etc.) can follow them autonomously.
| Category | Skills | Description |
|---|---|---|
| CCB Operations | 20 skills in 6 files | Pre-run checks, monitoring, triage, analysis, maintenance, task authoring |
| General Purpose | 11 skills in 4 files | Session management, agent delegation, search patterns, dev practices |
Skills are plain markdown and tool-agnostic. See skills/README.md for the full index and integration guides for Cursor, Claude Code, and other agents. See docs/SKILLS.md for background on the skills system.
See LICENSE.