CodeContextBench

Benchmark suite for evaluating how AI coding agents leverage external context tools on software engineering tasks across the SDLC. Developed as the reproducibility artifact for the paper "CodeContextBench: A Systematic Evaluation Framework for Assessing the Impact of Enhanced Code Intelligence on AI Coding Agent Performance."

This repository contains benchmark task definitions, evaluation configs, and a metrics extraction pipeline. Tasks are executed via the Harbor runner with the Claude Code agent harness.


Quickstart (Public / First-Time Users)

Who this repo is for

  • Researchers evaluating coding agents on realistic software engineering tasks
  • Practitioners comparing baseline vs MCP-enabled agent configurations
  • Contributors authoring new benchmark tasks or extending evaluation tooling

What you can do without Harbor

You can inspect task definitions, run validation and analysis scripts, and use the metrics/report pipeline on existing Harbor run outputs.

git clone https://github.com/sourcegraph/CodeContextBench.git
cd CodeContextBench

# Fast repo sanity check (docs/config refs)
python3 scripts/repo_health.py --quick

# Explore task-based docs navigation
sed -n '1,120p' docs/START_HERE_BY_TASK.md

# Inspect available benchmark suites
ls benchmarks

What requires Harbor (benchmark execution)

Running benchmark tasks requires:

  • Harbor installed and configured
  • Docker
  • Valid agent/runtime credentials used by your Harbor setup
  • A Max subscription (for the default harness path documented in this repo)

Recommended pre-run checks:

python3 scripts/check_infra.py
python3 scripts/validate_tasks_preflight.py --all

Then start with a dry run:

bash configs/run_selected_tasks.sh --dry-run

First places to read

  • docs/START_HERE_BY_TASK.md for task-oriented navigation
  • docs/reference/CONFIGS.md for the 2-config evaluation matrix
  • docs/EVALUATION_PIPELINE.md for scoring and reporting outputs
  • docs/REPO_HEALTH.md for the pre-push health gate

Benchmark Suites (SDLC-Aligned)

Eight suites organized by software development lifecycle phase:

| Suite | SDLC Phase | Tasks | Description |
| --- | --- | --- | --- |
| ccb_understand | Requirements & Discovery | 20 | Codebase comprehension, onboarding, Q&A, knowledge recovery |
| ccb_design | Architecture & Design | 20 | Architecture analysis, dependency graphs, change impact |
| ccb_fix | Bug Repair | 25 | Diagnosing and fixing real bugs across production codebases |
| ccb_build | Feature & Refactoring | 25 | New features, refactoring, dependency management |
| ccb_test | Testing & QA | 20 | Code review, performance testing, code search validation, test generation |
| ccb_document | Documentation | 20 | API references, architecture docs, migration guides, runbooks |
| ccb_secure | Security & Compliance | 20 | CVE analysis, reachability, governance, access control |
| ccb_debug | Debugging & Investigation | 20 | Root cause tracing, fault localization, provenance |
| **Total** | | **170** | |

MCP-Unique Suites (Org-Scale Context Retrieval)

Eleven additional suites measure cross-repo discovery, symbol resolution, dependency tracing, and deep-search-driven investigation in polyrepo environments.

| Suite | Category | Tasks | Description |
| --- | --- | --- | --- |
| ccb_mcp_crossrepo_tracing | A: Dependency Tracing | 9 | Cross-repo dependency chains, blast radius, symbol resolution |
| ccb_mcp_security | B: Vulnerability Remediation | 10 | CVE mapping, missing auth middleware across repos |
| ccb_mcp_migration | C: Framework Migration | 7 | API migrations, breaking changes across repos |
| ccb_mcp_incident | D: Incident Debugging | 11 | Error-to-code-path tracing across microservices |
| ccb_mcp_onboarding | E: Onboarding & Comprehension | 11 | API consumption mapping, end-to-end flow, architecture maps |
| ccb_mcp_compliance | F: Compliance | 7 | Standards adherence, audit, and provenance workflows |
| ccb_mcp_crossorg | G: Cross-Org Discovery | 5 | Interface implementations and authoritative repo identification across orgs |
| ccb_mcp_domain | H: Domain Lineage | 10 | Config propagation, architecture patterns, domain analysis |
| ccb_mcp_org | I: Organizational Context | 5 | Agentic discovery, org-wide coding correctness |
| ccb_mcp_platform | J: Platform Knowledge | 5 | Service template discovery and tribal knowledge |
| ccb_mcp_crossrepo | Legacy | 1 | Cross-repo discovery (compatibility) |
| **Total** | | **81** | |

Combined catalog total: 251 tasks (170 SDLC + 81 MCP-unique). Of these, 212 are fully paired (baseline + MCP results) in official runs; the remaining 39 MCP-unique tasks have MCP results but are missing baselines.

Both baseline and MCP-Full agents have access to all repos in each task's fixture. The only difference is the access method: the baseline reads code locally, while MCP-Full uses Sourcegraph MCP tools (the local copy is truncated). This ensures we measure whether MCP tools help agents work better, not whether MCP can access repos the baseline cannot.
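Because every task is run under both configurations, results can be compared as matched pairs. A minimal sketch of that paired comparison (task IDs and rewards below are hypothetical; the real analysis lives in scripts/compare_configs.py):

```python
# Hypothetical per-task rewards for the same task set under both configs.
baseline = {"ccb_fix/task-01": 0.0, "ccb_fix/task-02": 1.0, "ccb_build/task-03": 0.5}
mcp_full = {"ccb_fix/task-01": 1.0, "ccb_fix/task-02": 1.0, "ccb_build/task-03": 0.0}

# Pair on task ID so each task contributes exactly one delta.
paired = sorted(set(baseline) & set(mcp_full))
deltas = {t: mcp_full[t] - baseline[t] for t in paired}
mean_delta = sum(deltas.values()) / len(deltas)
print(mean_delta)  # average per-task change of MCP-Full relative to baseline
```

Pairing on task ID before averaging is what keeps the comparison about the access method rather than about which tasks happened to be run.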

See docs/MCP_UNIQUE_TASKS.md for the full task system, authoring guide, and oracle evaluation framework. See docs/MCP_UNIQUE_CALIBRATION.md for oracle coverage analysis.


2-Config Evaluation Matrix

All benchmarks are evaluated across two paper-level configurations (Baseline vs MCP-Full). The concrete run config names differ by task type:

  • SDLC suites (ccb_build, ccb_fix, etc.): baseline-local-direct + mcp-remote-direct
  • MCP-unique suites (ccb_mcp_*): baseline-local-artifact + mcp-remote-artifact

Legacy run directory names (baseline, sourcegraph_full, artifact_full) may still appear in historical outputs and are handled by analysis scripts.

At the paper level, the distinction is still:

| Paper Config Name | Internal MCP Mode | MCP Tools Available |
| --- | --- | --- |
| Baseline | none | None (agent uses only built-in tools) |
| MCP-Full | sourcegraph_full / artifact_full (task-dependent) | All 13 Sourcegraph MCP tools, including sg_deepsearch and sg_deepsearch_read |

See docs/reference/CONFIGS.md for the canonical configuration matrix and tool-by-tool breakdown. (docs/CONFIGS.md is a compatibility stub.)
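The mapping from paper-level configuration to concrete run config name can be sketched as a small lookup (names taken from the lists above; this is illustrative, not an interface of the runner scripts):

```python
# Paper-level config -> concrete run config name, keyed by task type.
RUN_CONFIGS = {
    ("sdlc", "baseline"): "baseline-local-direct",
    ("sdlc", "mcp-full"): "mcp-remote-direct",
    ("mcp_unique", "baseline"): "baseline-local-artifact",
    ("mcp_unique", "mcp-full"): "mcp-remote-artifact",
}

def run_config(suite: str, paper_config: str) -> str:
    """Resolve the run config name for a suite; ccb_mcp_* suites are MCP-unique."""
    task_type = "mcp_unique" if suite.startswith("ccb_mcp_") else "sdlc"
    return RUN_CONFIGS[(task_type, paper_config)]

print(run_config("ccb_fix", "baseline"))           # baseline-local-direct
print(run_config("ccb_mcp_security", "mcp-full"))  # mcp-remote-artifact
```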


Repository Structure

benchmarks/              # Task definitions organized by SDLC phase + MCP-unique
  ccb_build/             #   Feature & Refactoring (25 tasks)
  ccb_debug/             #   Debugging & Investigation (20 tasks)
  ccb_design/            #   Architecture & Design (20 tasks)
  ccb_document/          #   Documentation (20 tasks)
  ccb_fix/               #   Bug Repair (25 tasks)
  ccb_secure/            #   Security & Compliance (20 tasks)
  ccb_test/              #   Testing & QA (20 tasks)
  ccb_understand/        #   Requirements & Discovery (20 tasks)
  ccb_mcp_compliance/    #   MCP-unique: compliance & audit (7 tasks)
  ccb_mcp_crossorg/      #   MCP-unique: cross-org discovery (5 tasks)
  ccb_mcp_crossrepo/     #   MCP-unique: legacy cross-repo (1 task)
  ccb_mcp_crossrepo_tracing/  #   MCP-unique: dependency tracing (9 tasks)
  ccb_mcp_domain/        #   MCP-unique: domain lineage (10 tasks)
  ccb_mcp_incident/      #   MCP-unique: incident debugging (11 tasks)
  ccb_mcp_migration/     #   MCP-unique: framework migration (7 tasks)
  ccb_mcp_onboarding/    #   MCP-unique: onboarding (11 tasks)
  ccb_mcp_org/           #   MCP-unique: org context (5 tasks)
  ccb_mcp_platform/      #   MCP-unique: platform knowledge (5 tasks)
  ccb_mcp_security/      #   MCP-unique: vulnerability remediation (10 tasks)
configs/                 # Run configs and task selection
  _common.sh             #   Shared infra: token refresh, parallel execution, multi-account
  sdlc_suite_2config.sh  #   Generic SDLC runner (used by phase wrappers below)
  build_2config.sh       #   Phase wrapper: Build (25 tasks)
  debug_2config.sh       #   Phase wrapper: Debug (20 tasks)
  design_2config.sh      #   Phase wrapper: Design (20 tasks)
  document_2config.sh    #   Phase wrapper: Document (20 tasks)
  fix_2config.sh         #   Phase wrapper: Fix (25 tasks)
  secure_2config.sh      #   Phase wrapper: Secure (20 tasks)
  test_2config.sh        #   Phase wrapper: Test (20 tasks)
  run_selected_tasks.sh  #   Unified runner for all tasks
  validate_one_per_benchmark.sh  # Pre-flight smoke test (1 task per suite)
  selected_benchmark_tasks.json  # Canonical SDLC task selection with metadata
  selected_mcp_unique_tasks.json # MCP-unique task selection with metadata
  use_case_registry.json #   100 GTM use cases (MCP-unique task source)
  archive/               #   Pre-SDLC migration scripts (preserved for history)
scripts/                 # Metrics extraction, evaluation, and operational tooling
  ccb_metrics/           #   Python package: models, extractors, discovery, judge context
  generate_eval_report.py  # CLI: deterministic evaluation report generator
  aggregate_status.py    #   Core run scanner (status, errors, watch mode)
  status_fingerprints.py #   Error classification (12 regex patterns)
  validate_tasks_preflight.py # Pre-flight task validation
  validate_task_run.py   #   Post-run validation
  check_infra.py         #   Infrastructure readiness checker
  compare_configs.py     #   Cross-config divergence analysis
  cost_report.py         #   Token/cost aggregation
  sync_task_metadata.py  #   task.toml vs selection registry reconciliation
  generate_manifest.py   #   Rebuild MANIFEST from on-disk results
  archive_run.py         #   Archive old runs to save disk
  rerun_failed.py        #   Generate rerun commands for failed tasks
  abc_audit.py           #   ABC benchmark quality audit framework
  abc_score_task.py      #   Per-task quality scoring
  abc_criteria.py        #   ABC criteria data model (32 criteria)
  docs_consistency_check.py # Documentation drift guard
tests/                   # Unit tests for scripts/
  test_abc_audit.py      #   Tests for ABC audit framework
  test_abc_criteria.py   #   Tests for ABC criteria data model
  test_abc_score_task.py #   Tests for task quality scorer
  test_extract_task_metrics.py # Tests for metrics extraction
docs/                    # Operational documentation
  CONFIGS.md             #   2-config tool breakdown
  ERROR_CATALOG.md       #   Known error fingerprints, causes, fixes
  QA_PROCESS.md          #   Quality assurance and validation pipeline
  EVALUATION_PIPELINE.md #   Unified eval: verifier → judge → statistics → report
  TASK_CATALOG.md        #   Detailed per-task reference
  TASK_SELECTION.md      #   Selection criteria, difficulty calibration, MCP scoring
  SCORING_SEMANTICS.md   #   Reward and pass interpretation per benchmark
  MCP_UNIQUE_TASKS.md    #   MCP-unique task system, authoring, oracle evaluation
  MCP_UNIQUE_CALIBRATION.md # Oracle coverage analysis and threshold calibration
  WORKFLOW_METRICS.md    #   Timing/cost metric definitions
  AGENT_INTERFACE.md     #   Runtime I/O contract for agents
  EXTENSIBILITY.md       #   Safe suite/task/config extension guide
  LEADERBOARD.md         #   Ranking policy
  SUBMISSION.md          #   Submission format specification
skills/                  # AI agent skill definitions (operational runbooks)
  ccb/                   #   CCB-specific: pre-run, monitoring, triage, analysis, maintenance
  general/               #   Reusable: workflow tools, agent delegation, dev practices
schemas/                 # JSON schemas for MANIFEST.json, task.toml, etc.

Each suite directory contains per-task subdirectories with instruction.md, task.toml, tests/, and ground truth (or solution/). MCP-unique tasks additionally include task_spec.json, oracle_answer.json, and Dockerfile variants for baseline/MCP-only execution.
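The layout above can be checked with a few lines of stdlib Python. The file names come from the paragraph above; the authoritative on-disk requirements are enforced by scripts/validate_tasks_preflight.py, so treat this as a sketch:

```python
from pathlib import Path
import tempfile

REQUIRED = ["instruction.md", "task.toml"]
MCP_UNIQUE_EXTRA = ["task_spec.json", "oracle_answer.json"]

def missing_files(task_dir: Path, mcp_unique: bool = False) -> list[str]:
    """Return the required files absent from a task directory."""
    required = REQUIRED + (MCP_UNIQUE_EXTRA if mcp_unique else [])
    return [name for name in required if not (task_dir / name).exists()]

# Against an empty directory, every required file is reported missing.
with tempfile.TemporaryDirectory() as d:
    print(missing_files(Path(d), mcp_unique=True))
```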


Metrics Extraction Pipeline

The scripts/ directory contains a stdlib-only Python 3.10+ pipeline for extracting deterministic metrics from Harbor run output:

# Generate evaluation report from Harbor runs
python3 scripts/generate_eval_report.py \
  --runs-dir /path/to/runs/official/ \
  --output-dir ./eval_reports/

# Generate LLM judge context files
python3 -m scripts.ccb_metrics.judge_context \
  --runs-dir /path/to/runs/official/ \
  --benchmarks-dir ./benchmarks/ \
  --output-dir ./judge_contexts/

The report generator produces:

  • eval_report.json -- full structured report
  • REPORT.md -- markdown tables (performance, efficiency, tool utilization)
  • harness_configs.json -- exact harness configuration per run
  • CSV files per table for downstream analysis

See python3 scripts/generate_eval_report.py --help for all options.
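Since eval_report.json is plain JSON, downstream analysis needs nothing beyond the stdlib. The schema is owned by generate_eval_report.py and may change, so this sketch just loads the file and inspects its top-level shape (a tiny stand-in report is written first so the snippet runs anywhere; the key names in it are hypothetical):

```python
import json
import tempfile
from pathlib import Path

# Stand-in for an eval_report.json produced by generate_eval_report.py.
with tempfile.TemporaryDirectory() as d:
    report_path = Path(d) / "eval_report.json"
    report_path.write_text(json.dumps({"runs": [], "generated_by": "stand-in"}))

    report = json.loads(report_path.read_text())
    top_level_keys = sorted(report)
    print(top_level_keys)  # inspect before relying on any particular key
```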

Publishable Official Results + Trace Browser

To export GitHub-friendly official results (valid scored tasks only) with parsed trace summaries and local browsing UI:

python3 scripts/export_official_results.py \
  --runs-dir ./runs/official/ \
  --output-dir ./docs/official_results/

This writes:

  • docs/official_results/README.md -- run/config score summary
  • docs/official_results/runs/*.md -- per-run task tables
  • docs/official_results/tasks/*.md -- per-task metrics + parsed tool/trace view
  • docs/official_results/data/official_results.json -- machine-readable dataset
  • docs/official_results/audits/*.json -- per-task audit artifacts (checksums + parsed trace events)
  • docs/official_results/traces/*/trajectory.json -- bundled raw trajectory traces for GitHub audit
  • docs/official_results/index.html -- interactive local browser

Suite summaries are deduplicated to the latest result per suite + config + task_name; full historical rows remain in official_results.json under all_tasks. For SDLC suites, export normalizes legacy config labels: baseline -> baseline-local-direct, mcp -> mcp-remote-direct.
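The deduplication and label normalization described above can be sketched as follows (record fields here are hypothetical; the authoritative logic lives in scripts/export_official_results.py):

```python
# Legacy SDLC config labels normalized on export, per the mapping above.
LEGACY_LABELS = {"baseline": "baseline-local-direct", "mcp": "mcp-remote-direct"}

def summarize(rows):
    """Keep only the latest row per (suite, config, task_name), normalizing labels."""
    latest = {}
    for row in rows:
        config = LEGACY_LABELS.get(row["config"], row["config"])
        key = (row["suite"], config, row["task_name"])
        if key not in latest or row["timestamp"] > latest[key]["timestamp"]:
            latest[key] = {**row, "config": config}
    return list(latest.values())

rows = [
    {"suite": "ccb_fix", "config": "baseline", "task_name": "t1", "timestamp": 1, "reward": 0.0},
    {"suite": "ccb_fix", "config": "baseline-local-direct", "task_name": "t1", "timestamp": 2, "reward": 1.0},
]
print(summarize(rows))  # one row: the later result, under the normalized label
```

Note that the legacy label and its normalized form collapse to the same key, which is exactly why normalization must happen before deduplication.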

Serve locally:

python3 scripts/export_official_results.py --serve

For the full multi-layer evaluation pipeline (verifier, LLM judge, statistical analysis, dual-score reporting), see docs/EVALUATION_PIPELINE.md.


Running with Harbor

This section assumes Harbor is already installed and configured. If not, start with the Quickstart section above and python3 scripts/check_infra.py.

SDLC Tasks

The unified runner executes all 170 SDLC tasks across the 2-config matrix:

# Run all 170 SDLC tasks across 2 configs
bash configs/run_selected_tasks.sh

# Run only the baseline config
bash configs/run_selected_tasks.sh --baseline-only

# Run a single SDLC phase
bash configs/run_selected_tasks.sh --benchmark ccb_fix

# Dry run to list tasks without executing
bash configs/run_selected_tasks.sh --dry-run

Per-phase runners are also available:

bash configs/fix_2config.sh              # 25 Bug Repair tasks
bash configs/build_2config.sh            # 25 Feature & Refactoring tasks
bash configs/understand_2config.sh       # 20 Requirements & Discovery tasks
bash configs/design_2config.sh           # 20 Architecture & Design tasks
bash configs/debug_2config.sh            # 20 Debugging & Investigation tasks
bash configs/secure_2config.sh           # 20 Security & Compliance tasks
bash configs/test_2config.sh             # 20 Testing & QA tasks
bash configs/document_2config.sh         # 20 Documentation tasks

MCP-Unique Tasks

MCP-unique tasks use a separate selection file:

# Run all MCP-unique tasks across 2 configs
bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json

# Filter by use-case category
bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json --benchmark ccb_mcp_security

# Dry run
bash configs/run_selected_tasks.sh --selection-file configs/selected_mcp_unique_tasks.json --dry-run

All runners support --baseline-only, --full-only, --task TASK_ID, and --parallel N flags.


Quality Assurance & Validation

CodeContextBench includes a multi-stage QA pipeline to ensure task integrity, reproducible runs, and accurate scoring.

| Phase | Script | Purpose |
| --- | --- | --- |
| Pre-flight | scripts/validate_tasks_preflight.py | Catches truncated instructions, template placeholders, language/difficulty mismatches, missing test.sh |
| Infra check | scripts/check_infra.py | Verifies OAuth tokens (all accounts), Docker, disk space, Harbor CLI |
| Error fingerprinting | scripts/status_fingerprints.py | Classifies failures with 12 regex patterns; auto-retry guidance per pattern |
| Post-run | scripts/validate_task_run.py | Flags crashes, MCP tool usage anomalies, suspicious scoring |
| Metadata sync | scripts/sync_task_metadata.py | Keeps task.toml in sync with selected_benchmark_tasks.json; --fix to auto-update |
| Run analysis | scripts/aggregate_status.py | Scans run dirs, classifies per-task status, writes status.json, supports --watch mode |
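Error fingerprinting amounts to matching run logs against an ordered list of regexes, each paired with retry guidance. A minimal sketch, with two hypothetical patterns (the shipped catalog of 12 lives in scripts/status_fingerprints.py and docs/ERROR_CATALOG.md):

```python
import re

# Hypothetical patterns for illustration only; the real classifier ships 12.
FINGERPRINTS = [
    ("timeout", re.compile(r"deadline exceeded|timed out", re.I), "safe to auto-retry"),
    ("oauth", re.compile(r"401 unauthorized|token expired", re.I), "refresh credentials first"),
]

def classify(log_text: str):
    """Return (fingerprint_name, retry_guidance) for the first matching pattern."""
    for name, pattern, guidance in FINGERPRINTS:
        if pattern.search(log_text):
            return name, guidance
    return "unclassified", "inspect manually"

print(classify("agent run timed out after 3600s"))  # ('timeout', 'safe to auto-retry')
print(classify("HTTP 401 Unauthorized from API"))   # ('oauth', 'refresh credentials first')
```

Ordering matters when a log could match several patterns: the first match wins, so more specific fingerprints belong earlier in the list.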

The QA methodology uses a 6-dimension audit framework: instruction contamination, reproducibility, verifier correctness, ghost/false-positive detection, error misclassification, and tool effectiveness analysis.

See docs/QA_PROCESS.md for the full pipeline documentation and docs/ERROR_CATALOG.md for the known error catalog.


Operational Tooling

Key scripts organized by workflow phase:

| Phase | Script | Usage |
| --- | --- | --- |
| Pre-run | validate_tasks_preflight.py | python3 scripts/validate_tasks_preflight.py [--suite ccb_pytorch] [--task sgt-001] |
| Pre-run | check_infra.py | python3 scripts/check_infra.py |
| During run | aggregate_status.py | python3 scripts/aggregate_status.py --since 2h |
| Post-run | aggregate_status.py | python3 scripts/aggregate_status.py [--watch] |
| Post-run | validate_task_run.py | python3 scripts/validate_task_run.py <run_dir> |
| Analysis | compare_configs.py | python3 scripts/compare_configs.py |
| Analysis | cost_report.py | python3 scripts/cost_report.py |
| Analysis | generate_manifest.py | python3 scripts/generate_manifest.py |
| Maintenance | sync_task_metadata.py | python3 scripts/sync_task_metadata.py [--fix] |
| Maintenance | archive_run.py | python3 scripts/archive_run.py <run_dir> [--compress] |
| Maintenance | rerun_failed.py | python3 scripts/rerun_failed.py [--fingerprint timeout] [--suite ccb_pytorch] |

AI Agent Skills

The skills/ directory contains structured runbooks for AI coding agents operating on this repository. These encode operational workflows — infrastructure checks, task validation, failure triage, report generation — so any agent (Claude Code, Cursor, Copilot, etc.) can follow them autonomously.

| Category | Skills | Description |
| --- | --- | --- |
| CCB Operations | 20 skills in 6 files | Pre-run checks, monitoring, triage, analysis, maintenance, task authoring |
| General Purpose | 11 skills in 4 files | Session management, agent delegation, search patterns, dev practices |

Skills are plain markdown and tool-agnostic. See skills/README.md for the full index and integration guides for Cursor, Claude Code, and other agents. See docs/SKILLS.md for background on the skills system.


License

See LICENSE.
