🌌 Model Benchmarking

A cybersecurity benchmarking framework for model evaluation. The repository provides a taxonomy-aligned evaluation pipeline, adapter-based suite orchestration, and a compatibility layer for the current public API.

Engineering Overview

This repository separates the public CLI and compatibility wrappers from the core runtime:

CLI boundary: src/model_benchmarking_cli/ contains the public command surface and adapter wiring.
Runtime core: src/runtime/ holds providers, suites, taxonomy, and evaluation logic.
Compatibility layer: src/mcp/ remains as a transitional import surface for existing callers and tests.
Repo guidance: .agents/ contains workflow and steering notes for contributors and agents.

🚀 Quick Start

1️⃣ Installation

# Clone and setup environment
git clone https://github.com/bannff/model-benchmarking.git
cd model-benchmarking
python3 -m venv .venv && source .venv/bin/activate

# Install the package in editable mode
pip install -e "."

2️⃣ Run Your First Eval

# Run a pipeline dry run using the mock provider
mbenchmark pipeline --dry-run
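
To illustrate why a dry run against a mock provider is a useful first step, here is a minimal sketch of the idea. The `Provider` protocol, `MockProvider` class, and `dry_run` helper are hypothetical names invented for this example. They are not taken from `src/runtime/providers/`, whose actual interface may look quite different.

```python
from dataclasses import dataclass
from typing import Protocol


class Provider(Protocol):
    """Hypothetical provider interface; the real one may differ."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class MockProvider:
    """Returns a canned answer, so no network access or API key is needed."""

    canned_answer: str = "MOCK"

    def complete(self, prompt: str) -> str:
        return self.canned_answer


def dry_run(provider: Provider, prompts: list[str]) -> list[str]:
    # Send every prompt through the provider and collect the raw answers.
    return [provider.complete(p) for p in prompts]


answers = dry_run(MockProvider(), ["What is XSS?", "Define CSRF."])
print(answers)
```

Because the mock is deterministic, a run like this exercises the wiring between components without calling a real model.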

🏗️ Architecture

The pipeline evaluates target models by orchestrating benchmark suites, which draw on the cybersecurity taxonomy and call model providers.

graph TD
    CLI[src/model_benchmarking_cli/cli/] --> PIPELINE[src/model_benchmarking_cli/pipeline.py]
    PIPELINE -- Async Orchestration --> SUITES[src/runtime/suites/]
    
    SUITES --> TAXONOMY[src/runtime/taxonomy/]
    SUITES --> PROVIDERS[src/runtime/providers/]
    
    TAXONOMY --> REGISTRY[(Registry Configs)]
    PROVIDERS --> TARGETS((Target LLMs))
    
    style CLI fill:#a2d2ff,stroke:#111,stroke-width:2px,color:#000
    style PIPELINE fill:#bde0fe,stroke:#111,stroke-width:2px,color:#000
    style TAXONOMY fill:#ffc8dd,stroke:#111,stroke-width:2px,color:#000
    style SUITES fill:#ffafcc,stroke:#111,stroke-width:2px,color:#000
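
The "Async Orchestration" edge in the diagram can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `src/model_benchmarking_cli/pipeline.py`: `mock_provider`, `run_suite`, and `run_pipeline` are hypothetical names, and the taxonomy lookup is left out for brevity.

```python
import asyncio


async def mock_provider(prompt: str) -> str:
    # Stand-in for a call to a target LLM; hypothetical, no network involved.
    await asyncio.sleep(0)
    return f"answer to: {prompt}"


async def run_suite(name: str, prompts: list[str]) -> dict:
    # A suite sends its prompts to the provider concurrently and gathers the replies.
    replies = await asyncio.gather(*(mock_provider(p) for p in prompts))
    return {"suite": name, "replies": list(replies)}


async def run_pipeline(suites: dict[str, list[str]]) -> list[dict]:
    # The pipeline fans out over suites; gather returns results in input order.
    return await asyncio.gather(*(run_suite(n, p) for n, p in suites.items()))


results = asyncio.run(run_pipeline({"cs-eval": ["q1", "q2"], "cybergym": ["m1"]}))
print([r["suite"] for r in results])
```

Using `asyncio.gather` at both levels keeps results in a predictable order while letting slow provider calls overlap.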

📊 Benchmark Suites

📝 CS-Eval (Security Knowledge)

Comprehensive Q&A benchmarks testing theoretical foundations and security reasoning.
mbenchmark run --suite cs-eval

🐛 CVE-Bench (Exploit Dev)

Real-world vulnerability exploitation challenges in isolated container environments.
mbenchmark run --suite cve-bench

🏟️ CyberGym (Interactive Scenarios)

Multi-step, interactive security missions testing agentic decision-making.
mbenchmark run --suite cybergym
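
To run all three suites from a script, the commands above can be generated programmatically. The sketch below only builds and prints the command lines. It assumes nothing about `mbenchmark` beyond the `run --suite <name>` form shown in this section, and `suite_command` is a hypothetical helper.

```python
import shlex

SUITES = ["cs-eval", "cve-bench", "cybergym"]  # suite names as listed above


def suite_command(suite: str) -> list[str]:
    # Build the argv list for one suite using the `run --suite` form shown above.
    return ["mbenchmark", "run", "--suite", suite]


commands = [suite_command(s) for s in SUITES]
for argv in commands:
    # Print only; pass argv to subprocess.run(argv, check=True) to execute it.
    print(shlex.join(argv))
```

Keeping each command as an argv list avoids shell quoting issues if the lists are later passed to `subprocess.run`.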

🧭 Navigation

🛠 CLI Reference — Master the mbenchmark command.
📂 Taxonomy Protocol — Explore the built-in security hierarchy and dynamic YAML mappers.
🤖 Agent Onboarding — Core setup guide for integrated AI collaborators.
🧭 Steering Documentation — Advanced agent blueprints, skills, and <200 LOC refactoring recipes.

🛡 Safety & Disclosure

This repository contains intentionally vulnerable code and exploit patterns for research and evaluation purposes.
NEVER run these benchmarks against production systems. Always use the provided Docker isolation primitives.

📜 License

This project is licensed under the Business Source License 1.1. See LICENSE for the full text.
Model-Benchmarking by Bannff.
