A cybersecurity benchmarking framework for model evaluation. The repository provides a taxonomy-aligned evaluation pipeline, adapter-based suite orchestration, and a compatibility layer for the current public API.
This repository separates the public CLI and compatibility wrappers from the core runtime:
- CLI boundary: src/model_benchmarking_cli/ contains the public command surface and adapter wiring.
- Runtime core: src/runtime/ holds providers, suites, taxonomy, and evaluation logic.
- Compatibility layer: src/mcp/ remains as a transitional import surface for existing callers and tests.
- Repo guidance: .agents/ contains workflow and steering notes for contributors and agents.
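To illustrate what a transitional import surface can look like, here is a minimal, self-contained sketch. It is not the code in src/mcp/: the module names `runtime_core_demo` and `legacy_compat_demo` and the function `evaluate` are hypothetical stand-ins. The idea is that the old import path keeps working but forwards to the new location and emits a deprecation warning.

```python
import sys
import types
import warnings

# Hypothetical stand-in for a module that lives in the runtime core.
core = types.ModuleType("runtime_core_demo")


def evaluate(sample: str) -> str:
    """Placeholder for real evaluation logic."""
    return f"evaluated:{sample}"


core.evaluate = evaluate
sys.modules[core.__name__] = core

# Hypothetical compatibility module: it defines nothing itself and forwards
# attribute lookups to the core module, warning on each forwarded access.
compat = types.ModuleType("legacy_compat_demo")


def _forward(name: str):
    # Dunder probes made by the import system and unknown names fail normally.
    if name.startswith("__") or not hasattr(core, name):
        raise AttributeError(f"module 'legacy_compat_demo' has no attribute {name!r}")
    warnings.warn(
        f"legacy_compat_demo.{name} is a transitional alias; "
        f"import it from runtime_core_demo instead",
        DeprecationWarning,
        stacklevel=2,
    )
    return getattr(core, name)


compat.__getattr__ = _forward  # module-level __getattr__ (PEP 562)
sys.modules[compat.__name__] = compat

if __name__ == "__main__":
    from legacy_compat_demo import evaluate as legacy_evaluate

    print(legacy_evaluate("sample"))
```

Existing callers keep their old imports while the warning points them at the new location, so the wrapper can be removed once nothing triggers it.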
# Clone and set up the environment
git clone https://github.com/bannff/model-benchmarking.git
cd model-benchmarking
python3 -m venv .venv && source .venv/bin/activate
# Install with robust development and testing tools
pip install -e "."
# Run a pipeline dry run against the mock provider
mbenchmark pipeline --dry-run

The repository evaluates targets by combining dynamic benchmarking environments with a cybersecurity dataset schema.
graph TD
CLI[src/model_benchmarking_cli/cli/] --> PIPELINE[src/model_benchmarking_cli/pipeline.py]
PIPELINE -- Async Orchestration --> SUITES[src/runtime/suites/]
SUITES --> TAXONOMY[src/runtime/taxonomy/]
SUITES --> PROVIDERS[src/runtime/providers/]
TAXONOMY --> REGISTRY[(Registry Configs)]
PROVIDERS --> TARGETS((Target LLMs))
style CLI fill:#a2d2ff,stroke:#111,stroke-width:2px,color:#000
style PIPELINE fill:#bde0fe,stroke:#111,stroke-width:2px,color:#000
style TAXONOMY fill:#ffc8dd,stroke:#111,stroke-width:2px,color:#000
style SUITES fill:#ffafcc,stroke:#111,stroke-width:2px,color:#000
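The flow in the diagram (pipeline, then suites, then providers) can be sketched with `asyncio`. This is a simplified illustration under stated assumptions, not the repository's implementation: the `Provider` protocol, `MockProvider`, `Suite`, and `run_pipeline` names are hypothetical, and the taxonomy is reduced to a single `category` label.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


class Provider(Protocol):
    """Hypothetical interface for anything that can answer a prompt."""

    async def complete(self, prompt: str) -> str: ...


class MockProvider:
    """Deterministic stand-in for a target LLM; makes no network calls."""

    async def complete(self, prompt: str) -> str:
        await asyncio.sleep(0)  # yield to the event loop, as a real client would
        return f"mock-answer:{prompt}"


@dataclass
class Suite:
    name: str
    category: str  # hypothetical taxonomy label
    prompts: list[str]

    async def run(self, provider: Provider) -> dict:
        # Send all prompts concurrently; gather keeps results in prompt order.
        answers = await asyncio.gather(*(provider.complete(p) for p in self.prompts))
        return {"suite": self.name, "category": self.category, "answers": list(answers)}


async def run_pipeline(suites: list[Suite], provider: Provider) -> list[dict]:
    # Suites run concurrently as well; results come back in suite order.
    return list(await asyncio.gather(*(s.run(provider) for s in suites)))


if __name__ == "__main__":
    demo = [
        Suite("cs-eval", "knowledge", ["q1", "q2"]),
        Suite("cybergym", "agentic", ["m1"]),
    ]
    for report in asyncio.run(run_pipeline(demo, MockProvider())):
        print(report)
```

Because the provider is injected, a dry run can swap in the mock without touching suite code.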
Comprehensive Q&A benchmarks testing theoretical foundations and security reasoning.
mbenchmark run --suite cs-eval
Real-world vulnerability exploitation challenges in isolated container environments.
mbenchmark run --suite cve-bench
Multi-step, interactive security missions testing agentic decision-making.
mbenchmark run --suite cybergym
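To run the three suites back to back from a script, a small driver along these lines may help. It only wraps the `mbenchmark run --suite <name>` commands shown above. The helper names `build_command` and `run_all` are hypothetical, and treating a non-zero exit code as a failed suite is an assumption about the CLI rather than documented behavior.

```python
import shutil
import subprocess

# Suite names taken from the commands above.
SUITES = ["cs-eval", "cve-bench", "cybergym"]


def build_command(suite: str) -> list[str]:
    """Build the argument list for one suite run."""
    return ["mbenchmark", "run", "--suite", suite]


def run_all(suites: list[str] = SUITES) -> dict[str, int]:
    """Run each suite in turn and return its exit code, keyed by suite name."""
    if shutil.which("mbenchmark") is None:
        raise RuntimeError("mbenchmark is not on PATH; activate the virtualenv first")
    return {s: subprocess.run(build_command(s), check=False).returncode for s in suites}


if __name__ == "__main__":
    for suite, code in run_all().items():
        # Assumption: exit code 0 means the suite completed without errors.
        print(f"{suite}: exit code {code}")
```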
- 🛠 CLI Reference: Master the mbenchmark command.
- 📂 Taxonomy Protocol: Explore the built-in security hierarchy and dynamic YAML mappers.
- 🤖 Agent Onboarding: Core setup guide for integrated AI collaborators.
- 🧭 Steering Documentation: Advanced agent blueprints, skills, and <200 LOC refactoring recipes.
This repository contains intentionally vulnerable code and exploit patterns for research and evaluation purposes.
NEVER run these benchmarks against production systems. Always use the provided Docker isolation primitives.
This project is licensed under the Business Source License 1.1. See LICENSE for the full text.
Model-Benchmarking by Bannff.