bannff/model-benchmarking

🌌 Model Benchmarking

License: BSL 1.1 | Python 3.10+ | Status: Agent-Ready | Build: 100% Green

A cybersecurity benchmarking framework for model evaluation. The repository provides a taxonomy-aligned evaluation pipeline, adapter-based suite orchestration, and a compatibility layer for the current public API.


Engineering Overview

This repository separates the public CLI and compatibility wrappers from the core runtime:

  1. CLI boundary: src/model_benchmarking_cli/ contains the public command surface and adapter wiring.
  2. Runtime core: src/runtime/ holds providers, suites, taxonomy, and evaluation logic.
  3. Compatibility layer: src/mcp/ remains as a transitional import surface for existing callers and tests.
  4. Repo guidance: .agents/ contains workflow and steering notes for contributors and agents.
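To illustrate how a transitional import surface of the kind described in item 3 might behave, here is a minimal sketch using a module-level `__getattr__` (PEP 562) that forwards old names to their new home and emits a `DeprecationWarning`. The module names `legacy_surface` and `new_core` and the function `evaluate` are hypothetical stand-ins invented for this example; this is not the actual contents of src/mcp/ or src/runtime/.

```python
import sys
import types
import warnings

# Hypothetical "new" module standing in for a runtime core.
new_core = types.ModuleType("new_core")


def evaluate(sample: str) -> str:
    """Placeholder evaluation function (illustrative only)."""
    return f"evaluated:{sample}"


new_core.evaluate = evaluate
sys.modules["new_core"] = new_core

# Hypothetical "legacy" module standing in for a compatibility layer.
legacy_surface = types.ModuleType("legacy_surface")


def _forward(name: str):
    """Module-level __getattr__ (PEP 562): resolve old names lazily."""
    target = sys.modules["new_core"]
    if hasattr(target, name):
        warnings.warn(
            f"legacy_surface.{name} is deprecated; import it from new_core",
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(target, name)
    raise AttributeError(f"module 'legacy_surface' has no attribute {name!r}")


legacy_surface.__getattr__ = _forward
sys.modules["legacy_surface"] = legacy_surface


def call_via_legacy(sample: str) -> str:
    """Existing callers keep importing from the old location."""
    from legacy_surface import evaluate as old_evaluate

    return old_evaluate(sample)
```

With a shim like this, existing callers and tests keep working while the warning points them at the new import path.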

🚀 Quick Start

1️⃣ Installation

# Clone the repository and set up a virtual environment
git clone https://github.com/bannff/model-benchmarking.git
cd model-benchmarking
python3 -m venv .venv && source .venv/bin/activate

# Install the package in editable mode
pip install -e "."

2️⃣ Run Your First Eval

# Run a pipeline dry run using the mock provider
mbenchmark pipeline --dry-run

🏗️ Architecture

The CLI invokes the pipeline, which orchestrates the benchmark suites asynchronously. Each suite draws on the cybersecurity taxonomy (backed by registry configs) and on providers, which connect to the target LLMs.

graph TD
    CLI[src/model_benchmarking_cli/cli/] --> PIPELINE[src/model_benchmarking_cli/pipeline.py]
    PIPELINE -- Async Orchestration --> SUITES[src/runtime/suites/]
    
    SUITES --> TAXONOMY[src/runtime/taxonomy/]
    SUITES --> PROVIDERS[src/runtime/providers/]
    
    TAXONOMY --> REGISTRY[(Registry Configs)]
    PROVIDERS --> TARGETS((Target LLMs))
    
    style CLI fill:#a2d2ff,stroke:#111,stroke-width:2px,color:#000
    style PIPELINE fill:#bde0fe,stroke:#111,stroke-width:2px,color:#000
    style TAXONOMY fill:#ffc8dd,stroke:#111,stroke-width:2px,color:#000
    style SUITES fill:#ffafcc,stroke:#111,stroke-width:2px,color:#000
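The flow in the diagram (pipeline fans out to suites, suites call providers) can be sketched with `asyncio`. This is a minimal illustration under stated assumptions: the `Provider` protocol, `MockProvider`, `SuiteResult`, `run_suite`, and `run_pipeline` are hypothetical names chosen for this example and are not taken from the repository's code.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol


class Provider(Protocol):
    """Hypothetical provider interface: sends a prompt to a target model."""

    async def complete(self, prompt: str) -> str: ...


class MockProvider:
    """Stand-in provider that echoes the prompt, so no real model is called."""

    async def complete(self, prompt: str) -> str:
        await asyncio.sleep(0)  # yield control, as a real network call would
        return f"mock-answer:{prompt}"


@dataclass
class SuiteResult:
    suite: str
    answers: list[str]


async def run_suite(name: str, prompts: list[str], provider: Provider) -> SuiteResult:
    """Run one suite: query the provider for every prompt concurrently."""
    answers = await asyncio.gather(*(provider.complete(p) for p in prompts))
    return SuiteResult(suite=name, answers=list(answers))


async def run_pipeline(
    suites: dict[str, list[str]], provider: Provider
) -> list[SuiteResult]:
    """Fan out across suites and collect results in input order."""
    tasks = [run_suite(name, prompts, provider) for name, prompts in suites.items()]
    return list(await asyncio.gather(*tasks))


if __name__ == "__main__":
    demo = {"cs-eval": ["q1", "q2"], "cybergym": ["mission-1"]}
    for result in asyncio.run(run_pipeline(demo, MockProvider())):
        print(result)
```

Because `asyncio.gather` preserves argument order, results come back in the same order the suites were submitted, even though the provider calls overlap.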

📊 Benchmark Suites

📝 CS-Eval (Security Knowledge)

Comprehensive Q&A benchmarks testing theoretical foundations and security reasoning.
mbenchmark run --suite cs-eval

🐛 CVE-Bench (Exploit Dev)

Real-world vulnerability exploitation challenges in isolated container environments.
mbenchmark run --suite cve-bench

🏟️ CyberGym (Interactive Scenarios)

Multi-step, interactive security missions testing agentic decision-making.
mbenchmark run --suite cybergym
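For scripting several suites in sequence, the three commands above can be generated from a single list. The sketch below only builds the command lines and does not execute anything; the helper names `suite_command` and `all_suite_commands` are hypothetical, while the suite names and the `run --suite` form are taken from the commands in this section.

```python
import shlex

# Suite names as they appear in the commands above.
SUITES = ("cs-eval", "cve-bench", "cybergym")


def suite_command(suite: str) -> list[str]:
    """Return the argv list for running one suite, rejecting unknown names."""
    if suite not in SUITES:
        raise ValueError(f"unknown suite: {suite!r}")
    return ["mbenchmark", "run", "--suite", suite]


def all_suite_commands() -> list[str]:
    """Return shell-quoted command lines for every known suite."""
    return [shlex.join(suite_command(s)) for s in SUITES]


if __name__ == "__main__":
    for line in all_suite_commands():
        print(line)
```

Each argv list could be handed to `subprocess.run(argv, check=True)` inside an isolated environment, in line with the safety guidance in this README.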


🛡 Safety & Disclosure

This repository contains intentionally vulnerable code and exploit patterns for research and evaluation purposes.
NEVER run these benchmarks against production systems. Always use the provided Docker isolation primitives.


📜 License

This project is licensed under the Business Source License 1.1. See LICENSE for the full text.
Model-Benchmarking by Bannff.
