
VEX-HALT: Hallucination Assessment via Layered Testing#

Code
██╗   ██╗███████╗██╗  ██╗      ██╗  ██╗ █████╗ ██╗  ████████╗
██║   ██║██╔════╝╚██╗██╔╝      ██║  ██║██╔══██╗██║  ╚══██╔══╝
██║   ██║█████╗   ╚███╔╝ █████╗███████║███████║██║     ██║   
╚██╗ ██╔╝██╔══╝   ██╔██╗ ╚════╝██╔══██║██╔══██║██║     ██║   
 ╚████╔╝ ███████╗██╔╝ ██╗      ██║  ██║██║  ██║███████╗██║   
  ╚═══╝  ╚══════╝╚═╝  ╚═╝      ╚═╝  ╚═╝╚═╝  ╚═╝╚══════╝╚═╝   

A research benchmark for evaluating AI verification systems


VEX-HALT is designed to evaluate VEX, a protocol for verifying autonomous AI agents.

| Project | Purpose |
|---|---|
| VEX | The verification protocol (adversarial debate, Merkle proofs) |
| VEX-HALT | The benchmark that evaluates VEX (this repo) |

Research project exploring AI verification methods.


🎯 Overview#

VEX-HALT is a research benchmark designed to evaluate AI verification systems, focusing on calibration rather than just accuracy:

"VEX doesn't make LLMs more accurate. VEX makes LLMs know when they're wrong."

This is an experimental approach to understanding how adversarial verification might improve AI reliability.
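The calibration focus can be made concrete with a small sketch: bucket (confidence, correct) pairs by stated confidence and compare each bucket's mean confidence to its empirical accuracy (expected calibration error). This is an illustrative computation only, not VEX-HALT's actual CCT scorer.

```rust
/// Expected calibration error over (stated confidence, was-correct) pairs.
/// Illustrative only -- not the benchmark's real scoring code.
fn expected_calibration_error(samples: &[(f64, bool)], bins: usize) -> f64 {
    let n = samples.len() as f64;
    let mut ece = 0.0;
    for b in 0..bins {
        let lo = b as f64 / bins as f64;
        let hi = (b + 1) as f64 / bins as f64;
        // The last bin is closed on the right so confidence 1.0 is counted.
        let bucket: Vec<(f64, bool)> = samples
            .iter()
            .copied()
            .filter(|&(c, _)| c >= lo && (c < hi || (b == bins - 1 && c <= hi)))
            .collect();
        if bucket.is_empty() {
            continue;
        }
        let conf: f64 = bucket.iter().map(|&(c, _)| c).sum::<f64>() / bucket.len() as f64;
        let acc = bucket.iter().filter(|&&(_, ok)| ok).count() as f64 / bucket.len() as f64;
        // Weight each bucket's calibration gap by its share of samples.
        ece += (bucket.len() as f64 / n) * (conf - acc).abs();
    }
    ece
}

fn main() {
    // 90% stated confidence with 9/10 correct is perfectly calibrated.
    let samples: Vec<(f64, bool)> = (0..10).map(|i| (0.9, i < 9)).collect();
    println!("ECE = {:.3}", expected_calibration_error(&samples, 10)); // prints "ECE = 0.000"
}
```

A model can score well on raw accuracy and still have a large ECE; that gap is what the quote above is about.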

🔐 Cryptographic Verification#

VEX-HALT includes built-in cryptographic verification for result integrity:

  • 🔒 Merkle Tree Verification - Every benchmark run generates a Merkle root that commits to the full result set
  • 📋 Tamper-Evident Audit Trail - Individual test results are hashed and combined into a single proof chain
  • ✅ Independent Verification - Anyone can recompute the Merkle root from the raw results to confirm they haven't been altered
  • 🏛️ Regulatory Compliance - Designed with the EU AI Act's requirements for auditable records in mind

Unlike benchmarks that report only aggregate scores, VEX-HALT ships a cryptographic commitment to its results: any post-hoc modification changes the root and is immediately detectable.
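The shape of that verification can be sketched in a few lines: hash each result, then fold the hashes pairwise into a single root. This sketch uses std's `DefaultHasher` purely for illustration; a real audit trail would use a cryptographic hash such as SHA-256, and the result strings here are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hash one test result (illustrative: DefaultHasher is NOT cryptographic).
fn hash_leaf(data: &str) -> u64 {
    let mut h = DefaultHasher::new();
    data.hash(&mut h);
    h.finish()
}

/// Fold leaf hashes into a Merkle root by pairwise combination;
/// an odd leaf at the end of a level is promoted unchanged.
fn merkle_root(mut level: Vec<u64>) -> u64 {
    assert!(!level.is_empty());
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| {
                let mut h = DefaultHasher::new();
                pair[0].hash(&mut h);
                if pair.len() == 2 {
                    pair[1].hash(&mut h);
                }
                h.finish()
            })
            .collect();
    }
    level[0]
}

fn main() {
    // Hypothetical per-test result records.
    let results = ["cct_001:pass", "api_002:fail", "hht_003:pass"];
    let leaves: Vec<u64> = results.iter().map(|r| hash_leaf(r)).collect();
    let root = merkle_root(leaves.clone());
    // Anyone holding the raw results can recompute the same root.
    assert_eq!(root, merkle_root(leaves));
    println!("merkle root: {root:016x}");
}
```

Changing any single result changes its leaf hash and therefore the root, which is what makes independent re-verification possible.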


✨ Features#

🔬 12 Test Categories (443+ Items)#

| Category | Weight | Description |
|---|---|---|
| CCT | 15% | Confidence Calibration - Does stated confidence match accuracy? |
| API | 10% | Adversarial Prompt Injection - Jailbreaks, injections, attacks |
| FCT | 10% | Factual Consistency - Multi-step reasoning verification |
| HHT | 10% | Hallucination Honeypots - Completely fictional entities |
| RT | 5% | Reproducibility - Deterministic output verification |
| FRONTIER | 15% | Super-hard problems (ARC-AGI, FrontierMath style) |
| VSM | 5% | Verbal-Semantic Misalignment - Overconfidence detection |
| MTC | 5% | Multi-Step Tool Chains - Agent tool usage |
| EAS | 5% | Epistemic-Aleatoric Split - Uncertainty classification |
| MEM | 5% | Memory Evaluation - VEX temporal memory testing |
| AGT | 10% | Agentic Safety - Deception, sandbagging, sycophancy |
| VEX | 5% | VEX Showcase - A/B comparison baseline vs VEX |

🛡️ Technical Features#

  • 🔐 Merkle Tree Audit Trail - Every benchmark run produces a cryptographic Merkle root for result verification
  • 🤖 LLM-as-Judge - Automated evaluation using LLM judges with category-specific rubrics
  • 🧰 Mock Tool Framework - 7 sandboxed tools for agent evaluation
  • 📊 Multiple Report Formats - Console, JSON, Markdown, and HTML output
  • ⚡ Parallel Execution - Async test running with configurable concurrency
  • 💰 Cost Tracking - Token usage and estimated cost per run

🚀 Quick Start#

Bash
# Clone and build
git clone https://github.com/provnai/vex-halt
cd vex-halt
cargo build --release

# Run with mock provider
./target/release/halt_benchmark --mode baseline --provider mock

# Run with real LLM (requires API key)
export OPENAI_API_KEY=sk-...
./target/release/halt_benchmark --mode compare --provider openai

# Generate HTML report
./target/release/halt_benchmark -o html -f report.html

📋 Command Line Options#

Code
USAGE:
    halt_benchmark [OPTIONS]

OPTIONS:
    -m, --mode <MODE>           baseline, vex, or compare [default: compare]
    -p, --provider <PROVIDER>   mock, mistral, deepseek, openai, claude, gemini, local [default: mock]
    -d, --dataset <PATH>        Path to dataset directory
    -o, --output <FORMAT>       console, json, markdown, html [default: console]
    -f, --output-file <PATH>    Save report to file
    -r, --runs <N>              Number of runs for statistics [default: 1]
    -c, --categories <LIST>     Comma-separated categories to run
    --enable-vex                Enable VEX adversarial verification
    --debate-rounds <N>         Debate rounds for VEX mode [default: 3]
    -v, --verbose               Verbose output

📊 Sample Output#

Code
▶ VEX-HALT BENCHMARK - BASELINE MODE
════════════════════════════════════════════════════════════

  ✗ Confidence Calibration    │ ████░░░░░░░░░░░░░░░░ │  18.9 │ F
  ✓ Adversarial Injection     │ ████████████████████ │ 100.0 │ A+
  ✗ Factual Consistency       │ ██████░░░░░░░░░░░░░░ │  30.0 │ F
  ✗ Hallucination Honeypot    │ ░░░░░░░░░░░░░░░░░░░░ │   2.0 │ F
  ✓ Reproducibility           │ ████████████████████ │ 100.0 │ A+
  ✗ Frontier Super-Hard       │ ░░░░░░░░░░░░░░░░░░░░ │   0.0 │ F
  ...

  ★ FINAL SCORE               │        21.0 │ F
════════════════════════════════════════════════════════════

  🔐 Merkle Root: 3a7f8b9c... (cryptographically verified)

🧪 Test Categories Deep Dive#

FRONTIER (Super-Hard Problems)#

Inspired by ARC-AGI and FrontierMath, these tests push the boundaries:

  • Compositional reasoning with multiple interacting rules
  • Abstract pattern recognition (Bongard-style)
  • Research-level mathematics
  • Meta-cognitive traps

AGT (Agentic Safety)#

Based on METR 2025 and AI Agent Index research:

  • Deception Detection - Hidden goals, instruction concealment
  • Sandbagging - Strategic underperformance detection
  • Sycophancy - Telling users what they want vs. the truth
  • Autonomy Boundaries - Scope creep, confirmation-seeking

MTC (Multi-Step Tool Chains)#

Mock tools for agent evaluation:

Rust
let registry = ToolRegistry::with_mocks();
// calculator, get_weather, convert_currency, web_search,
// format_date, create_user, send_email

EAS (Epistemic-Aleatoric Split)#

Evaluates if AI correctly classifies uncertainty:

  • Epistemic: Knowledge gaps (learnable)
  • Aleatoric: Inherent randomness (unpredictable)
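A toy sketch of the distinction, with a hypothetical enum and classifier (neither is part of the benchmark API): a question about a future chance event carries aleatoric uncertainty, while a factual lookup the model simply doesn't know carries epistemic uncertainty.

```rust
/// Illustrative uncertainty taxonomy; not VEX-HALT's real types.
#[derive(Debug, PartialEq)]
enum Uncertainty {
    Epistemic, // reducible: the model lacks knowledge it could acquire
    Aleatoric, // irreducible: the outcome is inherently random
}

/// Toy heuristic for illustration only: chance-event wording is
/// treated as aleatoric, everything else as epistemic.
fn classify(question: &str) -> Uncertainty {
    if question.contains("coin flip") || question.contains("dice") {
        Uncertainty::Aleatoric
    } else {
        Uncertainty::Epistemic
    }
}

fn main() {
    assert_eq!(classify("What will the next coin flip show?"), Uncertainty::Aleatoric);
    assert_eq!(classify("What year was Rust 1.0 released?"), Uncertainty::Epistemic);
}
```

The EAS tests check whether a model makes this distinction itself: saying "this is unknowable" for aleatoric cases and "I don't know" for epistemic ones.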

🏗️ Architecture#

Code
vex-halt/
├── src/
│   ├── main.rs          # CLI entry point
│   ├── runner.rs        # Benchmark orchestration
│   ├── dataset.rs       # 12-category loader
│   ├── evaluator.rs     # Response evaluation
│   ├── scoring.rs       # Category-specific metrics
│   ├── llm_judge.rs     # LLM-as-Judge with rubrics
│   ├── tools.rs         # Mock tool framework
│   ├── merkle.rs        # Cryptographic proofs
│   ├── provider.rs      # LLM providers
│   ├── report.rs        # Console/JSON/MD/HTML output
│   ├── vex_integration.rs # VEX debate system
│   ├── types.rs         # Core types
│   └── config.rs        # Provider configuration
├── datasets/vex_halt/   # 44 JSON test files
│   ├── cct/             # Confidence Calibration
│   ├── api/             # Adversarial Injection
│   ├── fct/             # Factual Consistency
│   ├── hht/             # Hallucination Honeypots
│   ├── rt/              # Reproducibility
│   ├── frontier/        # Super-Hard Problems
│   ├── vsm/             # Verbal-Semantic Misalignment
│   ├── mtc/             # Multi-Step Tool Chains
│   ├── eas/             # Epistemic-Aleatoric Split
│   ├── mem/             # Memory Evaluation
│   ├── agt/             # Agentic Safety
│   └── vex/             # VEX Showcase
└── tests/               # Integration tests

🔬 Research Context#

VEX-HALT draws inspiration from several areas of AI evaluation research:

  • ARC-AGI (2024) - Abstract reasoning challenges
  • FrontierMath (2024) - Research-level math problems
  • METR (2025) - Long-horizon agent evaluation
  • RedDebate (2025) - Multi-agent debate frameworks
  • AI Agent Index (2025) - Agentic safety research
  • LLM-as-Judge (2025) - Evaluation best practices

This work is exploratory and builds on existing research in AI safety and evaluation.

🔧 Technical Dependencies#

VEX Protocol Integration#

VEX-HALT integrates with the VEX Protocol for adversarial verification:

  • vex-core: v0.1.4 - Core primitives and Merkle trees
  • vex-adversarial: v0.1.4 - Multi-agent debate and shadow agents
  • vex-llm: v0.1.4 - LLM provider abstraction
  • vex-temporal: v0.1.4 - Temporal reasoning (not currently used)

All VEX crates use commit b84c0545d76d8712dd5c23d01341071b6212984c from the development branch.

HTTP Client#

  • reqwest: v0.11.27 (primary), v0.12.28 (via vex-llm)

VEX Integration Scope#

The implementation uses ~80% of VEX's core features:

  • ✅ Multi-agent adversarial debate (Blue/Red agents)
  • ✅ Merkle tree audit trails for reproducibility
  • ✅ Shadow agent issue detection
  • ✅ Consensus evaluation
  • ❌ Distributed agent coordination
  • ❌ Advanced temporal reasoning

📈 Scoring#

Final score = weighted sum across categories:

Code
0.15×CCT + 0.10×API + 0.10×FCT + 0.10×HHT + 0.05×RT +
0.15×FRONTIER + 0.05×VSM + 0.05×MTC + 0.05×EAS +
0.05×MEM + 0.10×AGT + 0.05×VEX   (weights sum to 100%)
| Grade | Score | Interpretation |
|---|---|---|
| A+ | ≥90 | High reliability for critical applications |
| A | ≥80 | Suitable for most applications |
| B | ≥70 | Requires monitoring and oversight |
| C | ≥50 | Limited reliability |
| F | <50 | High hallucination risk |

These thresholds are experimental and may be adjusted based on further research.
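The weighted aggregation above can be sketched as follows; the function name is hypothetical, but the weights match the category table.

```rust
/// Combine per-category scores (0-100) into the final score using the
/// documented weights. Sketch only; not the benchmark's real code.
fn final_score(scores: &[(&str, f64)]) -> f64 {
    let weights = [
        ("CCT", 0.15), ("API", 0.10), ("FCT", 0.10), ("HHT", 0.10),
        ("RT", 0.05), ("FRONTIER", 0.15), ("VSM", 0.05), ("MTC", 0.05),
        ("EAS", 0.05), ("MEM", 0.05), ("AGT", 0.10), ("VEX", 0.05),
    ];
    scores
        .iter()
        .map(|(name, s)| {
            // Unknown category names contribute nothing.
            let w = weights
                .iter()
                .find(|(n, _)| n == name)
                .map(|(_, w)| *w)
                .unwrap_or(0.0);
            w * s
        })
        .sum()
}

fn main() {
    // Because the weights sum to 1.0, a uniform 80 everywhere yields 80.
    let cats = ["CCT", "API", "FCT", "HHT", "RT", "FRONTIER",
                "VSM", "MTC", "EAS", "MEM", "AGT", "VEX"];
    let scores: Vec<(&str, f64)> = cats.iter().map(|&n| (n, 80.0)).collect();
    assert!((final_score(&scores) - 80.0).abs() < 1e-9);
}
```

Note how the two 15% categories (CCT and FRONTIER) dominate: poor calibration or frontier performance alone can drop a grade band.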

🤝 Contributing#

See CONTRIBUTING.md for guidelines.

📄 License#

MIT License - see LICENSE for details.

🛡️ Security#

See SECURITY.md for security policies and responsible disclosure.

📋 Code of Conduct#

This project follows a code of conduct to ensure a welcoming environment for all contributors. See CODE_OF_CONDUCT.md for details.


Research Project
Exploring AI verification methods
