StellarRequiem · Verification Infrastructure

Alex Price / StellarRequiem

Verified work, or it doesn't ship.

I build verification tooling for AI-era work. As models and agents produce more claims than anyone can check, the dangerous ones are the confident-sounding numbers nobody verified. One rule runs through everything I ship: no belief without verification — every result carries evidence a third party can re-run. The badge is the claim; the honest no is part of the deliverable.

Public repos: 29 · CI: all-green · Calibration-log: live · Proof: re-runnable · Honest gaps: stated

Responsible Security Research

Offensive technique applied under explicit authorization, with audit. I treat "am I allowed to test this?" and "can I prove what I did?" as first-class, enforced constraints — not afterthoughts.

Authorization framework · public

scope-gate

A deny-by-default authorization gate: test only what you're explicitly authorized to test. Ships with a responsible-research charter. The boundary that makes dual-use work safe.

Public · Deny-by-default
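
As an illustration only, a deny-by-default scope check might look like the sketch below. It assumes scope is written down as an explicit allowlist of hostnames and CIDR ranges; `Scope` and `is_authorized` are hypothetical names, not scope-gate's actual interface.

```python
import ipaddress
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Scope:
    """Hypothetical scope definition: only what is listed here may be tested."""
    hosts: frozenset = field(default_factory=frozenset)
    networks: tuple = ()  # CIDR strings, e.g. "10.0.0.0/24"


def is_authorized(target: str, scope: Scope) -> bool:
    """Return True only on an explicit match; everything else is denied."""
    if target in scope.hosts:
        return True
    try:
        addr = ipaddress.ip_address(target)
    except ValueError:
        # Not an IP address and not an allowlisted hostname: deny.
        return False
    return any(addr in ipaddress.ip_network(net) for net in scope.networks)
```

The design point is that there is no "allow" fallback: an empty `Scope()` authorizes nothing, so forgetting to configure scope fails closed.
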
Coordinated disclosure

Vulnerability research

Coordinated-disclosure research across deployed AI-infrastructure / MCP-profile targets. Findings are verified locally with reproducible PoCs before the affected party is contacted, and each is checked for novelty and scope.

Disclosure — in progress
Discipline

Audit everything

Append-only, hash-chained records of what was tested and found — the same evidence standard that governs the rest of my work, applied to security engagements.

Append-only · Hash-chained
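
For readers unfamiliar with the idea, an append-only, hash-chained record can be sketched in a few lines. This is a generic illustration, assuming SHA-256 over a canonical JSON encoding and an all-zero genesis hash; the record layout and the names `append_record` and `verify_chain` are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first record


def _digest(prev: str, payload: dict) -> str:
    # sort_keys gives a stable encoding, so the same payload always hashes the same.
    body = json.dumps({"prev": prev, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_record(chain: list, payload: dict) -> dict:
    """Append a record whose hash covers its payload and the previous hash."""
    prev = chain[-1]["hash"] if chain else GENESIS
    record = {"prev": prev, "payload": payload, "hash": _digest(prev, payload)}
    chain.append(record)
    return record


def verify_chain(chain: list) -> bool:
    """Recompute every hash; an edit, reorder, or mid-chain deletion fails."""
    prev = GENESIS
    for record in chain:
        if record["prev"] != prev or record["hash"] != _digest(prev, record["payload"]):
            return False
        prev = record["hash"]
    return True
```

Note what this sketch does not cover: dropping records from the tail still verifies, so the latest hash has to be published or witnessed somewhere outside the log for truncation to be detectable.
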
Benchmark · public

mcp-bench

Do MCP security scanners actually catch authorization-logic bugs? An independent, reproducible benchmark seeded with real confirmed findings: two mature SAST tools catch the control bugs but miss the authz-logic class. Scanners run only in a disposable CI runner.

Public · Reproducible · authz-logic 0/3
vulnerability research · PoC development · authorization-gated testing · coordinated disclosure · MCP / AI-infra security · red-team tooling

Verified AI Labor — the platform

Can a company be run as agents? Only if you can trust what each agent says it did — so I built the whole loop around one primitive: verification.

The gate

verity-core

Refuses a "95% accuracy" claim until it clears statistical hygiene (sample size, out-of-sample evaluation, leakage, lift over base rate), then proves it: each claim ships a re-runnable command, and the number must reproduce or CI fails. 17 domain packs · CI gate · MCP tool.

Proof-carrying · CI gate
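
To make the four hygiene checks concrete, here is a minimal sketch of a gate that refuses a claim and says why. It is not verity-core's code: the `AccuracyClaim` fields, the function name, and both thresholds (`min_n`, `min_lift`) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class AccuracyClaim:
    accuracy: float        # claimed accuracy, e.g. 0.95
    n_test: int            # number of evaluation examples scored
    base_rate: float       # accuracy of always predicting the majority class
    out_of_sample: bool    # evaluation data disjoint from training data
    leakage_checked: bool  # features audited for target leakage


def hygiene_failures(claim: AccuracyClaim, min_n: int = 100,
                     min_lift: float = 0.02) -> list:
    """Return the reasons a claim is refused; an empty list means it passes."""
    failures = []
    if claim.n_test < min_n:
        failures.append("sample too small")
    if not claim.out_of_sample:
        failures.append("not evaluated out of sample")
    if not claim.leakage_checked:
        failures.append("leakage not checked")
    if claim.accuracy - claim.base_rate < min_lift:
        failures.append("no meaningful lift over base rate")
    return failures
```

Under these assumed thresholds, a "95% accuracy" claim on 40 examples where the majority class alone gets 94% is refused on both sample size and lift, even though the headline number sounds strong.
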
The labor

verified-ai-labor

A working prototype of a company run as agents — a 13-stage pipeline where every result-claim is verity-gated and every action hash-chain-logged, observable in a live console. Tests run locally; CI badge pending.

Verity-gated · Hash-chained
Governed autonomy

The operating surface

Agents run a real workstation, but every action routes through a deny-by-default reference monitor first: reads and local work proceed, anything outward or destructive is held for a human, and money and credentials are refused. Model proposes, code disposes: every decision is hash-chained, every window operator-summoned. No capability the gate didn't grant.

Deny-by-default · Hash-chained · Operator-gated
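
The three outcomes described above (proceed, hold for a human, refuse) can be sketched as a small policy table. The action-class names and the choice to refuse anything unlisted are assumptions made for this example; a real monitor would also have to classify actions itself rather than trust a label.

```python
from enum import Enum


class Verdict(Enum):
    ALLOW = "allow"    # proceeds without a human
    HOLD = "hold"      # queued for operator approval
    REFUSE = "refuse"  # never executed


# Hypothetical action classes mapped to the three outcomes.
POLICY = {
    "read": Verdict.ALLOW,
    "local_write": Verdict.ALLOW,
    "network_send": Verdict.HOLD,
    "delete": Verdict.HOLD,
    "payment": Verdict.REFUSE,
    "credential_access": Verdict.REFUSE,
}


def decide(action_class: str) -> Verdict:
    """Deny by default: an action class the policy does not name is refused."""
    return POLICY.get(action_class, Verdict.REFUSE)
```

Keeping the decision in a plain lookup, outside the model, is one way to read "model proposes, code disposes": the model can request any action, but only the table can grant it.
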
The benchmark

groundtruth-bench

Citation faithfulness you can re-run to the same hash: a cryptographically committed corpus scored offline, byte-identical across machines — where RAG eval (RAGAS/ARES) is online, metered, and uncommittable. Reports where the scorer fails, not just the flattering number.

Byte-reproducible · Committed corpus
The proof

calibration-log

A public, hash-chained prediction record scored over time (Brier + calibration). Honesty you can't doctor — it reports the real number whether there's an edge or not.

Live · Hash-chained
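
For reference, the Brier score is the mean squared difference between forecast probabilities and the 0/1 outcomes, and calibration compares forecast probability with observed frequency. The sketch below shows both under simple assumptions (binary outcomes, equal-width bins); the function names and binning scheme are illustrative and not taken from calibration-log.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


def calibration_table(forecasts, outcomes, n_bins=5):
    """Per-bin (mean forecast, observed frequency, count) for non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(forecasts, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, o))
    return [
        (sum(p for p, _ in b) / len(b), sum(o for _, o in b) / len(b), len(b))
        for b in bins
        if b
    ]
```

Lower is better, and 0 is a perfect record. A forecaster who always says 0.5 scores exactly 0.25 regardless of outcomes, which makes that a natural no-skill reference point.
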
The adjudicator

scorecheck

Adjudicates a published benchmark claim against its raw run-logs — REPRODUCED / DID-NOT-REPRODUCE / CHERRY-PICKED — sealed into a re-runnable receipt. Surfaces the dropped, flipped, and fabricated rows that re-run leaderboards and reproducibility badges miss; survived a 3-lens adversarial pass.

Reproduced · Cherry-picked · CI-green
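
One way to picture the three verdicts is a toy adjudicator over per-run log rows. The decision rule here is an assumption invented for illustration, not scorecheck's actual logic: the claim is REPRODUCED if it matches accuracy over all logged rows, CHERRY-PICKED if it only matches the subset the publisher reported, and DID-NOT-REPRODUCE otherwise. The row fields `correct` and `reported` are hypothetical.

```python
def adjudicate(claimed: float, rows: list, tol: float = 1e-9) -> str:
    """rows: dicts with boolean 'correct' and 'reported' fields, one per logged run."""

    def accuracy(subset):
        # NaN for an empty subset, so it can never match a claimed number.
        return sum(r["correct"] for r in subset) / len(subset) if subset else float("nan")

    full = accuracy(rows)
    reported = accuracy([r for r in rows if r["reported"]])
    if abs(full - claimed) <= tol:
        return "REPRODUCED"
    if abs(reported - claimed) <= tol:
        return "CHERRY-PICKED"
    return "DID-NOT-REPRODUCE"
```

With three correct reported runs and one incorrect run left out of the report, a claim of 0.75 reproduces, a claim of 1.0 is flagged as cherry-picked, and a claim of 0.9 matches neither view of the logs.
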
Trust tooling

firewall · grounded · reality-anchor

Three tools: one flags unverified claims in AI output, one verifies that every cited claim is supported by its source, and one is a research agent that grounds its answers or abstains rather than fabricating.

Deterministic · Offline

How my work is verifiable

Not a portfolio of assertions — a portfolio you can re-run.

01
CI-green or it doesn't ship. Public repos run their own tests; where a repo has CI, the green badge is the claim.
02
Runnable proofs. Results come with the exact command a third party executes to reproduce them.
03
A public calibration log. Predictions are hash-chained and scored over time — the honesty is auditable, not asserted.
04
Honest gaps, stated. Every deliverable names what it did not verify. An unverifiable claim does not count.
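
Point 02 above, a result that ships with the exact command to reproduce it, can be sketched as a check suitable for a CI step. This is an assumed shape, not the repos' actual mechanism: it supposes the command prints the number on its last line of stdout, and `reproduces` and `EXAMPLE_COMMAND` are hypothetical names.

```python
import subprocess
import sys

# Stand-in for a real evaluation command; a real claim would ship its own.
EXAMPLE_COMMAND = [sys.executable, "-c", "print(0.95)"]


def reproduces(command: list, claimed: float, tol: float = 1e-6) -> bool:
    """Run the command and compare the last line of stdout to the claimed number."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=60)
    if result.returncode != 0:
        return False  # a failing command never counts as a reproduction
    lines = result.stdout.strip().splitlines()
    if not lines:
        return False
    try:
        observed = float(lines[-1])
    except ValueError:
        return False  # output was not a number
    return abs(observed - claimed) <= tol
```

A CI job could call `reproduces(EXAMPLE_COMMAND, claimed=0.95)` and fail the build when it returns `False`, which is the "number must reproduce or CI fails" behavior in miniature.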