I build verification tooling for AI-era work. As models and agents produce more claims than anyone can check, the dangerous ones are the confident-sounding numbers nobody verified. One rule runs through everything I ship: no belief without verification — every result carries evidence a third party can re-run. The badge is the claim; the honest no is part of the deliverable.
Offensive technique applied under explicit authorization, with audit. I treat "am I allowed to test this?" and "can I prove what I did?" as first-class, enforced constraints — not afterthoughts.
A deny-by-default authorization gate: test only what you're explicitly authorized to. Ships with a responsible-research charter. The boundary that makes dual-use work safe.
Responsible coordinated-disclosure research across deployed AI-infrastructure / MCP-profile targets. Findings verified locally with reproducible PoCs before contact; novelty- and scope-checked.
Append-only, hash-chained records of what was tested and found — the same evidence standard that governs the rest of my work, applied to security engagements.
Do MCP security scanners actually catch authorization-logic bugs? An independent, reproducible benchmark seeded with real confirmed findings: two mature SASTs catch the control bugs but miss the authz-logic class. Scanners run only in a disposable CI runner.
Can a company be run as agents? Only if you can trust what each agent says it did — so I built the whole loop around one primitive: verification.
Refuses a "95% accuracy" claim until it clears statistical hygiene — sample, out-of-sample, leakage, lift over base rate — then proves it: a claim ships a re-runnable command and the number must reproduce or CI fails. 17 domain packs · CI gate · MCP tool.
A working prototype of a company run as agents — a 13-stage pipeline where every result-claim is verity-gated and every action hash-chain-logged, observable in a live console. Tests run locally; CI badge pending.
Agents run a real workstation — but every action routes through a deny-by-default reference monitor first: reads and local work proceed, anything outward or destructive is held for a human, money and credentials are refused. Model proposes, code disposes — every decision hash-chained, every window operator-summoned. No capability the gate didn't grant.
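Added illustration (not code from any project above, and not the author's): a minimal deny-by-default reference monitor in the shape described here and in the hash-chained audit records mentioned earlier, with a verifier that catches any rewritten, dropped or reordered entry. The policy table, action names and sample actions are constructed for the example.

```python
import copy
import hashlib
import json

GENESIS_HASH = "0" * 64  # chain anchor for the first entry

# Constructed policy table (hypothetical action names). Anything not listed
# here falls through to "deny": the monitor grants no capability by default.
POLICY = {
    "read_file": "allow",          # reads and local work proceed
    "run_local_build": "allow",
    "send_email": "hold",          # anything outward is held for a human
    "delete_directory": "hold",    # anything destructive is held for a human
    "transfer_funds": "refuse",    # money is refused outright
    "read_credential": "refuse",   # credentials are refused outright
}


def canonical(obj) -> bytes:
    """Deterministic JSON so the same entry always hashes the same way."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode()


class ReferenceMonitor:
    """Deny-by-default gate that appends every decision to a hash chain."""

    def __init__(self):
        self.log: list[dict] = []  # append-only; entries are never edited

    def decide(self, action: str, args: dict) -> str:
        verdict = POLICY.get(action, "deny")
        prev = self.log[-1]["hash"] if self.log else GENESIS_HASH
        body = {
            "seq": len(self.log),
            "action": action,
            "args": args,
            "verdict": verdict,
            "prev": prev,  # commits this entry to everything before it
        }
        entry_hash = hashlib.sha256(canonical(body)).hexdigest()
        self.log.append({"body": body, "hash": entry_hash})
        return verdict


def verify_chain(log: list[dict]) -> bool:
    """Recompute every hash and link; any edit, drop or reorder breaks it."""
    prev = GENESIS_HASH
    for i, entry in enumerate(log):
        body = entry["body"]
        if body["seq"] != i or body["prev"] != prev:
            return False
        if hashlib.sha256(canonical(body)).hexdigest() != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


def tampered_copy(log: list[dict], index: int, field: str, value) -> list[dict]:
    """Return a copy of the log with one entry's field altered (for the check)."""
    altered = copy.deepcopy(log)
    altered[index]["body"][field] = value
    return altered


def demo() -> tuple[list[str], bool, bool]:
    """Run constructed actions; return (verdicts, chain valid, forged chain valid)."""
    monitor = ReferenceMonitor()
    actions = [
        ("read_file", {"path": "notes.md"}),
        ("send_email", {"to": "customer@example.com"}),
        ("transfer_funds", {"amount": 500}),
        ("push_to_prod", {"branch": "main"}),  # unlisted action -> deny
    ]
    verdicts = [monitor.decide(action, args) for action, args in actions]
    # Someone later "corrects" the held email to an allow: the chain catches it.
    forged = tampered_copy(monitor.log, 1, "verdict", "allow")
    return verdicts, verify_chain(monitor.log), verify_chain(forged)


if __name__ == "__main__":
    print(demo())
```

The point of storing `prev` inside the hashed body is that an auditor needs nothing but the log itself to re-run the check; whether the verdicts were right is a separate question the log makes answerable.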
Citation faithfulness you can re-run to the same hash: a cryptographically committed corpus scored offline, byte-identical across machines — where RAG eval (RAGAS/ARES) is online, metered, and uncommittable. Reports where the scorer fails, not just the flattering number.
A public, hash-chained prediction record scored over time (Brier + calibration). Honesty you can't doctor — it reports the real number whether there's an edge or not.
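Added illustration (not the author's scorer): the scoring half of such a record, computing the Brier score, a skill score against the base rate (zero or negative when there is no edge) and a binned calibration table with its expected calibration error. The prediction record below is constructed.

```python
# Constructed record: "p" is the stated probability that the event happens,
# "outcome" is 1 if it did.
SAMPLE_RECORD = [
    {"p": 0.9, "outcome": 1}, {"p": 0.8, "outcome": 1}, {"p": 0.7, "outcome": 0},
    {"p": 0.6, "outcome": 1}, {"p": 0.3, "outcome": 0}, {"p": 0.2, "outcome": 0},
    {"p": 0.1, "outcome": 0}, {"p": 0.5, "outcome": 1}, {"p": 0.9, "outcome": 0},
    {"p": 0.4, "outcome": 0},
]


def brier_score(records) -> float:
    """Mean squared error between stated probability and outcome (0 = perfect)."""
    return sum((r["p"] - r["outcome"]) ** 2 for r in records) / len(records)


def brier_skill_score(records) -> float:
    """1 - Brier / Brier of always forecasting the base rate.

    Positive means an edge over the base rate; zero or negative means none.
    """
    base_rate = sum(r["outcome"] for r in records) / len(records)
    reference = sum((base_rate - r["outcome"]) ** 2 for r in records) / len(records)
    return 1.0 - brier_score(records) / reference


def _bin_index(p: float, n_bins: int) -> int:
    # Round to a whole percentage first so 0.6 lands in the bin a reader expects
    # rather than drifting across an edge through floating-point error.
    return min(round(p * 100) * n_bins // 100, n_bins - 1)


def calibration_table(records, n_bins: int = 5) -> list[dict]:
    """Per bin: how often the event happened vs. the mean stated probability."""
    bins = [[] for _ in range(n_bins)]
    for r in records:
        bins[_bin_index(r["p"], n_bins)].append(r)
    table = []
    for i, members in enumerate(bins):
        if not members:
            continue  # empty bins are omitted rather than reported as zero
        table.append({
            "bin": i,
            "n": len(members),
            "mean_p": sum(r["p"] for r in members) / len(members),
            "observed": sum(r["outcome"] for r in members) / len(members),
        })
    return table


def expected_calibration_error(records, n_bins: int = 5) -> float:
    """Count-weighted gap between stated and observed frequency across bins."""
    n = len(records)
    return sum(row["n"] / n * abs(row["observed"] - row["mean_p"])
               for row in calibration_table(records, n_bins))


def scorecard(records) -> dict:
    """The real numbers, reported whether or not they flatter the forecaster."""
    return {
        "n": len(records),
        "brier": brier_score(records),
        "skill_vs_base_rate": brier_skill_score(records),
        "ece": expected_calibration_error(records),
        "calibration": calibration_table(records),
    }


if __name__ == "__main__":
    print(scorecard(SAMPLE_RECORD))
```

Run over the constructed record this reports a Brier score of 0.206 against a base-rate reference of 0.24, so a modest positive skill, while the calibration table shows the 0.8-1.0 bin resolving only two times in three; a forecaster who simply stated the base rate every time would score a skill of exactly zero.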
Adjudicates a published benchmark claim against its raw run-logs — REPRODUCED / DID-NOT-REPRODUCE / CHERRY-PICKED — sealed into a re-runnable receipt. Surfaces the dropped, flipped, and fabricated rows that re-run leaderboards and reproducibility badges miss; survived a 3-lens adversarial pass.
Flag unverified claims in AI output; verify that every cited claim is supported by its source; a research agent that grounds its answers or abstains rather than fabricating.
Not a portfolio of assertions — a portfolio you can re-run.