Ultrafix
Debug a stubborn bug with isolated git worktrees for parallel hypothesis testing plus structured logging — reproduce, isolate, find root cause, verify a minimal fix, clean up. Any repo.
How to use it
Comes with the ai-devkit plugin (install — two commands). Once installed, type this in any repo:
```
/ultrafix tests flake on ci but pass locally
```
That's the Claude Code form; in Codex, type `$ultrafix` instead. In either tool you can also just describe the task in plain words, and the agent picks the skill up by itself.
That’s all you do — the agent runs the whole workflow itself. Curious, or want to audit it? The playbook it follows is collapsed under Under the hood below.
Under the hood: the playbook the agent follows
Nothing in here is for you to run.
Everything in this section is read and executed by the agent when you invoke the skill. It’s published so you can audit it, learn from it, or adapt it — not because you need to follow it yourself.
Hunt down a hard bug methodically instead of guessing. The core idea: reproduce first, isolate hypotheses in separate git worktrees so you can test several at once without polluting your workspace, add structured logging to see what's actually happening, and only fix once the evidence names a root cause. Evidence before fix, always.
Bug Description:
$ARGUMENTS
Workflow
Phase 1: Reproduce Reliably
You cannot fix what you can't trigger on demand.
```bash
# Detect the project's test/run command (CI workflow first, then manifest/Makefile); don't hardcode it.
# Then run the minimal reproduction and capture the exact failure.
$REPRODUCE_CMD 2>&1 | tee /tmp/ultrafix-repro.log
```
Pin down: exact command, environment, frequency (always vs flaky), and the precise error/symptom. If it's intermittent (e.g. "flakes on CI"), note the variables that differ between passing and failing runs (parallelism, ordering, timezone, network, clock, resource limits). If you cannot reproduce it at all, stop and gather more signal: a fix you can't verify is a guess.
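One way to pin down the variables that differ between a passing and a failing run is to snapshot them in both places and diff the two files. The sketch below is an illustration under assumptions, not part of the skill: the function name `capture_env`, the default output path, and the particular set of keys are hypothetical, so extend the list with whatever your bug might be sensitive to.

```shell
# Hypothetical helper: write a sorted key=value snapshot of run conditions to a file.
# The chosen keys are illustrative assumptions, not an exhaustive list.
capture_env() {
  out=${1:-/tmp/ultrafix-env.txt}
  {
    printf 'TZ=%s\n' "${TZ:-unset}"
    printf 'LANG=%s\n' "${LANG:-unset}"
    printf 'LC_ALL=%s\n' "${LC_ALL:-unset}"
    printf 'CI=%s\n' "${CI:-unset}"
    printf 'kernel=%s\n' "$(uname -s)"
    # getconf _NPROCESSORS_ONLN is widely available but not guaranteed; fall back to "unknown".
    printf 'cpus=%s\n' "$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo unknown)"
    printf 'utc_offset=%s\n' "$(date +%z)"
  } | sort > "$out"
}
```

Running `capture_env local.txt` on your machine and `capture_env ci.txt` inside the CI job, then `diff local.txt ci.txt`, gives a short list of candidate variables to feed into the hypotheses.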
Phase 2: Form Hypotheses
From the symptom and a quick read of the suspect code, list concrete, falsifiable hypotheses, each one a specific claim you can prove or disprove:
```
H1 (test order dependency): a shared fixture leaks state between tests.
H2 (race): an async task isn't awaited, so assertions run before it completes.
H3 (environment): CI's timezone/locale differs and a date comparison flips.
```
Rank by likelihood × cheapness to test. Use the Task tool to investigate the codebase for evidence supporting or killing each one before spinning up worktrees.
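The ranking rule (likelihood × cheapness) can be made explicit with a small score table. This is only a sketch: the 1-to-5 scores are subjective guesses that you assign, and the function name `rank_hypotheses` and its input format are assumptions made for illustration.

```shell
# Hypothetical helper: read "id likelihood cheapness" lines on stdin (integer scores, e.g. 1..5)
# and print "id score" sorted by likelihood * cheapness, highest first.
rank_hypotheses() {
  awk 'NF >= 3 { printf "%s %d\n", $1, $2 * $3 }' | sort -k2,2 -rn
}

# Example scores are made up for the three hypotheses above.
rank_hypotheses <<'EOF'
h1 3 5
h2 4 2
h3 1 4
EOF
# prints:
# h1 15
# h2 8
# h3 4
```

Here H2 is judged the most likely cause, but H1 ranks first because it is much cheaper to test.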
Phase 3: Isolate Each Hypothesis in a Worktree
Give every non-trivial hypothesis its own git worktree so probes (extra logging, a candidate fix, a config tweak) never collide and can run in parallel:
```bash
ROOT=$(git rev-parse --show-toplevel)
BASE=$(git branch --show-current)
for H in h1 h2 h3; do
  git worktree add -b "ultrafix/$H" "../${ROOT##*/}-$H" "$BASE"
done
git worktree list
```
Each worktree is a clean, independent checkout: instrument H1 in one, try a fix for H2 in another, and they don't interfere.
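The worktree commands in this phase use a `../` path and the current branch name, so they implicitly assume you are at the repository root on a named branch. A hedged variation that also copes with a subdirectory or a detached HEAD might look like the sketch below. The helper names `sibling_path` and `base_ref` are hypothetical, and `base_ref` only works inside a git repository.

```shell
# Hypothetical helper: build an absolute sibling path "<repo>-<hypothesis>" from the
# repository root, so the result does not depend on the current directory.
sibling_path() {
  printf '%s-%s\n' "${1%/}" "$2"
}

# Hypothetical helper: current branch name, or the commit hash when HEAD is detached
# (git branch --show-current prints nothing in that case).
base_ref() {
  b=$(git branch --show-current)
  if [ -n "$b" ]; then printf '%s\n' "$b"; else git rev-parse HEAD; fi
}

sibling_path /home/me/src/myrepo h2   # prints /home/me/src/myrepo-h2
```

With these, the loop body could become `git worktree add -b "ultrafix/$H" "$(sibling_path "$ROOT" "$H")" "$(base_ref)"`.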
Phase 4: Add Structured Debug Logging
In the relevant worktree, add structured, greppable log lines (not bare prints) at decision points around the suspect path; adapt the mechanism to your stack's logger:
```
# Tag lines so you can filter them out of noise and find them again to remove.
LOG "[ULTRAFIX h2] entering retry loop attempt=$n state=$state elapsed_ms=$dt"
```
Conventions that pay off:
- A consistent prefix/tag (`[ULTRAFIX hN]`) per hypothesis → easy `grep`, easy cleanup.
- Log inputs, branch taken, and timing at each fork, enough to reconstruct the actual execution order.
- For races/flakes, log timestamps and which task/thread emitted the line.
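For shell scripts, the `LOG` placeholder could be filled in with something like the sketch below. It is an assumption about one possible shape, not the skill's own logger: the function name `ultrafix_log` is hypothetical, and the one-second timestamp is too coarse for tight races, where your stack's logger with a sub-second clock is the better fit.

```shell
# Hypothetical helper: emit one tagged, greppable line on stderr.
# Usage: ultrafix_log <hypothesis> <message> [key=value ...]
ultrafix_log() {
  h=$1; shift
  # UTC timestamp plus the shell's PID, so interleaved output can be ordered and attributed.
  printf '%s [ULTRAFIX %s] pid=%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$h" "$$" "$*" >&2
}

ultrafix_log h2 "entering retry loop" attempt=3 state=pending
```

Writing to stderr keeps the probes out of any stdout that the program under test is expected to produce, while `2>&1 | tee` in the reproduction command still captures them.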
Phase 5: Test Each Hypothesis in Isolation
Run the reproduction in each worktree and read the structured output:
```bash
( cd "../${ROOT##*/}-h2" && $REPRODUCE_CMD 2>&1 | tee /tmp/ultrafix-h2.log )
grep '\[ULTRAFIX h2\]' /tmp/ultrafix-h2.log
```
For flaky bugs, loop the run to make the signal statistically real:
```bash
( cd "../${ROOT##*/}-h2"; for i in $(seq 1 20); do $REPRODUCE_CMD >/dev/null 2>&1 && echo "run $i PASS" || echo "run $i FAIL"; done )
```
Each result confirms or kills a hypothesis. Kill the wrong ones explicitly; narrowing is progress.
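When comparing hypotheses, a single failure count per worktree is often easier to read than twenty PASS/FAIL lines. A minimal sketch, assuming a hypothetical helper named `flake_rate` that takes the run count followed by the command to execute:

```shell
# Hypothetical helper: run a command N times and report how many runs failed.
# Usage: flake_rate <runs> <command> [args ...]
flake_rate() {
  runs=$1; shift
  fails=0
  i=1
  while [ "$i" -le "$runs" ]; do
    "$@" >/dev/null 2>&1 || fails=$((fails + 1))
    i=$((i + 1))
  done
  printf 'runs=%s fails=%s\n' "$runs" "$fails"
}

flake_rate 5 false   # prints runs=5 fails=5
```

Because the reproduction command is stored as a string, it would be passed as `flake_rate 20 sh -c "$REPRODUCE_CMD"` from inside the relevant worktree.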
Phase 6: Name the Root Cause
Converge on one explanation that the evidence supports end-to-end: the logs show the bad state arising, the isolating worktree reproduces it, and removing the cause makes it stop. Write the root cause in one or two sentences — what is wrong and why it produces the symptom. Don't fix until you can state this.
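To check that the logs really show the bad state arising, it can help to compare the probe lines of a passing run against those of a failing run. The sketch below is one possible approach under assumptions: the helper names `normalize_probes` and `diff_probes` are hypothetical, and the only volatile field it masks is `elapsed_ms`, taken from the example log line in Phase 4.

```shell
# Hypothetical helper: keep only tagged probe lines and mask the volatile timing field,
# so two runs can be compared line by line.
normalize_probes() {
  grep -F '[ULTRAFIX' "$1" | sed -E 's/elapsed_ms=[0-9]+/elapsed_ms=N/g'
}

# Hypothetical helper: print the differences between a passing and a failing log.
# Returns 0 when the probe sequences match, non-zero otherwise.
diff_probes() {
  pass_tmp=$(mktemp)
  fail_tmp=$(mktemp)
  normalize_probes "$1" > "$pass_tmp"
  normalize_probes "$2" > "$fail_tmp"
  diff "$pass_tmp" "$fail_tmp"
  rc=$?
  rm -f "$pass_tmp" "$fail_tmp"
  return "$rc"
}
```

The first line where the two sequences diverge is usually the place to look: it marks the earliest point at which the failing run took a different branch or saw different state.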
Phase 7: Implement a Minimal Fix + Verify
In the winning worktree (or back on `$BASE`), make the smallest change that addresses the root cause, with no opportunistic refactors riding along.
```bash
# Verify against the original reproduction. For flakes, loop enough to trust it.
$REPRODUCE_CMD 2>&1 | tee /tmp/ultrafix-verify.log
# Flaky case: confirm a long green streak (e.g. 20–50 runs) before believing it.
```
A fix is "verified" only when the reproduction that used to fail now passes repeatedly, not when the code "looks right."
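The "long green streak" check can be scripted so that it stops at the first failure instead of running to the end. A minimal sketch, assuming a hypothetical helper named `green_streak`; the streak length you pass is your own judgment call, guided by how often the bug used to fire.

```shell
# Hypothetical helper: require N consecutive passes; stop at the first failure.
# Usage: green_streak <runs> <command> [args ...]
green_streak() {
  want=$1; shift
  n=1
  while [ "$n" -le "$want" ]; do
    if ! "$@" >/dev/null 2>&1; then
      printf 'FAIL on run %s of %s\n' "$n" "$want"
      return 1
    fi
    n=$((n + 1))
  done
  printf 'PASS %s consecutive runs\n' "$want"
}
```

For example, `green_streak 50 sh -c "$REPRODUCE_CMD"` returns non-zero as soon as one run fails, which makes it usable as a gate before cleanup.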
Phase 8: Clean Up
First make sure the winning fix is on your working branch: if it only lives on a hypothesis branch, bring it over (cherry-pick or merge) before deleting anything.
```bash
# 1) Salvage check: any commit listed here is UNIQUE to a scratch branch. Move it to your
#    working branch before cleanup, or you will lose it.
for h in h1 h2 h3; do git log --oneline "$BASE..ultrafix/$h" 2>/dev/null; done
# 2) List every debug probe you added, then delete each one (grep only finds them).
grep -rn 'ULTRAFIX' .
# 3) Tear down the scratch worktrees, then delete the branches with -d (NOT -D).
#    -d refuses to delete a branch with unmerged commits, so it can't silently drop the fix.
git worktree remove "../${ROOT##*/}-h1"
git worktree remove "../${ROOT##*/}-h2"
git worktree remove "../${ROOT##*/}-h3"
git branch -d ultrafix/h1 ultrafix/h2 ultrafix/h3   # errors here = unsalvaged commits; review before forcing
git worktree prune
```
Only fall back to `git branch -D` after you have confirmed the branch's unique commits are already on your working branch. Leave only the minimal fix (plus a regression test, if one was missing) on the working branch.
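Since `grep` only lists probes, a final check that fails while any remain can guard against committing one by accident. The sketch below is an assumption about how such a check might look: the function name `assert_no_probes` is hypothetical, and it skips the `.git` directory so that refs and commit messages mentioning the tag do not count.

```shell
# Hypothetical helper: succeed only when no tagged probe is left under a directory.
# Usage: assert_no_probes [dir]   (defaults to the current directory)
assert_no_probes() {
  dir=${1:-.}
  # Skip .git; the extra /dev/null operand makes grep print file names even for one file.
  hits=$(find "$dir" -name .git -prune -o -type f -exec grep -n 'ULTRAFIX' /dev/null {} +)
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits"
    return 1
  fi
}
```

Running `assert_no_probes` from the working branch before the final commit prints each leftover as `file:line:text` and returns non-zero until all of them are gone.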
Quick Reference
The loop
Reproduce → hypothesize → isolate (worktree) → instrument → test → root cause → minimal fix → verify → clean up.
Discipline
- No fix without a reproduction. If you can't trigger it, you can't verify it.
- Evidence before fix. Logs + isolation name the cause; you don't guess it.
- One hypothesis per worktree. Parallel probes, zero cross-contamination.
- Minimal fix. Address the root cause only; no drive-by refactors.
- Verify by repetition for flakes — one green run proves nothing.
Cheatsheet
- Worktree: `git worktree add -b ultrafix/hN ../repo-hN <base>` → `git worktree remove <path>` → `git worktree prune`.
- Logging: tag every probe (`[ULTRAFIX hN] …`) so it greps cleanly and is trivial to delete; log inputs, branch taken, timing; add timestamps/thread for races.
Adapt it to another stack
Ready-made prompt for Claude / Codex
Adapt to your platform
Copy this prompt into Claude or Codex, fill in your stack, and it will generate an adapted version for your project.
Port an agent skill (Claude Code / Codex SKILL.md format) to my project.
Below is "ultrafix" from ai-devkit (origin platform: cross, portability: adaptable).
What it does: Debug a stubborn bug with isolated git worktrees for parallel hypothesis testing plus structured logging — reproduce, isolate, find root cause, verify a minimal fix, clean up. Any repo.
Read it, then produce an equivalent SKILL.md (plus any scripts) for MY project:
<describe your stack, conventions, and repo here>
Keep the workflow's structure and intent. Replace platform-specific tooling and references per these notes:
Adapt the logging mechanism and the test/reproduce command to your stack (detect the project's test command the same way the rest of the toolkit does). The worktree-per-hypothesis isolation and the evidence-before-fix discipline are portable as-is.
List what you changed and why.
--- SKILL.md ---
# Ultrafix
Hunt down a hard bug methodically instead of guessing. The core idea: **reproduce first, isolate hypotheses in separate git worktrees so you can test several at once without polluting your workspace, add structured logging to see what's actually happening, and only fix once the evidence names a root cause.** Evidence before fix, always.
## Bug Description:
$ARGUMENTS
## Workflow
### Phase 1: Reproduce Reliably
You cannot fix what you can't trigger on demand.
```bash
# Detect the project's test/run command (CI workflow first, then manifest/Makefile) — don't hardcode it.
# Then run the minimal reproduction and capture the exact failure.
$REPRODUCE_CMD 2>&1 | tee /tmp/ultrafix-repro.log
```
Pin down: exact command, environment, frequency (always vs flaky), and the precise error/symptom. If it's intermittent (e.g. "flakes on CI"), note the variables that differ between passing and failing runs (parallelism, ordering, timezone, network, clock, resource limits). **If you cannot reproduce it at all, stop and gather more signal** — a fix you can't verify is a guess.
### Phase 2: Form Hypotheses
From the symptom and a quick read of the suspect code, list **concrete, falsifiable** hypotheses — each one a specific claim you can prove or disprove:
```
H1 — Test order dependency: a shared fixture leaks state between tests.
H2 — Race: an async task isn't awaited, so assertions run before it completes.
H3 — Environment: CI's timezone/locale differs and a date comparison flips.
```
Rank by likelihood × cheapness to test. Use the Task tool to investigate the codebase for evidence supporting or killing each one before spinning up worktrees.
### Phase 3: Isolate Each Hypothesis in a Worktree
Give every non-trivial hypothesis its **own git worktree** so probes (extra logging, a candidate fix, a config tweak) never collide and can run in parallel:
```bash
ROOT=$(git rev-parse --show-toplevel)
BASE=$(git branch --show-current)
for H in h1 h2 h3; do
git worktree add -b "ultrafix/$H" "../${ROOT##*/}-$H" "$BASE"
done
git worktree list
```
Each worktree is a clean, independent checkout: instrument H1 in one, try a fix for H2 in another, and they don't interfere.
### Phase 4: Add Structured Debug Logging
In the relevant worktree, add **structured, greppable** log lines (not bare prints) at decision points around the suspect path — adapt the mechanism to your stack's logger:
```
# Tag lines so you can filter them out of noise and find them again to remove.
LOG "[ULTRAFIX h2] entering retry loop attempt=$n state=$state elapsed_ms=$dt"
```
Conventions that pay off:
- A consistent prefix/tag (`[ULTRAFIX hN]`) per hypothesis → easy `grep`, easy cleanup.
- Log **inputs, branch taken, and timing** at each fork — enough to reconstruct the actual execution order.
- For races/flakes, log timestamps and which task/thread emitted the line.
### Phase 5: Test Each Hypothesis in Isolation
Run the reproduction in each worktree and read the structured output:
```bash
( cd "../${ROOT##*/}-h2" && $REPRODUCE_CMD 2>&1 | tee /tmp/ultrafix-h2.log )
grep '\[ULTRAFIX h2\]' /tmp/ultrafix-h2.log
```
For flaky bugs, **loop** the run to make the signal statistically real:
```bash
( cd "../${ROOT##*/}-h2"; for i in $(seq 1 20); do $REPRODUCE_CMD >/dev/null 2>&1 && echo "run $i PASS" || echo "run $i FAIL"; done )
```
Each result confirms or kills a hypothesis. Kill the wrong ones explicitly — narrowing is progress.
### Phase 6: Name the Root Cause
Converge on **one** explanation that the evidence supports end-to-end: the logs show the bad state arising, the isolating worktree reproduces it, and removing the cause makes it stop. Write the root cause in one or two sentences — *what* is wrong and *why* it produces the symptom. Don't fix until you can state this.
### Phase 7: Implement a Minimal Fix + Verify
In the winning worktree (or back on `$BASE`), make the **smallest** change that addresses the root cause — no opportunistic refactors riding along.
```bash
# Verify against the original reproduction. For flakes, loop enough to trust it.
$REPRODUCE_CMD 2>&1 | tee /tmp/ultrafix-verify.log
# Flaky case: confirm a long green streak (e.g. 20–50 runs) before believing it.
```
A fix is "verified" only when the reproduction that used to fail now passes repeatedly — not when the code "looks right."
### Phase 8: Clean Up
**First make sure the winning fix is on your working branch** — if it only lives on a hypothesis branch, bring it over (cherry-pick or merge) before deleting anything.
```bash
# 1) Salvage check: any commit listed here is UNIQUE to a scratch branch — move it to your
# working branch before cleanup, or you will lose it.
for h in h1 h2 h3; do git log --oneline "$BASE..ultrafix/$h" 2>/dev/null; done
# 2) List every debug probe you added, then delete each one (grep only finds them).
grep -rn 'ULTRAFIX' .
# 3) Tear down the scratch worktrees, then delete the branches with -d (NOT -D).
# -d refuses to delete a branch with unmerged commits, so it can't silently drop the fix.
git worktree remove "../${ROOT##*/}-h1"
git worktree remove "../${ROOT##*/}-h2"
git worktree remove "../${ROOT##*/}-h3"
git branch -d ultrafix/h1 ultrafix/h2 ultrafix/h3 # errors here = unsalvaged commits; review before forcing
git worktree prune
```
Only fall back to `git branch -D` after you have confirmed the branch's unique commits are already on your working branch. Leave only the minimal fix (plus a regression test, if one was missing) on the working branch.
## Quick Reference
### The loop
Reproduce → hypothesize → isolate (worktree) → instrument → test → root cause → minimal fix → verify → clean up.
### Discipline
- **No fix without a reproduction.** If you can't trigger it, you can't verify it.
- **Evidence before fix.** Logs + isolation name the cause; you don't guess it.
- **One hypothesis per worktree.** Parallel probes, zero cross-contamination.
- **Minimal fix.** Address the root cause only; no drive-by refactors.
- **Verify by repetition** for flakes — one green run proves nothing.
### Cheatsheet
- Worktree: `git worktree add -b ultrafix/hN ../repo-hN <base>` → `git worktree remove <path>` → `git worktree prune`.
- Logging: tag every probe (`[ULTRAFIX hN] …`) so it greps cleanly and is trivial to delete; log inputs, branch taken, timing; add timestamps/thread for races.