Tooling

Meet SousChef, an Experiment in CyberChef Recipes from a Local LLM

· 7 min read

Meet SousChef, an Experiment in CyberChef Recipes from a Local LLM
Index5 sections

Anyone in the DFIR world can relate to this - you come across a command line that has a powershell -enc blob with seemingly a bagillion characters of Base64, and you know from experience there’s probably another layer or two underneath. This could involve compression via gzip, maybe a single-byte XOR using a key the script kindly left lying around. You then walk it through CyberChef by hand, something you’ve done a thousand times (and likely seen it throw invalid blah blah back at your face a similar amount). But it’s tedious…and exactly the kind of pattern-matching a language model is good at.

SousChef is a Python-based CLI tool I’ve been building to do that first pass for you. You hand it an obfuscated payload, it asks a local Ollama model what the recipe should look like, sanitizes and validates the model’s output against a known operation catalog, and hands you back a CyberChef URL with the recipe already loaded. The payload itself stays on-device.

It lives on GitHub at github.com/zerber0s/souschef. Fair warning up front though that it’s experimental - the prompt is still being tuned / battle-tested and there are some quirks (more on those below).

Why I built it

DFIR triage on encoded samples is a lot of mechanical work. Most “interesting” payloads I see in the wild aren’t doing anything novel cryptographically, they’re usually just stacking 3-4 well-known wrappers (base64 → UTF-16LE → gzip → XOR, etc.) and hoping the layering buys time. The slow part isn’t decoding any single layer, but identifying which layers are present and in what order. Then you leverage a tool like CyberChef to make it human-readable.

The other piece is sample sensitivity. Half of the obfuscated content I’d actually want a model’s opinion on (even malware) is stuff I can’t paste into a hosted API. This could be Client data or PII-adjacent and, usually being part of an active engagement, the unknowns need to limit how you handle the data. Knowing others face this same sceanrio, the design constraint was always “this has to run on the analyst’s machine, on a model the analyst controls.” The tool can even run against a local instance of CyberChef for the most sensitive of situations. Ollama running qwen3-coder:30b locally turned out to be a reasonable sweet spot on Apple Silicon: code-tuned and disciplined enough about structured output to produce parseable recipe JSON most of the time.

How it works

End-to-end, one run looks like this:

InputLocal modelParse & repairSanitizeNormalizeHeuristicsConfidenceOutput

Each step expanded:

  1. Input - a file, a stdin pipe, or a --input string. The same blob you’d paste into CyberChef.
  2. Local model - SousChef sends the payload plus a fairly large system prompt to Ollama. The system prompt encodes the CyberChef operation catalog (~122 ops) it can use, a set of few-shots (odd LLM lingo for examples) covering common DFIR patterns, and rules about argument formating / shape.
  3. Recipe parsing & repair - the model returns JSON. SousChef automatically handles fence markers, dangling brackets, and the usual LLM output noise, then parses out the recipe.
  4. Sanitization - anything that looks like a PowerShell execution sink (IEX, Invoke-Expression, trailing & calls) is stripped. These aren’t CyberChef ops, so if the model emits them, it’s confused about the boundary between “decode this” and “run this.”
  5. Argument normalization - coerces each op’s arguments into CyberChef’s exact positional format. This is the part that bit me hardest in early testing (see the “Where it is today” section below).
  6. Heuristic detectors - a panel of currently ~11 small checks runs over the recipe and a Python-side simulation of its output. They flag things like “the output is still mostly non-printable, you probably need another XOR layer” or “these two ops cancel each other out.”
  7. Confidence scoring - rolls everything up into HIGH / MEDIUM / LOW with a list of actionable signals.
  8. Output - assembles a CyberChef URL fragment, prints it, optionally copies it to the clipboard, optionally opens it in a browser.

The value this tool brings at a high-level:

🔒

Runs entirely offline

Samples are sent to a local Ollama model on your machine. No cloud APIs, no third-party telemetry.

🧪

Heuristic validation

A panel of small Python checks flags missing layers, redundant op pairs, and garbage output before you click the URL.

📚

Operation-catalog enforcement

Recipes are constrained to the known CyberChef op set. Hallucinated ops get caught at parse time, not in your browser.

📊

Confidence scoring

Every run produces a HIGH / MEDIUM / LOW signal with a short list of "why" and "what to check next."

🔗

Browser-ready URLs

Terminal output contains a CyberChef URL fragment with the recipe pre-loaded. Can be configured to auto-open in browser as well.

🛰️

Air-gap friendly

A --cyberchef flag points the URL at a self-hosted CyberChef instance for sensitive engagements.

What I tested it against

All testing was performed against a mix of benign sample data, generated by AI from known techniques / things I have seen in the field, and malicious samples pulled from public repositories such as VirusTotal.

A representative slice of what works end-to-end today:

FamilyShape
PowerShell -EncodedCommand / -encUTF-16LE base64 wrappers, with and without inner layers
Empire-style multi-layer$s1 + $s2 substitution + base64 + UTF-16LE + gzip + single-byte XOR
Invoke-Obfuscation COMPRESSReversed base64 + DeflateStream
AES-CBCAesCryptoServiceProvider with key/IV extraction
RC4Passphrase-keyed, base64-wrapped payloads
ChaCha20Stream-cipher payloads
Charcode + XOR@(N,N,N) | %{ [char]($_ -bxor $k) } patterns
Custom-alphabet base64Paired $std / $norm translation tables
Meterpreter format-string stagers-f operator with concatenation chains
Bare ROT13’d-base64 blobsInner base64 alphabet ROT13’d before encoding, SousChef auto-detects and prepends ROT13 to the recipe

Full coverage list, including patterns explicitly out of scope (cmd.exe DOSfuscation, raw shellcode disassembly, identifier-renaming-only obfuscation), lives in the SousChef README.

Most of my recent debugging time has gone into samples where the obfuscation pattern looked extremely similar but had a small twist (i.e. a custom base64 alphabet whose decode was silently falling back to the standard alphabet, or an RC4 sample where the key was hex-encoded one way and the model assumed another). Those cases actually produced perfect recipes that just…gave you garbage. They’re the reason that the heuristic detector layer exists at all (in addition to some iterative assistance from Claude Code).

Where it is today

Honest status, as of the time of this post:

  • Verified end-to-end on qwen3-coder:30b against the tested patterns. Smaller models (7B, 13B) do tend to degrade, but gracefully (they generate plausible recipes but miss the trickier multi-layer cases). Larger models work fine if you have the RAM.
  • Argument normalizer is critical, not cosmetic. CyberChef’s URL fragment parser expects positional arguments in an exact order, otherwise named-object arguments silently fall back to defaults. I was working alongside an unknown bug for a while where decoding custom-alphabet base64 looked successful but actually used the standard alphabet, only fixed by a stricter shape enforcement via the normalizer.
  • A few ops have non-obvious weird quirks. From Hex gets forced to the Auto delimiter (handles dashes, spaces, colons, line breaks) and I have no clue why. ROT13 and ROT47 are purposely not treated as terminal ops, since they’re legitimate middle steps in real chains. Find / Replace is forced to global matching to work around UI-vs-URL inconsistencies in CyberChef itself. All of these determined through testing (and pain).
  • Out of scope situations end with a graceful fallback. Bohannon-style cmd.exe DOSfuscation, raw shellcode disassembly, and identifier-renaming-only obfuscation don’t produce CyberChef recipes (even though I tried). In these and similar cases, the model is instructed to produce a Comment op explaining why instead of guessing.

All of the above is from me being only a few commits in. The system prompt is still the part most likely to change between sessions, which is also why I keep a more accessible static copy in the repo here. If you use this tool and something that worked yesterday doesn’t work today, the few-shot examples are the first place to look.

Try it

If you try it on a sample and the recipe is consistently wrong, or even sometimes, file an issue with the input (sanitized as needed), the model you used, and what you expected the recipe to be. That’s how this thing will continue to improve - every weird sample is a regression test waiting to be added that’ll only enhance the accuracy of future submissions.

I may update this post in the future, or write a follow-up, if this tool advances past the experimental phase.

ZB