
precursor

Precursor logo

Live demos · Scenario corpus · Crates.io

precursor is a CLI for pre-protocol payload tagging + similarity clustering across packet payloads, logs, and raw binary fragments. It combines PCRE2 named-capture matching, TLSH/LZJD/FBHash similarity, optional MRSHv2 adapter mode, and JSON outputs designed for fast triage loops before full parser or protocol tooling exists.

Release 0.2.1 Highlights

  • Packet Inference: single-packet protocol scoring via -P / -A / -k
  • Blob + Binary Processing: -z, --input-blob and -B, --input-binary for multiline and raw-byte stream analysis
  • Similarity Workflows: TLSH, LZJD, or FBHash clustering plus protocol hints (--protocol-hints) for discovery loops
  • Sigma Compatibility: --sigma-rule converts selectors into named PCRE captures and enforces Sigma condition logic
  • Regex Engine Scaffold: --regex-engine pcre2|vectorscan selection (vectorscan mode runs compatibility checks and executes through the current PCRE2 path)
  • Output Contract: stable protocol_*, similarity_hash, tags, and xxh3_64_sum JSON fields
  • Reliability: runtime ingest path no longer relies on panic-prone expect(...) calls
  • Scenario Corpus: versioned packet/firmware/ICS plus public PCAP/log/Sigma-derived samples in samples/scenarios/
  • Release Ops: dependency auto-bump/tag workflows, benchmark harness, and Pages site

60-second teaser

cat samples/scenarios/pre-protocol-packet-triage/payloads.b64 \
  | precursor -p samples/scenarios/pre-protocol-packet-triage/patterns.pcre \
      -m base64 -t -d --similarity-mode lzjd -P --protocol-hints

Representative output shape:

{
  "tags": ["http_method"],
  "similarity_hash": "lzjd:128:...",
  "protocol_label": "http",
  "protocol_confidence": 0.93,
  "protocol_candidates": [
    {"protocol": "http", "score": 0.93, "evidence": ["matched HTTP request/headers"]}
  ],
  "xxh3_64_sum": "..."
}

Why this matters:

  • Fast pre-protocol triage when DPI/parsers are unavailable.
  • Stable JSON fields for SOC pipelines and enrichment tooling.
  • Built-in cluster context for LLM-assisted protocol discovery.

Best Fit

  • Rapid triage of payload lines from logs, brokers, sensors, or ad-hoc captures.
  • Label-first workflows where regex capture names become downstream tags.
  • Similarity-first clustering when full protocol parsers are unavailable or too brittle.
  • Early-stage protocol discovery where hints are fed to humans or LLM tooling.

Non-Goals

  • Not a replacement for full IDS/NSM stacks (Suricata, Zeek).
  • Not a malware rule engine replacement (YARA / YARA-X).
  • Not a full protocol parser stack; this is pre-protocol triage and clustering.

Architecture

Precursor architecture diagram

Install

Cargo

cargo install precursor

From source

git clone https://github.com/Obsecurus/precursor.git
cd precursor
cargo build --release
./target/release/precursor --help

Release archives include generated shell completion files (bash, fish, zsh, powershell) and a precursor.1 man page.

Optional: Build with MRSHv2 adapter mode

mock_dir="$(mktemp -d)"
ci/build_mrshv2_mock.sh "$mock_dir"
PRECURSOR_MRSHV2_LIB_DIR="$mock_dir" cargo build --features similarity-mrshv2
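
A hypothetical follow-up run against that debug build (payloads.raw and patterns/new are placeholder inputs; the adapter library must remain on the loader path):

# hypothetical invocation; requires the similarity-mrshv2 feature build above
LD_LIBRARY_PATH="$mock_dir:${LD_LIBRARY_PATH:-}" ./target/debug/precursor \
  -p patterns/new -m string -t -d --similarity-mode mrshv2 < payloads.raw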

Quick start

1) Match string payloads from stdin

printf 'hello world\nbye world\n' \
  | precursor '(?<hello_tag>hello)' -m string

2) Match base64 payloads

printf 'aGVsbG8gd29ybGQ=\n' \
  | precursor '(?<hello_tag>hello)' -m base64

3) Load patterns from file

printf 'aGVsbG8gd29ybGQ=\n' \
  | precursor -p patterns/new -m base64

4) Extract payload from JSON before matching

printf '{"payload":"aGVsbG8gd29ybGQ="}\n' \
  | precursor '(?<hello_tag>hello)' -j '.payload' -m base64

5) Enable similarity diffing (TLSH default)

cat payloads.b64 \
  | precursor -p patterns/new -m base64 -t -d -x 80

6) Switch to LZJD similarity mode

cat payloads.raw \
  | precursor -p patterns/new -m string -t -d --similarity-mode lzjd -x 80

7) Switch to FBHash similarity mode

cat payloads.raw \
  | precursor -p patterns/new -m string -t -d --similarity-mode fbhash -x 90

8) Emit protocol-discovery hints for an LLM loop

cat payloads.b64 \
  | precursor -p patterns/new -m base64 -t -d --protocol-hints --protocol-hints-limit 20

9) Enable single-packet protocol inference output

cat payloads.b64 \
  | precursor -p patterns/new -m base64 -P -A 0.7 -k 5

10) Match a multiline payload as one blob

printf 'GET /blob HTTP/1.1\nHost: blob.example\n' \
  | precursor '(?<multi>GET /blob HTTP/1\.1\nHost: blob\.example)' -m string -z

11) Match a raw-binary blob (short flag)

printf '\x7fELF\x02\x01\x01\x00' \
  | precursor '(?<elf_magic>^\x7fELF)' -B

12) Load Sigma rules with condition gating

cat samples/scenarios/sigma-linux-shell-command-triage/payloads.log \
  | precursor --sigma-rule samples/scenarios/sigma-linux-shell-command-triage/sigma_rule.yml \
      -m string -t -d --similarity-mode lzjd

13) Run vectorscan compatibility mode (PCRE2 fallback)

cat payloads.raw \
  | precursor '(?<http_get>GET)' -m string --regex-engine vectorscan

14) Replay a real public Log4Shell PCAP (FBHash mode)

cat samples/scenarios/public-log4shell-foxit-pcap/payloads.string \
  | precursor -p samples/scenarios/public-log4shell-foxit-pcap/patterns.pcre \
      -m string -t -d --similarity-mode fbhash -P --protocol-hints

15) Triage public firmware blobs in binary folder mode

precursor -p samples/scenarios/public-firmware-binwalk-magic/patterns.pcre \
  -f samples/scenarios/public-firmware-binwalk-magic/blobs \
  --input-mode binary -t -d --similarity-mode lzjd -P --protocol-hints

CLI reference

precursor [PATTERN] [OPTIONS]

At least one pattern source is required: positional PATTERN, --pattern-file, or --sigma-rule.

Pattern source:

  • positional PATTERN (single named-capture regex)
  • -p, --pattern-file <PATH> (one named-capture pattern per line)
  • --sigma-rule <PATH> (Sigma YAML selectors converted to named-capture PCRE patterns with condition enforcement)
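
A pattern file is plain text with one named-capture pattern per line. A minimal sketch that writes a hypothetical patterns/new (the regexes are illustrative, not shipped with the repo):

# hypothetical pattern file contents
mkdir -p patterns
cat > patterns/new <<'EOF'
(?<http_get>^GET\s+\S+\s+HTTP/1\.[01])
(?<shell_wget>wget\s+https?://)
EOF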

Input:

  • -f, --input-folder <PATH>: read newline-delimited content from files
  • stdin: read newline-delimited input from standard input
  • -z, --input-blob: process each input source as one blob instead of line splitting
  • -B, --input-binary: treat each source as raw binary bytes (implies blob processing semantics)
  • -m, --input-mode <base64|string|hex|binary>: decode mode (default: base64)
  • -j, --input-json-key <QUERY>: extract payload from JSON input first
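
For example, hex mode decodes hex-encoded lines before matching; the payload below is just "hello world" in hex:

# hypothetical hex-mode run
printf '68656c6c6f20776f726c64\n' \
  | precursor '(?<hello_tag>hello)' -m hex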

Similarity:

  • -t, --tlsh: compute TLSH hash for matched payloads
  • -d, --tlsh-diff: compute pairwise TLSH distance among matched payloads
  • -a, --tlsh-algorithm <48_1|128_1|128_3|256_1|256_3>
  • -x, --tlsh-distance <N>: max distance threshold (default: 100)
  • -l, --tlsh-length: include payload length in diff scoring
  • -y, --tlsh-sim-only: only output payloads that have TLSH similarities
  • --similarity-mode <tlsh|lzjd|mrshv2|fbhash>:
    • tlsh, lzjd, and fbhash are implemented in default builds
    • mrshv2 is implemented behind --features similarity-mrshv2 and native adapter linking
    • fbhash currently uses an in-tree FBHash-inspired chunk-vector model for stream-friendly pairwise scoring
  • --protocol-hints: emit LLM-oriented protocol-discovery hint JSON to stderr
  • --protocol-hints-limit <N>: limit hint candidate count (default: 25)
  • -P, --single-packet: enable heuristic protocol inference on each matched payload
  • -A, --abstain-threshold <0.0-1.0>: minimum confidence required to emit a non-unknown label (default: 0.65)
  • -k, --protocol-top-k <N>: candidate count included in protocol_candidates (default: 3)
  • --regex-engine <pcre2|vectorscan>: regex engine selection (vectorscan mode emits compatibility checks and executes through current PCRE2 path)
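
As a hypothetical tightening of the quick-start diff workflow, -x and -y can be combined to lower the distance threshold and keep only payloads that landed in at least one similarity pair:

# hypothetical: stricter threshold, similarity-only output
cat payloads.b64 \
  | precursor -p patterns/new -m base64 -t -d -x 50 -y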

Other:

  • -s, --stats: emit run statistics JSON to stderr

Output model

Each matched payload is emitted as JSON on stdout with fields such as:

  • tags: array of matched capture names
  • tlsh: active similarity hash when enabled (legacy field name preserved for compatibility)
  • similarity_hash: backend-agnostic similarity hash field
  • xxh3_64_sum: stable payload key for report correlation
  • tlsh_similarities: distance map when --tlsh-diff is enabled
  • protocol_label: top protocol guess (or unknown when abstaining)
  • protocol_confidence: confidence score for protocol_label
  • protocol_abstained: whether inference abstained under threshold
  • protocol_candidates: scored candidate list with evidence strings
  • sigma_rule_matches: Sigma rule titles whose condition evaluated true (when --sigma-rule is used)
  • sigma_rule_ids: stable Sigma rule IDs/slugs that evaluated true
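
Because these field names are stable, downstream filters can key on them directly. A hypothetical jq pass that keeps only non-abstained records and prints each protocol label with its tags:

# hypothetical post-filter over the NDJSON record stream
precursor -p patterns/new -m base64 -P < payloads.b64 \
  | jq -r 'select(.protocol_abstained | not) | [.protocol_label, (.tags | join(","))] | @tsv'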

When --stats is enabled, a summary JSON object is emitted to stderr. See STATS.md for schema, field meanings, and jq examples. When --protocol-hints is enabled, an additional hint JSON block is emitted to stderr for LLM-guided protocol discovery workflows, including protocol_* fields when single-packet inference is enabled. When both --single-packet and --tlsh-diff are enabled, protocol confidence is cluster-boosted using similarity neighbor counts. When --input-blob is enabled (or --input-binary is set), each file/stdin stream is treated as a single candidate payload.

Stats quick view

cat payloads.b64 \
  | precursor -p patterns/new -m base64 -t -d --similarity-mode lzjd --stats \
  1>/tmp/records.ndjson 2>/tmp/stats.json

jq '.Environment + {input_count: .Input.Count, similarities: .Compare.Similarities}' /tmp/stats.json

Notes:

  • Compare may be empty when too few matched payloads produce pairwise distances.
  • Historical record field names like tlsh_similarities are preserved for compatibility across TLSH/LZJD/FBHash modes.

Positioning vs adjacent tools

  • Use Suricata/Zeek for full protocol-aware IDS/NSM and rich ecosystem integrations.
  • Use YARA/YARA-X for signature-based scanning of files and malware-centric workflows.
  • Use Sigma for backend-agnostic detection content and SIEM portability.
  • Use Precursor when you need lightweight payload tagging + similarity clustering, or when you want to run Sigma keyword intent directly against raw payload streams via --sigma-rule.

Scenario corpus and demos

  • Scenario corpus: samples/scenarios/
  • Demo runner: samples/scenarios/run_all.sh
  • Static demo site source: site/
  • Includes packet/firmware/ICS plus public PCAP-derived Log4Shell probes, real fox-it Log4Shell PCAP replay, Sigma shell-command triage, Zeek DNS log triage, and real binwalk firmware blob samples.
  • Site includes mini replay reels that visually walk through PCAP replay, firmware blob triage, and Sigma labeling behavior.
  • tshark is only required when regenerating PCAP-derived payload extracts.

Run the full corpus against a release build:

samples/scenarios/run_all.sh ./target/release/precursor

Benchmarks

Generate a reproducible scenario snapshot:

cargo build --release
ci/benchmark_scenarios.sh ./target/release/precursor benchmarks/latest.md

Committed baseline:

  • benchmarks/baseline-2026-02-13.md

GitHub Pages + precursor.hashdb.io

  • Pages workflow: .github/workflows/pages.yml
  • Site content: site/
  • Custom domain file: site/CNAME
  • Configure DNS at your provider with:
    • record type: CNAME
    • host: precursor
    • value: obsecurus.github.io
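
Once the record is live, it can be sanity-checked from the command line (the expected answer assumes the CNAME above and should point at the GitHub Pages target):

# should print obsecurus.github.io.
dig +short CNAME precursor.hashdb.io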

Current roadmap

  • ROADMAP.md: prioritized milestones and release criteria.
  • SIMILARITY_BACKENDS.md: MRSHv2/FBHash feasibility and backend sequencing.
  • SIGMA_INTEGRATION.md: Sigma feature coverage and next steps.
  • HARDWARE_ACCELERATION.md: regex acceleration/offload strategy.
  • STATS.md: run-statistics schema and usage guidance.

Development

cargo fmt --all --check
cargo test --workspace
mock_dir="$(mktemp -d)"
ci/build_mrshv2_mock.sh "$mock_dir" >/dev/null
PRECURSOR_MRSHV2_LIB_DIR="$mock_dir" LD_LIBRARY_PATH="$mock_dir:${LD_LIBRARY_PATH:-}" cargo test --workspace --features similarity-mrshv2

CI currently tests multiple toolchains and targets, including a pinned Rust 1.86.0 lane. Dependabot is configured for weekly Cargo and GitHub Actions updates, with auto patch-version bump + auto-tag workflows so dependency updates can flow into release builds.

Background

License

Dual-licensed under MIT and Unlicense.
