phishML is a CLI-first phishing detection pipeline for raw RFC 5322 .eml emails. It parses messages, extracts security-focused features, builds datasets, trains a simple baseline model, and then scores new emails offline. Datasets can be built from folders of .eml files or from CSV text sources, and the resulting feature sets can be merged for repeatable experiments.

obsTR/phishML

phishML

Phishing detection pipeline for raw .eml emails. CLI-first: parse -> feature extraction -> dataset -> baseline model -> scoring.

Quickstart

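# Create and activate a virtual environment (Windows PowerShell)
python -m venv .venv
.\.venv\Scripts\Activate.ps1

# Install dependencies, then phishML itself in editable mode
pip install -r requirements.txt
pip install -e .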

# Parse a single email
python -m phishml.cli parse --input .\data\sample.eml --output .\out.json

# Build dataset from folders (expects data/raw/phish and data/raw/ham)
python -m phishml.cli build-dataset --input-dir .\data\raw --output .\data\features.csv

# Build dataset from CSV text sources
python -m phishml.cli build-dataset-text --input-dir .\data\csv --output .\data\features_text.csv

# Merge multiple datasets (here, the two feature files built above)
python -m phishml.cli merge-datasets --inputs .\data\features_text.csv .\data\features.csv --output .\data\features_all.csv

# Train baseline model
python -m phishml.cli train --dataset .\data\features.csv --model-out .\models\baseline.pkl

# Score a single email
python -m phishml.cli score --model .\models\baseline.pkl --input .\data\sample.eml
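
To give a feel for what the parse step works with, here is a minimal sketch that reads a raw RFC 5322 message using only Python's standard `email` package. It is an illustration, not phishML's actual parser: the function name `parse_eml` and the output keys (`from_addr`, `reply_to`, `subject`, `body`) are assumptions made for this example.

```python
from email import message_from_bytes, policy
from email.utils import parseaddr


def parse_eml(raw: bytes) -> dict:
    """Parse raw RFC 5322 bytes into a small dict (hypothetical schema)."""
    msg = message_from_bytes(raw, policy=policy.default)
    # get_body picks the plain-text part first and falls back to HTML.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""
    return {
        # parseaddr returns (display_name, address); keep the address only.
        "from_addr": parseaddr(str(msg.get("From", "")))[1],
        "reply_to": parseaddr(str(msg.get("Reply-To", "")))[1],
        "subject": str(msg.get("Subject", "")),
        "body": body,
    }


# A tiny in-memory message so the sketch runs without any files on disk.
sample = (
    b"From: Support <support@example.com>\r\n"
    b"Reply-To: attacker@example.net\r\n"
    b"Subject: Verify your account\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Click http://example.net/login to verify.\r\n"
)
parsed = parse_eml(sample)
print(parsed["from_addr"], parsed["reply_to"])
```

A mismatch between the From and Reply-To domains, as in this sample, is the kind of signal a security-focused feature extractor might look at.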

Dataset layout

 data/
   raw/
     phish/   # phishing .eml
     ham/     # benign .eml
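
The layout above implies that the label of each email comes from its parent folder. A rough sketch of that idea, assuming a label encoding of 1 for phish and 0 for ham (the encoding and the helper name `list_labeled_emails` are assumptions, not taken from phishML):

```python
import tempfile
from pathlib import Path

# Assumed encoding: folder name -> numeric label.
LABELS = {"phish": 1, "ham": 0}


def list_labeled_emails(raw_dir: Path) -> list[tuple[Path, int]]:
    """Return (path, label) pairs for every .eml under raw_dir/phish and raw_dir/ham."""
    rows = []
    for folder, label in LABELS.items():
        # sorted() keeps the order stable across runs, which helps repeatability.
        for path in sorted((raw_dir / folder).glob("*.eml")):
            rows.append((path, label))
    return rows


# Build a throwaway data/raw tree to show the function in action.
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "raw"
    for folder in LABELS:
        (raw / folder).mkdir(parents=True)
    (raw / "phish" / "a.eml").write_bytes(b"Subject: x\r\n\r\nbody\r\n")
    (raw / "ham" / "b.eml").write_bytes(b"Subject: y\r\n\r\nbody\r\n")
    labeled = [(p.name, y) for p, y in list_labeled_emails(raw)]

print(labeled)
```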

Notes

  • Input is raw RFC 5322 .eml.
  • The pipeline assumes no PII restrictions on its input, but it can be adapted to hash or strip sensitive fields.
  • URL parsing avoids network calls; TLD detection is heuristic.
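
As an illustration of the last note, URL features can be computed from the message text alone, without resolving or fetching anything. The sketch below is one possible approach, not phishML's implementation: the regex, the feature names, and the `SUSPICIOUS_TLDS` set are all hypothetical, and the TLD is simply taken as the last dot-separated label of the host.

```python
import re
from ipaddress import ip_address
from urllib.parse import urlsplit

# Simple pattern for http(s) URLs in free text; good enough for a sketch.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)
# Hypothetical watch list, chosen only for this example.
SUSPICIOUS_TLDS = {"zip", "top", "xyz"}


def _host(url: str) -> str:
    """Extract the lowercase hostname by string parsing only (no DNS lookup)."""
    try:
        return urlsplit(url).hostname or ""
    except ValueError:  # malformed URL, e.g. a broken IPv6 literal
        return ""


def _is_ip(host: str) -> bool:
    try:
        ip_address(host)
        return True
    except ValueError:
        return False


def url_features(text: str) -> dict:
    """Count URLs, IP-literal hosts, and hosts whose last label is on the watch list."""
    hosts = [_host(u) for u in URL_RE.findall(text)]
    ip_hosts = [h for h in hosts if _is_ip(h)]
    # Heuristic TLD: the last label of any non-IP host.
    tlds = [h.rsplit(".", 1)[-1] for h in hosts if h and not _is_ip(h)]
    return {
        "num_urls": len(hosts),
        "num_ip_hosts": len(ip_hosts),
        "num_suspicious_tld": sum(t in SUSPICIOUS_TLDS for t in tlds),
    }


body = "Verify at http://192.0.2.10/login or https://account-update.example.xyz/reset now."
features = url_features(body)
print(features)
```

Taking the last label as the TLD misreads multi-part suffixes such as `co.uk`, which is one reason this kind of detection is only a heuristic.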
