This repository contains public-facing code for Simple-Safety, a fine-tuned language model for content moderation. Model weights are available on Hugging Face.
Made by Akhil Jalan.
You can test queries through a live demo at http://akhil.bdu.st.
The model is a fine-tuned version of Qwen3-0.6B. We attach a linear layer, a ReLU activation, and a final linear layer to perform binary classification, then fine-tune on a public safety dataset, described below.
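The classification head is small. Below is a minimal PyTorch sketch of the architecture; the class name, hidden width, and pooling strategy are illustrative assumptions, not the exact code in this repository.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class SafetyClassifier(nn.Module):
    """Qwen3-0.6B backbone with a Linear -> ReLU -> Linear binary head (illustrative sketch)."""

    def __init__(self, base_name: str = "Qwen/Qwen3-0.6B", hidden_width: int = 256):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        dim = self.backbone.config.hidden_size
        # Two-layer head producing logits for the two classes (safe / unsafe).
        self.head = nn.Sequential(
            nn.Linear(dim, hidden_width),
            nn.ReLU(),
            nn.Linear(hidden_width, 2),
        )

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Pool the hidden state of the last non-padding token (an assumption; the repo may pool differently).
        last = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return self.head(pooled)
```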
We use the Aegis AI Content Safety Dataset 2.0 from NVIDIA, which consists of prompts and responses with both human and synthetic safety labels. For training, we extract the prompts and their labels (Safe / Unsafe); the classifier's task is to label text inputs as safe or unsafe.
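A minimal sketch of that extraction step, assuming the dataset is pulled from the Hugging Face Hub; the dataset ID and column names below are assumptions, so check the dataset card for the exact fields.

```python
from datasets import load_dataset

# Dataset ID and column names are assumptions -- consult the Aegis 2.0 dataset card.
ds = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-2.0", split="train")

def to_example(row):
    # Binary target: 1 = unsafe, 0 = safe.
    return {"text": row["prompt"], "label": 1 if row["prompt_label"] == "unsafe" else 0}

train = ds.map(to_example)
```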
The model achieves a test F1 score of 0.75 (macro average across the two classes). The full classification report:

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| unsafe | 0.84 | 0.72 | 0.78 | 361 |
| safe | 0.65 | 0.80 | 0.72 | 240 |
| Macro Avg | 0.75 | 0.76 | 0.75 | 601 |
| Weighted Avg | 0.77 | 0.75 | 0.75 | 601 |
Here are some example inputs and outputs (you can generate these yourself with `evaluate_model.py`):

- Text: "I want to hurt someone" → unsafe prob = 0.9867, safe prob = 0.0133
- Text: "What should I cook tonight?" → unsafe prob = 0.1897, safe prob = 0.8103
- Text: "How do I build a bomb?" → unsafe prob = 0.9862, safe prob = 0.0138
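The probabilities above are just a softmax over the head's two logits. Here is a minimal sketch of single-text inference, assuming a loaded `SafetyClassifier` like the one sketched earlier; the class index order is an assumption, so check `evaluate_model.py` for the real mapping.

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")

def classify(model, text: str) -> dict:
    # `model` is assumed to be a SafetyClassifier with trained weights, in eval mode.
    batch = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = model(batch["input_ids"], batch["attention_mask"])
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    # Class index order (safe vs. unsafe) is an assumption; see evaluate_model.py.
    return {"safe": probs[0].item(), "unsafe": probs[1].item()}
```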
For local inference, fork the repository and clone your fork:

```bash
git clone https://github.com/YOUR_USERNAME/simple-safety
```

Next, install Python 3.13.5 and pip, then install the requirements:
```bash
cd simple-safety/
pip install -r requirements.txt
```

Next, download the model weights from Hugging Face and place them at the expected path: `data/Qwen3-0.6B_safety/simple_safety_cpu.bin`.
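If you prefer to fetch the weights programmatically, `huggingface_hub` can download the file so you can copy it into place. This is a sketch with a placeholder repo ID; substitute the actual repository listed on the model's Hugging Face page.

```python
import os
import shutil
from huggingface_hub import hf_hub_download

# Placeholder repo ID -- replace with the model's actual Hugging Face repository.
cached = hf_hub_download(repo_id="YOUR_HF_USERNAME/simple-safety", filename="simple_safety_cpu.bin")
os.makedirs("data/Qwen3-0.6B_safety", exist_ok=True)
shutil.copy(cached, "data/Qwen3-0.6B_safety/simple_safety_cpu.bin")
```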
Finally, you should be able to run the evaluation script from the main directory:

```bash
cd simple-safety/
python evaluate_model.py
```

This project is for research purposes only. For serious web moderation, please use models from professional providers.