A machine learning solution for predicting Reddit comment rule violations using fine-tuned large language models. This project implements a comprehensive pipeline for data augmentation, model training, and ensemble inference to achieve high performance on the Jigsaw Agile Community Rules competition.
This solution addresses the task of predicting whether Reddit comments violate specific subreddit rules. The key challenge is building a model flexible enough to generalize to rules not present in the training data.
- Fine-tuning of Qwen2.5-7B-Instruct model using Unsloth
- Data augmentation pipeline for a 3x increase in training data (sketched below)
- Multi-model ensemble with test-time augmentation
- Comprehensive experiment tracking and configuration management
- Cross-validation strategy respecting rule distribution
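A minimal sketch of the idea behind the augmentation step (the real implementation lives in `src/data_augmentation.py`; `paraphrase` here is a hypothetical placeholder for a paraphrasing-model call):

```python
import pandas as pd

def paraphrase(text: str) -> str:
    # Hypothetical placeholder: the real pipeline would call a
    # paraphrasing model here instead of returning the input unchanged.
    return text

def augment_3x(train: pd.DataFrame) -> pd.DataFrame:
    """Append two paraphrased copies of each row, tripling the dataset."""
    copies = []
    for _ in range(2):
        aug = train.copy()
        aug["body"] = aug["body"].map(paraphrase)
        copies.append(aug)
    return pd.concat([train, *copies], ignore_index=True)
```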
Performance:

- Initial baseline: 0.81 AUC
- Current leaderboard score: 0.867 AUC
- Target performance: 0.90+ AUC
Prerequisites:

- Python 3.8 or higher
- CUDA-compatible GPU (16GB+ VRAM recommended) or Kaggle notebooks with 2x T4 GPUs
- Git
- Clone the repository:

```bash
git clone https://github.com/yourusername/jigsaw-competition.git
cd jigsaw-competition
```

- Run the setup script:

```bash
chmod +x run.sh
./run.sh setup
```

This will:
- Create a virtual environment
- Install all dependencies
- Create necessary directories
Run the full experiment pipeline:

```bash
./run.sh experiment
```

Or run individual steps:

- Data Augmentation:

```bash
./run.sh augment
```

- Training:

```bash
./run.sh train --config baseline_v2 --gpu 0
```

- Inference:

```bash
./run.sh inference --gpu 0
```

- Ensemble:

```bash
./run.sh ensemble
```

- Run Tests:

```bash
./run.sh test
```

Project structure:

```
├── .github/
│   └── workflows/
│       └── ci.yml               # CI/CD pipeline
├── src/
│   ├── __init__.py
│   ├── training.py              # Basic training script
│   ├── inference.py             # Inference pipeline
│   ├── improved_training.py     # Enhanced training with CV
│   ├── data_augmentation.py     # Data preprocessing
│   ├── ensemble_inference.py    # Multi-model ensemble
│   ├── config.py                # Configuration management
│   └── run_experiments.py       # Full pipeline orchestration
├── test/
│   ├── __init__.py
│   ├── test_data_processing.py
│   ├── test_model_utils.py
│   └── test_config.py
├── run.sh                       # Main execution script
├── requirements.txt             # Python dependencies
└── README.md                    # This file
```
Edit `src/config.py` to customize training parameters:

```python
from dataclasses import dataclass

@dataclass
class ExperimentConfig:
    model_name: str = "unsloth/Qwen2.5-7B-Instruct"
    lora_r: int = 32                      # LoRA rank
    max_steps: int = 500                  # Training steps
    learning_rate: float = 2e-4           # Learning rate
    use_augmented_data: bool = True       # Train on augmented data
    prompt_template: str = "template_v2"  # Prompt template version
```

The training data contains the following columns:

- `row_id`: Unique identifier
- `body`: Comment text
- `rule`: Rule being evaluated
- `subreddit`: Source subreddit
- `positive_example_1`, `positive_example_2`: Examples of rule violations
- `negative_example_1`, `negative_example_2`: Examples of non-violations
- `rule_violation`: Binary target (1 = violation, 0 = no violation)
The test data uses the same format but omits the `rule_violation` column.
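A minimal sketch of loading and sanity-checking this format with pandas (the file paths are assumptions; adjust them to your data layout):

```python
import pandas as pd

train = pd.read_csv("data/train.csv")  # path is an assumption
test = pd.read_csv("data/test.csv")

# Verify the columns described above.
expected = {
    "row_id", "body", "rule", "subreddit",
    "positive_example_1", "positive_example_2",
    "negative_example_1", "negative_example_2",
}
assert expected.issubset(train.columns) and expected.issubset(test.columns)
assert "rule_violation" in train.columns       # training target
assert "rule_violation" not in test.columns    # absent at inference time

# Class balance of the binary target.
print(train["rule_violation"].value_counts(normalize=True))
```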
Model architecture (loading is sketched after this list):

- Base Model: Qwen2.5-7B-Instruct
- Fine-tuning: LoRA (Low-Rank Adaptation)
- Quantization: 4-bit for memory efficiency
- Max Sequence Length: 2048 tokens
- Target Modules: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
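A minimal sketch of how this configuration maps onto Unsloth's loading API (`lora_alpha` and other unlisted hyperparameters are assumptions; the project's actual values live in `src/config.py`):

```python
from unsloth import FastLanguageModel

# Load the base model in 4-bit for memory efficiency.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen2.5-7B-Instruct",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters to the target modules listed above.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,  # LoRA rank (lora_r in ExperimentConfig)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_alpha=32,  # assumption; often set equal to r
    lora_dropout=0.0,
    bias="none",
    use_gradient_checkpointing=True,
)
```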
Training strategy:

- Cross-Validation: 5-fold, stratified by rule and subreddit (see the fold sketch below)
- Data Augmentation: Paraphrasing to triple the dataset size
- Hyperparameter Optimization: Grid search over LoRA rank, learning rate
- Extended Training: 500+ steps vs. the 60-step baseline
- Prompt Engineering: Multiple template variations
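A minimal sketch of the stratified split, assuming the `train` dataframe and column names from the data format above:

```python
from sklearn.model_selection import StratifiedKFold

# Build a composite key so each fold preserves the joint
# distribution of rules and subreddits.
strata = train["rule"] + "|" + train["subreddit"]

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(train, strata)):
    fold_train = train.iloc[train_idx]
    fold_val = train.iloc[val_idx]
    print(f"fold {fold}: {len(fold_train)} train / {len(fold_val)} val")
```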
Inference pipeline:

- Model Loading: Load fine-tuned checkpoints
- Prompt Formatting: Apply chat template with examples
- Constrained Generation: Force Yes/No outputs
- Probability Extraction: Get logprobs for binary classification
- Ensemble Averaging: Combine multiple model predictions (steps 2 to 5 are sketched below)
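A minimal sketch of steps 2 through 5, assuming a Hugging Face-style model and tokenizer and that "Yes"/"No" each encode to a single token (helper names are illustrative, not the project's actual API):

```python
import torch

def violation_probability(model, tokenizer, question: str) -> float:
    """Score P("Yes") against P("No") as the next token after the prompt."""
    # Step 2: apply the chat template to build the prompt.
    messages = [{"role": "user", "content": question}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    # Steps 3-4: read next-token logits and restrict them to Yes/No.
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    yes_id = tokenizer.encode("Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode("No", add_special_tokens=False)[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return probs[0].item()  # probability the comment violates the rule

# Step 5: ensemble averaging over several fine-tuned checkpoints.
def ensemble_score(models, question: str) -> float:
    scores = [violation_probability(m, t, question) for m, t in models]
    return sum(scores) / len(scores)
```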
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Unsloth team for the efficient fine-tuning framework
- Hugging Face for model hosting and tools
- Kaggle for hosting the competition
- Reddit communities for the data
For questions or collaboration, please open an issue on GitHub.
