AURORA is a research & engineering codebase for training and serving a multi-task machine‑learning model on Wikipedia text streams. The project demonstrates clean architecture, reproducible experiments, an end-to-end training/evaluation pipeline, and a lightweight FastAPI/CLI deployment.
- Data engineering: cleaning, deduplication, anomaly checks, anonymization
- Training: multi-task network (sentiment classification + intensity regression)
- Evaluation: accuracy, precision, recall, F1, intensity MAE on held‑out data
- Optimization: automated hyperparameter search using configuration files
- Serving: FastAPI endpoint and CLI chat interface for inference
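The evaluation metrics listed above (accuracy, precision, recall, F1, and intensity MAE) can be sketched in plain Python. This is an illustrative re-implementation for reference, not the project's actual evaluation code:

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Binary accuracy / precision / recall / F1 for one positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


def intensity_mae(y_true, y_pred):
    """Mean absolute error for the intensity regression head."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```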
shion_ai/
├── app.py # command-line entrypoint
├── api/server.py # FastAPI web server
├── aurora/ # core library
│ ├── core/ # algorithm implementations
│ ├── data/ # dataset utilities & streaming
│ ├── training/ # train / eval / optimize scripts
│ ├── serving/ # prediction and reply helpers
│ └── utils/ # configuration & logging helpers
├── cli/chat.py # simple chat client
├── configs/ # YAML configuration files
├── data/ # wiki extracts + generated gold sets
├── docs/ # supplementary documentation
├── artifacts/ # checkpoints, reports, outputs
└── requirements.txt # Python dependencies
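The YAML files under `configs/` drive training, evaluation, and optimization. A hypothetical `configs/base.yaml` to illustrate how the pieces fit together (all field names here are illustrative assumptions, not the project's actual schema):

```yaml
# Hypothetical example only -- field names are illustrative assumptions.
data:
  extracted_dir: data/extracted/    # WikiExtractor output
model:
  tasks: [sentiment, intensity]     # classification + regression heads
training:
  epochs: 3
  batch_size: 32
  learning_rate: 3e-4
artifacts:
  reports_dir: artifacts/reports/
```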
- Clone the repository:
  git clone https://github.com/Kevin28576/shion_ai.git
  cd shion_ai
- Create a virtual environment and install dependencies:
  python -m venv venv
  source venv/bin/activate   # or `venv\Scripts\activate` on Windows
  pip install -r requirements.txt
Most functionality is accessed via the main CLI wrapper in app.py.
Examples:
python app.py train --config configs/base.yaml
python app.py eval --config configs/base.yaml
python app.py optimize --config configs/base.yaml
python app.py serve --reload # start FastAPI server
python app.py chat # interactive command-line chat

Generate or evaluate a gold dataset:
python app.py build-gold --config configs/base.yaml \
--out data/gold/gold_candidates.jsonl --size 500
python app.py eval-gold --config configs/base.yaml \
--gold data/gold/gold_candidates.jsonl

- Input data: place WikiExtractor output under data/extracted/.
- Splits happen on the fly using hashing; no static processed files are required.
- Reports are written under artifacts/reports/.
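The README does not spell out how the hash splits work, so the following is a minimal sketch of how deterministic on-the-fly splitting is commonly done (the function name and the split percentages are assumptions, not the project's actual values):

```python
import hashlib


def assign_split(doc_id: str, eval_pct: int = 5, test_pct: int = 5) -> str:
    """Deterministically bucket a document into train/eval/test.

    Hashing a stable document id (rather than sampling randomly) keeps
    splits reproducible across runs without writing processed files.
    """
    bucket = int(hashlib.sha256(doc_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + eval_pct:
        return "eval"
    return "train"
```

Because the assignment depends only on the id, re-running the pipeline on the same extracts always reproduces the same splits.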
Contributions are welcome! Please open issues or pull requests. Follow the Code of Conduct and describe your changes clearly.
- Fork the repo.
- Create a feature branch:
  git checkout -b feature/your-feature
- Commit your changes, push, and open a PR.
See CONTRIBUTING.md (to be added) for more details.
This project is released under the MIT License.