Hate Speech Detection in Portuguese

Project Overview

This project develops a tool for detecting hate speech in Portuguese-language text. The goal is a robust, reliable model for identifying and analyzing hateful content in textual data.

This project was developed by two students from UFPB: Pedro Augusto Medeiros (GitHub) and Tales Nobre (GitHub).

Datasets

The repository contains:

Original datasets used in training.
Processed and concatenated versions of these datasets.
Synthetic data voluntarily contributed by other students from the Centro de Informática at UFPB.
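
As a rough illustration of how the original and synthetic data might be concatenated into one training set, here is a minimal sketch using only the Python standard library. The function name, the toy rows, the source names, and the binary 0/1 label scheme are all assumptions made for this example; the actual preprocessing lives in the repository's notebooks and may differ.

```python
from collections import Counter


def concatenate_datasets(*named_datasets):
    """Concatenate (name, rows) pairs into one list, tagging each row with its source."""
    merged = []
    for name, rows in named_datasets:
        for text, label in rows:
            merged.append(
                {"text": text.strip(), "label": int(label), "source": name}
            )
    return merged


# Hypothetical toy rows (assumed labels: 0 = not hateful, 1 = hateful).
dataset_a = [("exemplo de texto neutro", 0), ("exemplo de texto ofensivo", 1)]
dataset_b = [("outro texto neutro ", 0)]
synthetic = [("texto sintético ofensivo", 1)]

merged = concatenate_datasets(
    ("dataset_a", dataset_a),
    ("dataset_b", dataset_b),
    ("synthetic", synthetic),
)

# Checking label and source balance after merging helps catch skewed data early.
label_counts = Counter(row["label"] for row in merged)
source_counts = Counter(row["source"] for row in merged)
print(len(merged), dict(label_counts), dict(source_counts))
```

Keeping a source field on every row makes it possible to separate original from synthetic examples later, for instance when comparing augmented and non-augmented training runs.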

Training Logs

Below are the details of the training logs included in this repository:

primeiro_log: Training conducted on a dataset created by merging two original datasets, trained for 10 epochs.
segundo_log: Training performed on the same dataset as primeiro_log, but with a larger batch size.
terceiro_log: Training performed on an augmented dataset for 5 epochs, with validation run more frequently (fewer steps between validations).
quarto_log: Training performed on an augmented and post-processed dataset, trained for 5 epochs.
log_tokenizador_pt: Training conducted using a BERT tokenizer pre-trained specifically for Portuguese.
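
The runs above differ mainly in epochs, batch size, and validation frequency. The sketch below shows how those three settings determine the total number of training steps and the number of validation passes. Every number in it is hypothetical and chosen only for illustration (none is taken from the logs), the run names are placeholders, and it assumes one optimizer step per batch on a single device with no gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass
class RunConfig:
    name: str
    num_examples: int  # size of the training split (hypothetical)
    batch_size: int
    epochs: int
    eval_every: int    # training steps between validation passes

    @property
    def steps_per_epoch(self):
        # The last, possibly smaller, batch still counts as one step.
        return math.ceil(self.num_examples / self.batch_size)

    @property
    def total_steps(self):
        return self.steps_per_epoch * self.epochs

    @property
    def num_evaluations(self):
        return self.total_steps // self.eval_every


# Illustrative values only; they do not describe the repository's real runs.
runs = [
    RunConfig("baseline", num_examples=4000, batch_size=16, epochs=10, eval_every=100),
    RunConfig("larger_batch", num_examples=4000, batch_size=32, epochs=10, eval_every=100),
    RunConfig("augmented", num_examples=6000, batch_size=32, epochs=5, eval_every=10),
]

for run in runs:
    print(run.name, run.steps_per_epoch, run.total_steps, run.num_evaluations)
```

Under these assumptions, doubling the batch size halves the number of steps for the same data, so step-based checkpoint numbers from runs with different batch sizes are not directly comparable.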

Model Information

The trained models are hosted on Google Drive rather than in this repository because of GitHub storage limitations. All models were trained using the BERT base model from the Hugging Face library.

The top-performing models are:

Top1: Trained on the augmented and processed dataset, checkpoint 990.
Top2: Trained on the augmented dataset, checkpoint 820.
Top3: Trained on the non-augmented dataset, checkpoint 1230.

For access to the models, please refer to the provided Google Drive link or contact the project maintainers.
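
Once a checkpoint has been downloaded, loading it might look like the sketch below. It assumes the checkpoints were saved in the Hugging Face Trainer layout (folders named checkpoint-<step>) and that the model has a sequence-classification head; neither is confirmed by this README. The local folder "models/top1", the helper names, and the tokenizer_name argument are hypothetical, and the tokenizer identifier must be supplied by the user.

```python
from pathlib import Path


def checkpoint_dir(output_dir, step):
    """Build the folder path for a given step, assuming the 'checkpoint-<step>' layout."""
    return Path(output_dir) / f"checkpoint-{step}"


def load_classifier(checkpoint_path, tokenizer_name):
    """Load tokenizer and model; requires the transformers package."""
    # Imported lazily so the path helper above works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint_path)
    model.eval()  # disable dropout for inference
    return tokenizer, model


def predict_label(text, tokenizer, model):
    """Return the index of the highest-scoring class for one text."""
    import torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1).item())


# "models/top1" is a hypothetical local folder holding the downloaded Top1 model.
top1_path = checkpoint_dir("models/top1", 990)
print(top1_path)
```

The meaning of each predicted class index depends on how labels were encoded during training, so it should be checked against the training notebooks before interpreting predictions.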

License

This project is released under an open-source license. Please review the LICENSE file for more details.

For any questions or further inquiries, please contact the project maintainers or contribute via GitHub.

Repository: pedroaugvsto/DescoBERTo