This repository uses biodiversity data from the BioTIME database to classify methods texts using fastText.
## Requirements

- a local copy of BioTIME and its metadata
- conda (Miniconda or Anaconda)
- a Python fastText binding (see the installation section for details)
## Installation

This section guides you through setting up the project to run experiments with Snakemake and fastText.
1. Clone this repository:

   ```bash
   $ git clone https://github.com/komax/BioTIME-fastText-classification
   ```

2. Create a new environment, e.g., `biotime-fasttext`, and install all dependencies:

   ```bash
   $ conda env create --name biotime-fasttext --file environment.yaml
   ```

3. Activate the conda environment. Either use the Anaconda Navigator or run this command in your terminal:

   ```bash
   $ conda activate biotime-fasttext
   ```

   or

   ```bash
   $ source activate biotime-fasttext
   ```

## fastText Python bindings

Disclaimer: you can `pip install fasttext` in your anaconda environment, but those bindings are outdated. I recommend the following instead:

0. First, activate your anaconda environment.
1. Check out the GitHub repository of fastText or a stable fork:

   ```bash
   $ git clone https://github.com/komax/fastText
   ```

2. Install the Python bindings from within the fastText repository:
   ```bash
   $ pip install .
   ```
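To check that the binding works, you can train a throwaway model. This is a minimal sketch; the two training lines and their labels are made up and only illustrate fastText's `__label__` input format:

```python
# Sanity check for the fastText Python binding: train a tiny supervised
# model on two made-up lines and run a prediction.
import tempfile

import fasttext

with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("__label__methods sampling was carried out monthly\n")
    f.write("__label__other this study was funded by a grant\n")
    train_path = f.name

model = fasttext.train_supervised(input=train_path, epoch=1)
print(model.predict("samples were collected monthly"))
```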
## BioTIME data

Create a symlink or copy your BioTIME data into the `biotime` directory.

## NLTK data

nltk requires downloading additional data before it can tokenize sentences. Run this in your Python shell:
```python
>>> import nltk
>>> nltk.download('punkt')
```

or run

```bash
$ python scripts/download-nltk-punkt.py
```
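Once the `punkt` data is available, nltk can split a methods text into sentences. A minimal sketch (the example text is made up):

```python
# Sentence tokenization with nltk; requires the punkt data downloaded above.
from nltk import sent_tokenize

text = "Samples were collected monthly. Traps were emptied every week."
print(sent_tokenize(text))
# ['Samples were collected monthly.', 'Traps were emptied every week.']
```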
## Configuration

All configuration parameters are stored in the `Snakefile`. Change the parameters to suit your purposes.

## Run the experiments

Adjust `-j <num_cores>` in your snakemake calls to run jobs on multiple cores at the same time, e.g., `snakemake -j 4 sort_f1_scores`.

1. Normalize the data for fastText:

   ```bash
   $ snakemake normalize_fasttext
   ```

2. Create the data for cross validation, split the model parameters into blocks, and sort the model parameters by their f1 scores on the training data:

   ```bash
   $ snakemake sort_f1_scores
   ```

3. Select the best model (from the cross validation) and train it:

   ```bash
   $ snakemake train_model
   ```

4. Test the model:

   ```bash
   $ snakemake test_model
   ```

Alternatively, run the entire workflow at once:

```bash
$ snakemake
```

Snakemake can visualize the workflow using dot. Run the following to generate a png of the workflow:

```bash
$ snakemake --dag all | dot -Tpng > dag.png
```

## Experimental setup

Check out the `Snakefile` and adjust this section to configure the experimental setup (parameter selection, cross validation, parallelization):

```python
KFOLD = 2
TEST_SIZE = 0.25
CHUNKS = 4
PARAMETER_SPACE = ModelParams(
    dim=ParamRange(start=10, stop=100, num=2),
    lr=ParamRange(start=0.1, stop=1.0, num=2),
    wordNgrams=ParamRange(start=2, stop=5, num=2),
    epoch=ParamRange(start=5, stop=50, num=2),
    bucket=ParamRange(start=2_000_000, stop=10_000_000, num=2)
)
FIRST_N_SENTENCES = 1
```
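For a sense of scale, here is a sketch of how this grid presumably expands, assuming `ParamRange(start, stop, num)` yields `num` evenly spaced values (like `numpy.linspace`; the actual class lives in the `Snakefile`):

```python
# Sketch of the parameter-grid expansion, assuming ParamRange(start,
# stop, num) behaves like numpy.linspace.
from itertools import product

import numpy as np

dim = np.linspace(10, 100, num=2)                   # [ 10., 100.]
lr = np.linspace(0.1, 1.0, num=2)                   # [0.1, 1.0]
word_ngrams = np.linspace(2, 5, num=2)              # [2., 5.]
epoch = np.linspace(5, 50, num=2)                   # [5., 50.]
bucket = np.linspace(2_000_000, 10_000_000, num=2)  # [2e6, 1e7]

# 2 values per parameter and 5 parameters: 2**5 = 32 combinations,
# presumably split into CHUNKS = 4 blocks of 8 models each.
grid = list(product(dim, lr, word_ngrams, epoch, bucket))
print(len(grid))  # 32
```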
## Data and results

The (sub)directory `data` contains intermediate data from the data transformation and selection steps, the chunking of the parameter space (`data/blocks`), and the subsampling for cross validation (`data/cv`).

The `results` directory contains the parameterization of the experiments as well as the accuracy scores, measured as f1 scores over precision and recall:
- `results/blocks` contains all chunks (including the validation scores) as csvs,
- `results/params_scores.csv` is the concatenation of all blocks,
- `results/params_scores_sorted.csv` ranks the resulting scores by the `f1_cross_validation_micro` score on the cross validation sets per label.

We then select the model with the smallest `f1_cross_validation_micro_ptp`, that is, the smallest point-to-point distance (minimum value to maximum value) across the cross validation scores.
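A sketch of this selection step with pandas, assuming `results/params_scores.csv` holds one row per parameter combination with the two columns named above:

```python
# Rank parameter combinations by their micro-averaged cross validation
# f1 score, then pick the most stable candidate: the one with the
# smallest point-to-point spread (max - min) across the folds.
import pandas as pd

scores = pd.read_csv("results/params_scores.csv")

ranked = scores.sort_values("f1_cross_validation_micro", ascending=False)
ranked.to_csv("results/params_scores_sorted.csv", index=False)

best = ranked.loc[ranked["f1_cross_validation_micro_ptp"].idxmin()]
print(best)
```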