DNABERT For Next Sentence Prediction

This project applies a Natural Language Processing (NLP) model to improve DNA assembly. One of the main challenges in genomics is determining a genome sequence from reads, which are DNA fragments. The standard procedure is to construct a graph from information about overlapping reads and then find a path through the graph that represents the final sequence. However, there are no exact algorithms that can solve the graph in a reasonable amount of time. This project therefore uses deep learning methods and reinforcement learning algorithms for graph simplification, with the goal of developing suitable and efficient models for locating critical patterns in graphs. Specifically, the NLP model is applied at nodes of the assembly graph to predict the next sequence node, which helps to simplify the graph.

Features

New subclass for sentence prediction
Data preparation files for k-mers using 3 different methods:
- Non-overlapping
- Sliding window by k-mer
- Sliding window by base
Subreads mapping to reference genome
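
The k-mer methods listed above differ in how far the window moves between consecutive k-mers. The following is a minimal sketch, not the code from the data preparation files: the function name `seq_to_kmers` and its `stride` parameter are hypothetical, and it assumes that "non-overlapping" means a stride of k bases and "sliding window by base" means a stride of one base. The "sliding window by k-mer" method is not defined in this README, so it is not covered here.

```python
def seq_to_kmers(seq, k, stride):
    """Return the k-mers of `seq`, taking one k-mer every `stride` bases.

    Trailing bases that do not fill a whole k-mer are dropped.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


seq = "ATGCGTAC"

# Assumed meaning of "non-overlapping": the window jumps k bases each step.
non_overlapping = seq_to_kmers(seq, k=3, stride=3)
print(non_overlapping)  # ['ATG', 'CGT']

# Assumed meaning of "sliding window by base": the window moves one base each step.
by_base = seq_to_kmers(seq, k=3, stride=1)
print(" ".join(by_base))  # ATG TGC GCG CGT GTA TAC
```

With a stride of one base, neighbouring k-mers share k-1 bases, so the same sequence yields far more tokens than with the non-overlapping method.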

Installation

Clone the DNABERT repo and install the dependencies on the server.

git clone https://github.com/jerryji1993/DNABERT
cd DNABERT
python3 -m pip install --editable .
cd examples
python3 -m pip install -r requirements.txt

Download the pretrained models from here.

Data Preparation w/o Mapping

Run (k)mer_(method)_(train/dev)_(no. of records)rec.py to generate the respective datasets. For example, 6mer_bybase_dev_200rec.py will generate a .tsv file with:

6-mers
Sliding window by base method
An evaluation dataset (dev)
Using 200 records from subreads
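
The naming convention above can be read mechanically. As an illustration only, the sketch below decodes a script name into its four parts; the helper `parse_script_name` and the regular expression are hypothetical and are not part of this repository.

```python
import re

# Pattern for (k)mer_(method)_(train/dev)_(no. of records)rec.py
NAME_RE = re.compile(r"^(\d+)mer_([A-Za-z]+)_(train|dev)_(\d+)rec\.py$")


def parse_script_name(name):
    """Split a data preparation script name into its four settings."""
    match = NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised script name: {name}")
    k, method, split, records = match.groups()
    return {"k": int(k), "method": method, "split": split, "records": int(records)}


print(parse_script_name("6mer_bybase_dev_200rec.py"))
# {'k': 6, 'method': 'bybase', 'split': 'dev', 'records': 200}
```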

Data Preparation w/ Mapping

Run trim_ecoli_reference.py to trim the original complete E. coli genome to 100,000 bases. This generates trimmed_100000.fasta.
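
For readers who want to see what such a trimming step involves, here is a minimal sketch. It is not the content of trim_ecoli_reference.py: the function `trim_fasta` and its parameters are hypothetical, and it assumes that the first 100,000 bases of the first FASTA record are kept, which this README does not state.

```python
def trim_fasta(in_path, out_path, n_bases=100_000, line_width=60):
    """Write the first `n_bases` bases of the first FASTA record to `out_path`.

    Returns the number of bases written.
    """
    header = None
    chunks = []
    kept = 0
    with open(in_path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    break  # a second record starts here; only the first is used
                header = line
                continue
            if header is None:
                raise ValueError("FASTA file does not start with a header line")
            piece = line[: n_bases - kept]
            chunks.append(piece)
            kept += len(piece)
            if kept >= n_bases:
                break
    if header is None:
        raise ValueError("no FASTA record found")

    seq = "".join(chunks)
    with open(out_path, "w") as out:
        out.write(header + "\n")
        # Wrap the sequence to a fixed line width, as is conventional for FASTA.
        for i in range(0, len(seq), line_width):
            out.write(seq[i:i + line_width] + "\n")
    return len(seq)
```

Under these assumptions, a call such as `trim_fasta("ecoli_complete.fasta", "trimmed_100000.fasta")` would produce the trimmed reference; the input file name here is a placeholder.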

Run the following command to map the reads in SRR10971019.fastq to the trimmed reference and write the mappings to trimmed_mapping.paf:

minimap2 -x map-hifi --secondary=no trimmed_100000.fasta SRR10971019.fastq > trimmed_mapping.paf

Use filtered_0.9_bybase_3mer.py to extract the records in trimmed_mapping.paf and generate the corresponding 3-mer datasets.
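
The extraction step reads PAF records, whose first 12 tab-separated columns are fixed by the format (query name, length, start, end, strand, target name, length, start, end, number of matching bases, alignment block length, mapping quality). The sketch below shows one way such records could be filtered. It is not the code of filtered_0.9_bybase_3mer.py: the functions are hypothetical, and it assumes that the "0.9" in the script name is a minimum identity threshold (matching bases divided by alignment block length), which this README does not confirm.

```python
def parse_paf_line(line):
    """Parse the 12 mandatory columns of one PAF line into a dict."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("a PAF line needs at least 12 tab-separated columns")
    return {
        "query": f[0], "query_len": int(f[1]),
        "query_start": int(f[2]), "query_end": int(f[3]),
        "strand": f[4],
        "target": f[5], "target_len": int(f[6]),
        "target_start": int(f[7]), "target_end": int(f[8]),
        "matches": int(f[9]), "block_len": int(f[10]), "mapq": int(f[11]),
    }


def filter_paf(lines, min_identity=0.9):
    """Keep records whose matches / block_len is at least `min_identity`.

    The 0.9 default is an assumption about what the script name refers to.
    """
    kept = []
    for line in lines:
        if not line.strip():
            continue
        rec = parse_paf_line(line)
        identity = rec["matches"] / rec["block_len"] if rec["block_len"] else 0.0
        if identity >= min_identity:
            kept.append(rec)
    return kept


# Two made-up records for illustration; they are not taken from trimmed_mapping.paf.
example_lines = [
    "\t".join(["read1", "1000", "0", "1000", "+", "ref", "100000", "500", "1500", "950", "1000", "60"]),
    "\t".join(["read2", "1000", "0", "1000", "-", "ref", "100000", "2000", "3000", "800", "1000", "60"]),
]
kept = filter_paf(example_lines)
print([rec["query"] for rec in kept])  # ['read1']
```

The `target_start` and `target_end` fields of each kept record give the region of the trimmed reference that the read maps to, which is the information needed to order reads before building k-mer datasets from them.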

Finetuning DNABERT Model

Use the respective datasets generated to finetune the pretrained DNABERT model downloaded previously.

Note: The task name for sentence prediction is "sp"; it is passed as --task_name SP on line 5 of the command below.

python -u run_finetune_withdebugmessages.py \
    --model_type dna \
    --tokenizer_name=dna$KMER \
    --model_name_or_path $MODEL_PATH \
    --task_name SP \
    --do_train \
    --do_eval \
    --data_dir $DATA_PATH \
    --max_seq_length 75 \
    --per_gpu_eval_batch_size=16   \
    --per_gpu_train_batch_size=16   \
    --learning_rate 2e-4 \
    --num_train_epochs 3.0 \
    --output_dir $OUTPUT_PATH \
    --evaluate_during_training \
    --logging_steps 100 \
    --save_steps 4000 \
    --warmup_percent 0.1 \
    --hidden_dropout_prob 0.1 \
    --overwrite_output \
    --weight_decay 0.01 \
    --n_process 8 > filtered_0.9_bybase_3mer.txt
