EPBoost

EPBoost is a fast and accurate method for identifying enhancer-promoter interactions from intrinsic features of genomic sequences. It uses the k-mer counts of the sequences as input features and trains and predicts with a CatBoost model.
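To illustrate what k-mer counts look like as model inputs, here is a minimal sketch. It is not EPBoost's actual featurization code, which is adapted from seekr and also normalizes the counts. The function names `kmer_counts` and `pair_features`, the decision to skip windows containing non-ACGT characters, and the concatenation of enhancer and promoter counts into one vector are all assumptions made for illustration.

```python
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping k-mers in a DNA sequence.

    Returns a list of 4**k counts in lexicographic k-mer order (AAA, AAC, ...).
    Windows containing characters other than A, C, G, T (for example N) are
    skipped; that choice is an assumption of this sketch.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        window = seq[start:start + k]
        if window in index:
            counts[index[window]] += 1
    return counts


def pair_features(enhancer_seq, promoter_seq, k):
    """Concatenate enhancer and promoter k-mer counts (assumed layout)."""
    return kmer_counts(enhancer_seq, k) + kmer_counts(promoter_seq, k)


if __name__ == "__main__":
    features = pair_features("ACGTACGTNACG", "TTTTGGGG", 3)
    print(len(features))  # 2 * 4**3 = 128 features for k = 3
```

With k between 3 and 7, a single sequence yields between 64 and 16384 count features, which is why the choice of k trades feature resolution against model size.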

Material

To evaluate the performance of the model, we extracted interaction data for 12 cell lines from TargetFinder (https://github.com/shwhalen/targetfinder) and DeepTACT (https://github.com/liwenran/DeepTACT).

Usage

FilePath

Before running EPBoost, make sure the file paths are set up as follows, taking cell line NHEK as an example:

EPBoost

EPBoost_Test.py
Predict.py
DataPrepare.py
EPBoost_Train.py
EPBoost2_Train.py

dataset

DeepTACT
TargetFinder

NHEK

Training

STEP1

$ python DataPrepare.py cell_line
This step pads input enhancers to 3000 bp and promoters to 2000 bp. The cell_line argument is the name of the cell line.
Only one input file is needed: dataset/TargetFinder/cell_line/pairs.csv (or dataset/DeepTACT/cell_line/pairs.csv)
Three files are produced: enhancers.bed, promoters.bed, and train.csv

Example

$ python DataPrepare.py NHEK
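The padding performed in STEP1 can be sketched as follows. This is a hedged illustration, not the code in the data preparation script: extending symmetrically around the midpoint, clamping at position 0, and the helper names `pad_interval` and `bed_line` are assumptions. Regions longer than the target length would be trimmed by this sketch, and chromosome ends are not checked.

```python
def pad_interval(chrom, start, end, target_length):
    """Resize a region to target_length bp, centered on its midpoint.

    Symmetric extension and clamping at 0 are assumptions of this sketch.
    """
    if end <= start:
        raise ValueError("end must be greater than start")
    midpoint = (start + end) // 2
    new_start = max(0, midpoint - target_length // 2)
    new_end = new_start + target_length
    return chrom, new_start, new_end


def bed_line(chrom, start, end, name):
    """Format one tab-separated BED record."""
    return f"{chrom}\t{start}\t{end}\t{name}"


if __name__ == "__main__":
    # Enhancers are padded to 3000 bp and promoters to 2000 bp.
    enhancer = pad_interval("chr1", 3399800, 3400600, 3000)
    promoter = pad_interval("chr1", 3541200, 3542000, 2000)
    print(bed_line(*enhancer, "enhancer_example"))
    print(bed_line(*promoter, "promoter_example"))
```

Fixed-length regions mean that every enhancer contributes the same number of k-mer windows, and likewise every promoter, so raw counts are comparable across pairs.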

STEP2

$ python EPBoost_Train.py k cell_line
This is the training program, which also includes 10-fold cross-validation. k sets the k-mer length and can range from 3 to 7; cell_line is the name of the cell line. The imbalance ratio is 1:20 in both the training set and the test set. The program requires enhancers.bed, promoters.bed, and train.csv as input, and writes enhancers.fa, promoters.fa, enhancers.txt, promoters.txt, and training.txt as intermediate files. Finally, a best_model is generated and saved.
$ python EPBoost2_Train.py k cell_line
This is the training program used for comparison with DeepTACT, which also includes 10-fold cross-validation. k sets the k-mer length and can range from 3 to 7; cell_line is the name of the cell line. The imbalance ratio is 1:20 in the training set and 1:5 in the test set. The program requires enhancers.bed, promoters.bed, and train.csv as input, and writes enhancers.fa, promoters.fa, enhancers.txt, promoters.txt, and training.txt as intermediate files. Finally, a best_model is generated and saved.

Example

$ python EPBoost_Train.py 3 NHEK
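As a rough illustration of 10-fold cross-validation on imbalanced data, the sketch below splits sample indices into folds that each keep the class ratio. It uses only the standard library and is not the splitting code of the training scripts, which may rely on a library routine instead. The function names are hypothetical, and the toy labels assume that 1:20 means one positive pair per twenty negative pairs.

```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each sample index to a fold so every fold keeps the class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, sample_index in enumerate(members):
            folds[position % n_folds].append(sample_index)
    return folds


def train_test_splits(labels, n_folds=10, seed=0):
    """Yield (train_indices, test_indices) for each cross-validation round."""
    folds = stratified_folds(labels, n_folds, seed)
    for held_out in range(n_folds):
        test = sorted(folds[held_out])
        train = sorted(
            i for f, fold in enumerate(folds) if f != held_out for i in fold
        )
        yield train, test


if __name__ == "__main__":
    # Toy labels: 50 positives and 1000 negatives (assumed 1:20 ratio).
    labels = [1] * 50 + [0] * 1000
    for train, test in train_test_splits(labels):
        positives = sum(labels[i] for i in test)
        print(len(test), positives)  # each fold: 105 samples, 5 positives
```

Stratifying matters here because, at a 1:20 ratio, a purely random split can leave a fold with very few positive pairs, which makes per-fold metrics noisy.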

Note

For counting and normalizing the k-mer contents, we largely adapted the code in seer_py, which originally comes from https://github.com/CalabreseLab/seekr. We use the hg19.fa file as the reference genome, which can be downloaded with:

$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
$ gzip -d hg19.fa.gz

Test

$ python EPBoost_Test.py k model_cell_line test_cell_line
This is the test program. k sets the k-mer length and can range from 3 to 7; model_cell_line selects the trained model used for prediction; test_cell_line is the cell line to predict on. For example, $ python EPBoost_Test.py 3 GM12878 NHEK applies the model trained on GM12878 to the NHEK data.

Example

$ python EPBoost_Test.py 3 NHEK GM12878

Predict

$ python Predict.py k cell_line enchrome enstart enend prchrome prstart prend
This is the prediction program. k sets the k-mer length and can range from 3 to 7 (a model trained with k = 3 is provided); cell_line selects the trained model used for prediction; enchrome, enstart, enend and prchrome, prstart, prend give the genomic locations of the enhancer and the promoter to be evaluated, respectively.

Example

$ python3 Predict.py 3 NHEK chr1 3399800 3400600 chr1 3541200 3542000
Output: For Promoter NHEK|chr1:3399800-3400600, Enhancer NHEK|chr1:3541200-3542000 in cell line NHEK : The two elements are predicted interacted by EPBoost, the interaction prediction score is 0.99766.
$ python3 Predict.py 3 NHEK chr1 10000000 10001000 chr1 10004000 10005000
Output: For Promoter NHEK|chr1:10000000-10001000, Enhancer NHEK|chr1:10004000-10005000 in cell line NHEK : The two elements are predicted not to be interacted by EPBoost, the interaction prediction score is 0.0001.
$ python3 Predict.py 3 NHEK chr1 10000000 10001000 chr2 10004000 10005000
Output: The two elements are not in the same chrosome, please recheck your input!
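The third example shows that the enhancer and promoter must lie on the same chromosome. A minimal sketch of that kind of argument check is given below. It is not the validation code in Predict.py: the function name `parse_regions`, the error messages, and the extra check that each start precedes its end are assumptions for illustration.

```python
def parse_regions(args):
    """Parse [enchrome, enstart, enend, prchrome, prstart, prend].

    Returns (enhancer, promoter) as (chrom, start, end) tuples and raises
    ValueError for malformed regions or regions on different chromosomes.
    """
    if len(args) != 6:
        raise ValueError(
            "expected 6 values: enchrome enstart enend prchrome prstart prend"
        )
    en_chrom, pr_chrom = args[0], args[3]
    en_start, en_end = int(args[1]), int(args[2])
    pr_start, pr_end = int(args[4]), int(args[5])
    if en_chrom != pr_chrom:
        raise ValueError("the two elements are not on the same chromosome")
    for start, end in ((en_start, en_end), (pr_start, pr_end)):
        if start < 0 or end <= start:
            raise ValueError("each region needs 0 <= start < end")
    return (en_chrom, en_start, en_end), (pr_chrom, pr_start, pr_end)


if __name__ == "__main__":
    enhancer, promoter = parse_regions(
        ["chr1", "3399800", "3400600", "chr1", "3541200", "3542000"]
    )
    print(enhancer, promoter)
```

Checking the inputs before any sequence extraction avoids running the k-mer pipeline on a pair the model cannot score.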

Requirements

Python (run on 3.6.8)
scikit-learn (run on 0.21.3)
numpy (run on 1.16.2)
bedtools (run on 2.28.0)
catboost (run on 0.20)
matplotlib (run on 3.1.2)
tqdm (run on 4.38.0)

License

EPBoost is licensed under the MIT License - details can be found in the LICENSE.md file

Download

Please download the related files from https://github.com/ACatfromUSTC/EPBoost/.
