EPBoost is a fast and accurate method for identifying enhancer-promoter interactions using intrinsic features derived from genomic sequences. It uses the k-mer counts of the sequences as input features and trains and predicts with a CatBoost model.
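To illustrate what "k-mer counts as input features" can look like, here is a minimal sketch that counts overlapping k-mers over the alphabet ACGT. The function name `kmer_counts` is hypothetical and not part of EPBoost; the normalization step that EPBoost applies afterwards is omitted, and skipping windows that contain other symbols (such as N) is an assumption of this sketch.

```python
from itertools import product


def kmer_counts(sequence, k):
    """Count overlapping k-mers made only of A, C, G and T."""
    sequence = sequence.upper()
    # Start every possible k-mer at zero so the feature vector has a fixed size (4**k).
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        # Assumption: windows containing N or other symbols are ignored.
        if kmer in counts:
            counts[kmer] += 1
    return counts


counts = kmer_counts("ACGTACGTNAC", 3)
# A fixed k-mer order gives a feature vector that is comparable across sequences.
vector = [counts[kmer] for kmer in sorted(counts)]
```

With k = 3 the vector has 64 entries; with k = 7 it has 16384, which is one reason the choice of k affects both running time and model size.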
To evaluate the performance of the model, we extracted interaction data for 12 cell lines from TargetFinder (https://github.com/shwhalen/targetfinder) and DeepTACT (https://github.com/liwenran/DeepTACT).
Before running EPBoost, set the file paths to match your local setup. Taking cell line NHEK as an example, the files are laid out as follows:
EPBoost
    EPBoost_Test.py
    Predict.py
    DataPrepare.py
    EPBoost_Train.py
    EPBoost2_Train.py
    dataset
        DeepTACT
        TargetFinder
            NHEK
$ python DataPrepare.py cell_line
This step pads input enhancers to 3000 bp and promoters to 2000 bp; cell_line is the name of the cell line.
Only one input file is needed: dataset/TargetFinder/cell_line/pairs.csv (or dataset/DeepTACT/cell_line/pairs.csv).
Three files are produced: enhancers.bed, promoters.bed and train.csv.
$ python DataPrepare.py NHEK
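The padding of enhancers to 3000 bp and promoters to 2000 bp can be sketched as follows. This is only an illustration: the function `pad_interval` is hypothetical, and extending the interval symmetrically around its midpoint is an assumption, since the exact padding rule is defined in the preparation script itself.

```python
def pad_interval(start, end, target_length):
    """Return a window of exactly target_length bp centred on the midpoint of [start, end)."""
    midpoint = (start + end) // 2
    # Clamp at 0 so the window never starts before the beginning of the chromosome.
    new_start = max(0, midpoint - target_length // 2)
    return new_start, new_start + target_length


# Assumed fixed lengths, taken from the description above.
ENHANCER_LENGTH = 3000
PROMOTER_LENGTH = 2000

enhancer = pad_interval(3399800, 3400600, ENHANCER_LENGTH)
promoter = pad_interval(3541200, 3542000, PROMOTER_LENGTH)
```

Under this sketch an interval longer than the target length would be trimmed rather than padded, and the end of the chromosome is not checked.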
$ python EPBoost_Train.py k cell_line
This is the training program; it also includes 10-fold cross-validation. k sets the k-mer length and can range from 3 to 7; cell_line is the name of the cell line. The imbalance ratio is 1:20 in both the training set and the test set. The program requires enhancers.bed, promoters.bed and train.csv as inputs, and writes enhancers.fa, promoters.fa, enhancers.txt, promoters.txt and training.txt as intermediate files. Finally, a best_model is generated and saved.
$ python EPBoost2_Train.py k cell_line
This is the training program used for comparison with DeepTACT; it also includes 10-fold cross-validation. k sets the k-mer length and can range from 3 to 7; cell_line is the name of the cell line. The imbalance ratio is 1:20 in the training set and 1:5 in the test set. The program requires enhancers.bed, promoters.bed and train.csv as inputs, and writes enhancers.fa, promoters.fa, enhancers.txt, promoters.txt and training.txt as intermediate files. Finally, a best_model is generated and saved.
$ python EPBoost_Train.py 3 NHEK
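With a 1:20 imbalance ratio, a random split can leave a fold with very few positive pairs. The sketch below shows one way to build 10 folds that each keep the same ratio. It is an illustration only: `stratified_folds` is a hypothetical helper, the toy labels are made up, and whether the training scripts stratify their folds this way is an assumption.

```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Assign sample indices to folds so that every class is spread evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        indices = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin over the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# Toy data at a 1:20 ratio: 50 interacting pairs and 1000 non-interacting pairs.
labels = [1] * 50 + [0] * 1000
folds = stratified_folds(labels)

positives_per_fold = []
for test_indices in folds:
    held_out = set(test_indices)
    # Everything outside the held-out fold would be used for training.
    train_indices = [i for i in range(len(labels)) if i not in held_out]
    positives_per_fold.append(sum(labels[i] for i in test_indices))
```

Each held-out fold here contains 5 positives and 100 negatives, so every fold preserves the 1:20 ratio of the full set.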
To count and normalize the k-mer content, we adapted the code in seer_py, which originally comes from https://github.com/CalabreseLab/seekr.
We use hg19.fa as the reference genome. It can be downloaded with:
$ wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
$ gzip -d hg19.fa.gz
$ python EPBoost_Test.py k model_cell_line test_cell_line
This is the test program. k sets the k-mer length and can range from 3 to 7; model_cell_line selects the trained model used for prediction; test_cell_line is the cell line to make predictions for. For example:
$ python EPBoost_Test.py 3 GM12878 NHEK
$ python EPBoost_Test.py 3 NHEK GM12878
$ python Predict.py k cell_line enchrome enstart enend prchrome prstart prend
This is the prediction program. k sets the k-mer length and can range from 3 to 7 (the provided model uses k = 3); cell_line selects the trained model used for prediction; enchrome, enstart and enend give the location of the enhancer, and prchrome, prstart and prend give the location of the promoter to be evaluated.
$ python3 Predict.py 3 NHEK chr1 3399800 3400600 chr1 3541200 3542000
Output: For Promoter NHEK|chr1:3399800-3400600, Enhancer NHEK|chr1:3541200-3542000 in cell line NHEK : The two elements are predicted interacted by EPBoost, the interaction prediction score is 0.99766.
$ python3 Predict.py 3 NHEK chr1 10000000 10001000 chr1 10004000 10005000
Output: For Promoter NHEK|chr1:10000000-10001000, Enhancer NHEK|chr1:10004000-10005000 in cell line NHEK : The two elements are predicted not to be interacted by EPBoost, the interaction prediction score is 0.0001.
$ python3 Predict.py 3 NHEK chr1 10000000 10001000 chr2 10004000 10005000
Output: The two elements are not in the same chrosome, please recheck your input!
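As the last example shows, a pair is only scored when both elements lie on the same chromosome. A minimal sketch of such an input check is given below; `check_pair` is a hypothetical helper rather than a function from Predict.py, and the extra range check (start before end, no negative coordinates) is an assumption added for illustration.

```python
def check_pair(enchrome, enstart, enend, prchrome, prstart, prend):
    """Return a list of problems with the coordinates; an empty list means the pair can be scored."""
    problems = []
    if enchrome != prchrome:
        problems.append("enhancer and promoter are on different chromosomes")
    for name, start, end in (("enhancer", enstart, enend),
                             ("promoter", prstart, prend)):
        # Assumed extra check: coordinates must describe a non-empty, non-negative range.
        if start < 0 or end <= start:
            problems.append(f"{name} interval {start}-{end} is not a valid range")
    return problems


same_chromosome = check_pair("chr1", 3399800, 3400600, "chr1", 3541200, 3542000)
different_chromosome = check_pair("chr1", 10000000, 10001000, "chr2", 10004000, 10005000)
```

Running a check like this before any sequence extraction avoids spending time on k-mer counting for a pair that will be rejected anyway.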
- Python (run on 3.6.8)
- scikit-learn (run on 0.21.3)
- numpy (run on 1.16.2)
- bedtools (run on 2.28.0)
- catboost (run on 0.20)
- matplotlib (run on 3.1.2)
- tqdm (run on 4.38.0)
EPBoost is licensed under the MIT License; details can be found in the LICENSE.md file.
Please download the related profiles from https://github.com/ACatfromUSTC/EPBoost/.