Data processing and analysis tools for scTaILoR-seq
The tools in this repository are intended to create cell-by-transcript counts tables from raw data generated from scTaILoR-seq experiments.
Data processing was performed on a High-Performance Computing (HPC) platform running CentOS 7.9.
- anaconda3 version 5.0.1
- python version 3.6.3
- pysam version 0.21.0
- java version 11.0.2
- nextflow version 22.04.4
- wf-single-cell version 0.1.5 (epi2me-labs)
- umi-tools version 1.1.2 (conda environment)
- isoquant version 3.3 (conda environment)
- samtools version 1.18
Installation time is on the order of minutes.
- Clone this repository
- Install listed requirements
- Clone the wf-single-cell workflow
- Create the umitools conda environment: conda env create -f umi_tools_conda_enviroment.yml
- Create the isoquant conda environment: conda env create -f isoquant_conda_enviroment.yml
Runtime is on the order of hours to days, depending on sequencing depth. The test data requires approximately 8 hours using 4 cores and 50 GB RAM.
Download LR_3CL_cancer_R1_1.sub1000k.fastq.gz using the SRA Toolkit and BioProject accession PRJNA993664 (tutorial: https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump). Then downsample the data to 1 million reads using SeqKit (https://bioinf.shenwei.me/seqkit/usage/#sample).
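To illustrate what downsampling to a fixed number of reads involves, here is a minimal pure-Python sketch that uses reservoir sampling. It is not SeqKit's implementation and is not part of this repository; the function names `read_fastq` and `reservoir_sample`, the seed value, and the toy reads are all hypothetical.

```python
import io
import random
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

FastqRecord = Tuple[str, ...]  # (header, sequence, separator, quality)


def read_fastq(handle: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield 4-line FASTQ records; a trailing partial record is dropped."""
    lines = iter(handle)
    while True:
        record = tuple(line.rstrip("\n") for line in islice(lines, 4))
        if len(record) < 4:
            return
        yield record


def reservoir_sample(records: Iterable[FastqRecord], n: int,
                     seed: int = 11) -> List[FastqRecord]:
    """Return up to n records chosen uniformly at random in a single pass."""
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    for i, record in enumerate(records):
        if i < n:
            # Fill the reservoir with the first n records.
            reservoir.append(record)
        else:
            # Replace a random slot with probability n / (i + 1).
            j = rng.randint(0, i)
            if j < n:
                reservoir[j] = record
    return reservoir


# Toy input: five made-up reads held in memory.
demo_fastq = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
records = list(read_fastq(io.StringIO(demo_fastq)))
subset = reservoir_sample(records, n=3, seed=11)
print(len(records), len(subset))  # prints: 5 3
```

For a real gzipped FASTQ file, the handle could come from `gzip.open(path, "rt")` and `n` would be 1,000,000. For actual runs, SeqKit remains the tool this README uses.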
Details found here: https://github.com/epi2me-labs/wf-single-cell
nextflow run epi2me-labs/wf-single-cell \
-r v0.1.5 \
-w ${OUTPUT}/${PREFIX}_workspace \
-c ${CONFIG} \
-profile singularity \
--max_threads ${CORES} \
--resources_mm2_max_threads ${CORES} \
--fastq ${FQ} \
--ref_genome_dir ${REFERENCE} \
--out_dir ${OUTPUT}
# merge .bam intermediates
cd ${OUTPUT}/${PREFIX}/bams
samtools merge wf_SC.bam *.bam
samtools index wf_SC.bam
The wf-single-cell workflow produces intermediate chromosome-specific .bam files. These must be merged with samtools before UMI group assignment with the umi-tools package.
Details found here: https://github.com/CGATOxford/UMI-tools
# before running `umi_tools`, filter out records that lack a 12-nt UMI sequence
python3 tidy_UMI.py ${OUTPUT}/${PREFIX}/bams/wf_SC.bam # output has a .tidy.bam suffix
# load umitools conda env
source activate umitools
# tag merged bam
umi_tools group \
--output-bam \
--stdin=${OUTPUT}/${PREFIX}/bams/wf_SC.tidy.bam \
--stdout=${OUTPUT}/${PREFIX}/bams/wf_SC.grouped.bam \
--per-cell \
--per-gene \
--extract-umi-method=tag \
--umi-tag=UB \
--cell-tag=CB \
--gene-tag=GN
# keep longest read in each UMI group (dedup_UMI.py included in git repo)
### Alternatively, use `umi_tools dedup`
python3 dedup_UMI.py ${OUTPUT}/${PREFIX}/bams/wf_SC.grouped.bam
samtools view \
-N ${OUTPUT}/${PREFIX}/bams/qname_umitools.txt \
-o ${OUTPUT}/${PREFIX}/bams/wf_SC.dedup.bam \
${OUTPUT}/${PREFIX}/bams/wf_SC.grouped.bam
samtools sort \
-o ${OUTPUT}/${PREFIX}/bams/wf_SC.dedup.sorted.bam \
${OUTPUT}/${PREFIX}/bams/wf_SC.dedup.bam
samtools index ${OUTPUT}/${PREFIX}/bams/wf_SC.dedup.sorted.bam
The resulting .bam file contains one representative sequence (the longest read) for each UMI group.
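The filtering and deduplication logic described above can be sketched on plain Python records. This is an illustration only and not the code in tidy_UMI.py or dedup_UMI.py. The `Read` type and the function names are hypothetical. Reads are grouped here by exact (cell, gene, UMI) match, which is a simplification: `umi_tools group` assigns groups with an error-tolerant method.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple

UMI_LENGTH = 12  # UMI length required by the filtering step


class Read(NamedTuple):
    """Hypothetical stand-in for one BAM record."""
    qname: str   # read name
    cell: str    # cell barcode (CB tag)
    gene: str    # gene assignment (GN tag)
    umi: str     # UMI sequence (UB tag); shorter or empty if incomplete
    length: int  # read length in bases


def has_valid_umi(read: Read) -> bool:
    """True if the read carries a UMI of exactly UMI_LENGTH bases."""
    return len(read.umi) == UMI_LENGTH


def keep_longest_per_group(reads: Iterable[Read]) -> List[str]:
    """Return sorted names of the longest read in each (cell, gene, UMI) group."""
    best: Dict[Tuple[str, str, str], Read] = {}
    for read in reads:
        if not has_valid_umi(read):
            continue  # drop records without a full-length UMI
        key = (read.cell, read.gene, read.umi)
        current = best.get(key)
        # On ties, the first read encountered is kept.
        if current is None or read.length > current.length:
            best[key] = read
    return sorted(read.qname for read in best.values())


# Toy records with made-up barcodes, genes, and UMIs.
reads = [
    Read("r1", "AAAC", "GENE1", "ACGTACGTACGT", 900),
    Read("r2", "AAAC", "GENE1", "ACGTACGTACGT", 1500),  # longest in its group
    Read("r3", "AAAC", "GENE1", "ACGTACGT", 2000),      # 8-nt UMI, filtered out
    Read("r4", "TTTG", "GENE1", "ACGTACGTACGT", 700),   # different cell
    Read("r5", "AAAC", "GENE2", "GGGGCCCCAAAA", 1200),  # different gene
]
kept = keep_longest_per_group(reads)
print(kept)  # prints: ['r2', 'r4', 'r5']
```

A list of read names like `kept` plays the same role as the qname_umitools.txt file passed to `samtools view -N` in the commands above.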
Details found here: https://github.com/ablab/IsoQuant
# load isoquant conda env
source activate isoquant
# run isoquant
isoquant.py \
--reference ${REFERENCE} \
--genedb ${GTF} \
--bam ${OUTPUT}/${PREFIX}/bams/wf_SC.dedup.sorted.bam \
--data_type nanopore \
--read_group tag:CB \
-o ${OUTPUT}/isoquant
The cell-by-gene and cell-by-transcript tables (SAMPLE_ID.gene_grouped_counts.tsv and SAMPLE_ID.transcript_grouped_counts.tsv, respectively) were used in downstream analyses.
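As a starting point for downstream work, the sketch below loads a grouped counts table and sums the counts per cell. It assumes a tab-separated matrix whose header row holds a feature-ID column followed by one column per cell barcode. That layout is an assumption, so check it against the files your IsoQuant version writes. The function names, transcript IDs, and barcodes are hypothetical.

```python
import csv
import io
from typing import Dict, Iterable

CountsTable = Dict[str, Dict[str, float]]  # feature ID -> {cell barcode: count}


def read_grouped_counts(handle: Iterable[str]) -> CountsTable:
    """Parse a tab-separated feature-by-cell matrix into nested dicts."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    cells = header[1:]  # first column is assumed to hold the feature ID
    table: CountsTable = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        feature, values = row[0], row[1:]
        table[feature] = {cell: float(v) for cell, v in zip(cells, values)}
    return table


def counts_per_cell(table: CountsTable) -> Dict[str, float]:
    """Sum counts over all features for each cell barcode."""
    totals: Dict[str, float] = {}
    for per_cell in table.values():
        for cell, count in per_cell.items():
            totals[cell] = totals.get(cell, 0.0) + count
    return totals


# Toy table with made-up transcript IDs and cell barcodes.
demo_tsv = (
    "#feature_id\tAAAC\tTTTG\n"
    "TX0001\t3\t0\n"
    "TX0002\t1\t2\n"
)
table = read_grouped_counts(io.StringIO(demo_tsv))
totals = counts_per_cell(table)
print(totals)  # prints: {'AAAC': 4.0, 'TTTG': 2.0}
```

For a real table, pass an open file handle, for example `open(path, newline="")`, in place of the in-memory string.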
See the files with the "analysis." prefix.
- analysis.merge.py = Regression-based matrix merge (two or more targeting panels on a single sample)
- analysis.haplotype.py = Haplotype determination