Question: How to increase the speed of subread for DNA whole genome sequence
gravatar for tarek.mohamed
3.9 years ago by
tarek.mohamed270 wrote:

Hi All,

I am trying to align one DNA whole genome sequence file using subread, the sequence reads are paired end, 150 bp reads. I used 'thread=64' and 216 core (9 nodes, 24 core each). but after two days it processed only 15% and I had to kill the job.

I am using the university quest, so basically I can get as many nodes/cores as I want. Is there any way by which I can accelerate the alignment.

>align(index="my_index",readfile1=c("141023_H0E72_113-JG4_L003_R1.fastq.gz"),readfile2=c("141023_H0E72_113-JG4_L003_R2.fastq.gz"), output_file=c("subread_jg4.sam"),nthreads=64,unique=TRUE,type=1)

  ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
       Rsubread 1.22.3

//========================== subread-align setting===========================\\
||                                                                            ||
|| Function      : Read alignment (DNA-Seq)                                   ||
|| Input file 1  : 141023_H0E72_113-JG4_L003_R1.fastq.gz                      ||
|| Input file 2  : 141023_H0E72_113-JG4_L003_R2.fastq.gz                      ||
|| Output file   : subread_jg4.sam (BAM)                                      ||
|| Index name    : my_index                                                   ||
||                                                                            ||
||                       Threads : 40                                         ||
||                  Phred offset : 33                                         ||
||       # of extracted subreads : 10                                         ||
||                Min read1 vote : 3                                          ||
||                Min read2 vote : 1                                          ||
||             Max fragment size : 600                                        ||
||             Min fragment size : 50                                         ||
||    Maximum allowed mismatches : 3                                          ||
||   Maximum allowed indel bases : 5                                          ||
|| # of best alignments reported : 1                                          ||
||                Unique mapping : yes                                        ||
||                                                                            ||
\\===================== ==============

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.22.3



subread dna sequence speed • 1.3k views
ADD COMMENTlink modified 2.2 years ago by andysw9020 • written 3.9 years ago by tarek.mohamed270
1) How many reads do you have (or, how big are your input files in GB)?
2) What reference are you aligning them to?
ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by Brian Bushnell17k

Hi Brian,

Thanks for the reply,

I have two files for paired end sequence, each is 35 GB. I am aligning to the human ref genome NCBI.GRch38.


ADD REPLYlink written 3.9 years ago by tarek.mohamed270

OK, that's going way too slow and I have no idea why. I have not used Subread, though; are you sure it supports multiple nodes? Most software doesn't.

You could try BBMap (which I wrote) like this: -Xmx31g ref=humanGenome.fasta in=141023_H0E72_113-JG4_L003_R1.fastq.gz in2=141023_H0E72_113-JG4_L003_R2.fastq.gz out=mapped.sam maxindel=100

For your data I estimate it would finish in around 16 hours on one 24-core node . If you want to run on multiple nodes, you can first partition the data with (included with BBMap), for example: in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8

...then run each pair on a different node and merge the bam files.

ADD REPLYlink written 3.9 years ago by Brian Bushnell17k

Hello tarek.mohamed!

It appears that your post has been cross-posted to another site:

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLYlink written 3.9 years ago by WouterDeCoster44k

Hi wouterDeCoster

I did not know that I am not allowed to post same post in 2 different site.


ADD REPLYlink written 3.9 years ago by tarek.mohamed270

It is not that you are not allowed to cross-post but people who frequently contribute answers on Biostars frown upon this practice since it may unnecessarily lead to duplication of effort on part of those who provide answers.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by genomax87k

I see.


ADD REPLYlink written 3.9 years ago by tarek.mohamed270
gravatar for andysw90
2.2 years ago by
andysw9020 wrote:

Outputting to bam rather than sam might help -- smaller files = less time writing.

ADD COMMENTlink written 2.2 years ago by andysw9020

Doesn't writing BAM takes longer due to the compression time?

ADD REPLYlink written 2.2 years ago by ATpoint36k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1301 users visited in the last hour