Question: How to increase the speed of Subread for DNA whole-genome sequencing
tarek.mohamed wrote 2.5 years ago:

Hi All,

I am trying to align one DNA whole-genome sequencing file using Subread. The reads are 150 bp, paired-end. I used nthreads=64 and 216 cores (9 nodes, 24 cores each), but after two days it had processed only about 15% of the reads and I had to kill the job.

I am using the university's Quest cluster, so I can basically get as many nodes/cores as I want. Is there any way to speed up the alignment?

> align(index="my_index", readfile1="141023_H0E72_113-JG4_L003_R1.fastq.gz", readfile2="141023_H0E72_113-JG4_L003_R2.fastq.gz", output_file="subread_jg4.sam", nthreads=64, unique=TRUE, type=1)

       Rsubread 1.22.3

//========================== subread-align setting===========================\\
||                                                                            ||
|| Function      : Read alignment (DNA-Seq)                                   ||
|| Input file 1  : 141023_H0E72_113-JG4_L003_R1.fastq.gz                      ||
|| Input file 2  : 141023_H0E72_113-JG4_L003_R2.fastq.gz                      ||
|| Output file   : subread_jg4.sam (BAM)                                      ||
|| Index name    : my_index                                                   ||
||                                                                            ||
||                       Threads : 40                                         ||
||                  Phred offset : 33                                         ||
||       # of extracted subreads : 10                                         ||
||                Min read1 vote : 3                                          ||
||                Min read2 vote : 1                                          ||
||             Max fragment size : 600                                        ||
||             Min fragment size : 50                                         ||
||    Maximum allowed mismatches : 3                                          ||
||   Maximum allowed indel bases : 5                                          ||
|| # of best alignments reported : 1                                          ||
||                Unique mapping : yes                                        ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.22.3

Thanks

Tarek

Tags: subread, dna, sequence, speed
modified 10 months ago by andysw90 • written 2.5 years ago by tarek.mohamed
1) How many reads do you have (or, how big are your input files in GB)?
2) What reference are you aligning them to?
modified 2.5 years ago • written 2.5 years ago by Brian Bushnell

Hi Brian,

Thanks for the reply,

I have two paired-end files, each 35 GB. I am aligning to the human reference genome (NCBI GRCh38).

Tarek

written 2.5 years ago by tarek.mohamed

OK, that's going way too slow and I have no idea why. I have not used Subread, though; are you sure it supports multiple nodes? Most software doesn't.

You could try BBMap (which I wrote) like this:

bbmap.sh -Xmx31g ref=humanGenome.fasta in=141023_H0E72_113-JG4_L003_R1.fastq.gz in2=141023_H0E72_113-JG4_L003_R2.fastq.gz out=mapped.sam maxindel=100

For your data I estimate it would finish in around 16 hours on one 24-core node. If you want to run on multiple nodes, you can first partition the data with partition.sh (included with BBMap), for example:

partition.sh in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8

...then run each pair on a different node and merge the bam files.
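
To illustrate that last merge step in R (a sketch only; command-line samtools merge works equally well), assuming the partitioned runs produced SAM files with hypothetical names mapped_part1.sam ... mapped_part8.sam, Rsamtools can convert, sort, and merge them:

library(Rsamtools)

# Hypothetical per-node outputs from the partitioned mapping runs.
sams <- sprintf("mapped_part%d.sam", 1:8)

# asBam() converts each SAM to a coordinate-sorted, indexed BAM.
bams <- vapply(seq_along(sams),
               function(i) asBam(sams[i], destination = sprintf("mapped_part%d", i)),
               character(1))

# Merge the sorted pieces into a single BAM and index it for downstream use.
merged <- mergeBam(bams, destination = "mapped_merged.bam", overwrite = TRUE)
indexBam(merged)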

written 2.5 years ago by Brian Bushnell

Hello tarek.mohamed!

It appears that your post has been cross-posted to another site: https://support.bioconductor.org/p/87097/

This is typically not recommended as it runs the risk of annoying people in both communities.

written 2.5 years ago by WouterDeCoster

Hi WouterDeCoster,

I did not know that I was not supposed to post the same question on two different sites.

Tarek

written 2.5 years ago by tarek.mohamed

It is not that you are not allowed to cross-post, but people who frequently contribute answers on Biostars frown upon the practice, since it can lead to unnecessary duplication of effort on the part of those who provide answers.

modified 2.5 years ago • written 2.5 years ago by genomax

I see.

Thanks

written 2.5 years ago by tarek.mohamed
andysw90 wrote 10 months ago:

Outputting to BAM rather than SAM might help -- smaller files mean less time spent writing.
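
For example, in Rsubread itself this could look like the sketch below, assuming your Rsubread version exposes the output_format argument of align() (recent releases do, and newer versions default to BAM already):

library(Rsubread)

# Sketch only: request compressed BAM output directly from align()
# instead of writing a large plain-text SAM file.
align(index         = "my_index",
      readfile1     = "141023_H0E72_113-JG4_L003_R1.fastq.gz",
      readfile2     = "141023_H0E72_113-JG4_L003_R2.fastq.gz",
      output_file   = "subread_jg4.bam",   # .bam extension to match the format
      output_format = "BAM",               # write BAM rather than SAM
      nthreads      = 64,
      unique        = TRUE,
      type          = 1)                   # 1 = genomic DNA-seq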

written 10 months ago by andysw90

Doesn't writing BAM take longer due to the compression time?

written 10 months ago by ATpoint