Question: How to increase the speed of Subread for DNA whole-genome sequencing
tarek.mohamed wrote 2.5 years ago:

Hi All,

I am trying to align one DNA whole-genome sequencing file using Subread. The reads are 150 bp, paired-end. I used nthreads=64 and 216 cores (9 nodes, 24 cores each), but after two days it had processed only about 15% of the reads and I had to kill the job.

I am using the university's Quest cluster, so I can basically get as many nodes/cores as I want. Is there any way to speed up the alignment?

> align(index="my_index", readfile1="141023_H0E72_113-JG4_L003_R1.fastq.gz", readfile2="141023_H0E72_113-JG4_L003_R2.fastq.gz", output_file="subread_jg4.sam", nthreads=64, unique=TRUE, type=1)

       Rsubread 1.22.3

//========================== subread-align setting===========================\\
||                                                                            ||
|| Function      : Read alignment (DNA-Seq)                                   ||
|| Input file 1  : 141023_H0E72_113-JG4_L003_R1.fastq.gz                      ||
|| Input file 2  : 141023_H0E72_113-JG4_L003_R2.fastq.gz                      ||
|| Output file   : subread_jg4.sam (BAM)                                      ||
|| Index name    : my_index                                                   ||
||                                                                            ||
||                       Threads : 40                                         ||
||                  Phred offset : 33                                         ||
||       # of extracted subreads : 10                                         ||
||                Min read1 vote : 3                                          ||
||                Min read2 vote : 1                                          ||
||             Max fragment size : 600                                        ||
||             Min fragment size : 50                                         ||
||    Maximum allowed mismatches : 3                                          ||
||   Maximum allowed indel bases : 5                                          ||
|| # of best alignments reported : 1                                          ||
||                Unique mapping : yes                                        ||
||                                                                            ||
\\===================== http://subread.sourceforge.net/ ======================//

> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] Rsubread_1.22.3

Thanks

Tarek

Tags: subread, dna, sequence, speed
modified 10 months ago by andysw90 • written 2.5 years ago by tarek.mohamed
1) How many reads do you have (or, how big are your input files in GB)?
2) What reference are you aligning them to?
modified 2.5 years ago • written 2.5 years ago by Brian Bushnell

Hi Brian,

Thanks for the reply,

I have two paired-end files, each 35 GB. I am aligning to the human reference genome (NCBI GRCh38).

Tarek

written 2.5 years ago by tarek.mohamed

OK, that's going way too slow and I have no idea why. I have not used Subread, though; are you sure it supports multiple nodes? Most software doesn't.

You could try BBMap (which I wrote) like this:

bbmap.sh -Xmx31g ref=humanGenome.fasta in=141023_H0E72_113-JG4_L003_R1.fastq.gz in2=141023_H0E72_113-JG4_L003_R2.fastq.gz out=mapped.sam maxindel=100

For your data I estimate it would finish in around 16 hours on one 24-core node. If you want to run on multiple nodes, you can first partition the data with partition.sh (included with BBMap), for example:

partition.sh in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8

...then run each pair on a different node and merge the bam files.
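
To illustrate that last merge step in R (a sketch only; command-line samtools merge works equally well), assuming the partitioned runs produced SAM files with hypothetical names mapped_part1.sam ... mapped_part8.sam, Rsamtools can convert, sort, and merge them:

library(Rsamtools)

# Hypothetical per-node outputs from the partitioned mapping runs.
sams <- sprintf("mapped_part%d.sam", 1:8)

# asBam() converts each SAM to a coordinate-sorted, indexed BAM.
bams <- vapply(seq_along(sams),
               function(i) asBam(sams[i], destination = sprintf("mapped_part%d", i)),
               character(1))

# Merge the sorted pieces into a single BAM and index it for downstream use.
merged <- mergeBam(bams, destination = "mapped_merged.bam", overwrite = TRUE)
indexBam(merged)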

written 2.5 years ago by Brian Bushnell

Hello tarek.mohamed!

It appears that your post has been cross-posted to another site: https://support.bioconductor.org/p/87097/

This is typically not recommended as it runs the risk of annoying people in both communities.

written 2.5 years ago by WouterDeCoster

Hi WouterDeCoster,

I did not know that I was not supposed to post the same question on two different sites.

Tarek

written 2.5 years ago by tarek.mohamed

It is not that you are not allowed to cross-post, but people who frequently contribute answers on Biostars frown upon the practice, since it can lead to unnecessary duplication of effort on the part of those who provide answers.

modified 2.5 years ago • written 2.5 years ago by genomax

I see.

Thanks

written 2.5 years ago by tarek.mohamed
andysw90 wrote 10 months ago:

Outputting to BAM rather than SAM might help -- smaller files mean less time spent writing.
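
For example, in Rsubread itself this could look like the sketch below, assuming your Rsubread version exposes the output_format argument of align() (recent releases do, and newer versions default to BAM already):

library(Rsubread)

# Sketch only: request compressed BAM output directly from align()
# instead of writing a large plain-text SAM file.
align(index         = "my_index",
      readfile1     = "141023_H0E72_113-JG4_L003_R1.fastq.gz",
      readfile2     = "141023_H0E72_113-JG4_L003_R2.fastq.gz",
      output_file   = "subread_jg4.bam",   # .bam extension to match the format
      output_format = "BAM",               # write BAM rather than SAM
      nthreads      = 64,
      unique        = TRUE,
      type          = 1)                   # 1 = genomic DNA-seq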

written 10 months ago by andysw90

Doesn't writing BAM take longer due to the compression time?

written 10 months ago by ATpoint