Entering edit mode
6.5 years ago
tarek.mohamed ▴ 350
I am trying to align one DNA whole genome sequence file using subread, the sequence reads are paired end, 150 bp reads. I used 'thread=64' and 216 core (9 nodes, 24 core each). but after two days it processed only 15% and I had to kill the job.
I am using the university quest, so basically I can get as many nodes/cores as I want. Is there any way by which I can accelerate the alignment.
>align(index="my_index",readfile1=c("141023_H0E72_113-JG4_L003_R1.fastq.gz"),readfile2=c("141023_H0E72_113-JG4_L003_R2.fastq.gz"), output_file=c("subread_jg4.sam"),nthreads=64,unique=TRUE,type=1) ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ Rsubread 1.22.3 //========================== subread-align setting===========================\\ || || || Function : Read alignment (DNA-Seq) || || Input file 1 : 141023_H0E72_113-JG4_L003_R1.fastq.gz || || Input file 2 : 141023_H0E72_113-JG4_L003_R2.fastq.gz || || Output file : subread_jg4.sam (BAM) || || Index name : my_index || || || || Threads : 40 || || Phred offset : 33 || || # of extracted subreads : 10 || || Min read1 vote : 3 || || Min read2 vote : 1 || || Max fragment size : 600 || || Min fragment size : 50 || || Maximum allowed mismatches : 3 || || Maximum allowed indel bases : 5 || || # of best alignments reported : 1 || || Unique mapping : yes || || || \\===================== http://subread.sourceforge.net/ ============== > sessionInfo() R version 3.3.1 (2016-06-21) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Red Hat Enterprise Linux Server release 6.4 (Santiago) locale:  LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C  LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8  LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8  LC_PAPER=en_US.UTF-8 LC_NAME=C  LC_ADDRESS=C LC_TELEPHONE=C  LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C attached base packages:  stats graphics grDevices utils datasets methods base other attached packages:  Rsubread_1.22.3
Thanks for the reply,
I have two files for paired end sequence, each is 35 GB. I am aligning to the human ref genome NCBI.GRch38.
OK, that's going way too slow and I have no idea why. I have not used Subread, though; are you sure it supports multiple nodes? Most software doesn't.
You could try BBMap (which I wrote) like this:
bbmap.sh -Xmx31g ref=humanGenome.fasta in=141023_H0E72_113-JG4_L003_R1.fastq.gz in2=141023_H0E72_113-JG4_L003_R2.fastq.gz out=mapped.sam maxindel=100
For your data I estimate it would finish in around 16 hours on one 24-core node . If you want to run on multiple nodes, you can first partition the data with partition.sh (included with BBMap), for example:
partition.sh in=r1.fq in2=r2.fq out=r1_part%.fq out2=r2_part%.fq ways=8
...then run each pair on a different node and merge the bam files.
It appears that your post has been cross-posted to another site: https://support.bioconductor.org/p/87097/
This is typically not recommended as it runs the risk of annoying people in both communities.
I did not know that I am not allowed to post same post in 2 different site.
It is not that you are not allowed to cross-post but people who frequently contribute answers on Biostars frown upon this practice since it may unnecessarily lead to duplication of effort on part of those who provide answers.