Question: Do Aligners Tend To Scale With Number Of Bases Or With Number Of Reads?
Ian wrote (9.0 years ago):

I would like to find an efficient way of aligning large fastq files (to the human reference genome) by first splitting each fastq into smaller pieces so that they can be aligned in parallel. I can think of two ways of doing this: splitting the fastq into files with a fixed number of bases (e.g. a billion bases per file) or into files with a fixed number of reads (e.g. 10 million reads per file). Does anyone know which approach should be more efficient in terms of run time? This question is particularly relevant when different fastq files have different read lengths.

I suppose another way of asking the question is: Do aligners tend to scale with number of bases or with number of reads (in terms of run time)? The aligners I am most interested in are BWA, BFAST and stampy.

Many thanks,

Ian

Tags: fastq, alignment, bwa
lh3 wrote (9.0 years ago):

BWA roughly scales with the number of bases. Nonetheless, I do not think the choice matters at all for data splitting. The total CPU time is roughly fixed however you split the data. The wall-clock time depends on how many CPU cores you use at once.
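
To make the arithmetic above concrete, here is a minimal sketch, not a benchmark. It assumes run time is exactly proportional to the number of bases (the answer only says "roughly"), and the names `total_cpu`, `wall_clock` and `seconds_per_base` are hypothetical, introduced only for illustration.

```python
import heapq

def total_cpu(chunk_bases, seconds_per_base=1.0):
    """Total CPU time, assuming cost is proportional to bases (an assumption)."""
    return sum(chunk_bases) * seconds_per_base

def wall_clock(chunk_bases, cores, seconds_per_base=1.0):
    """Simulate running the chunks on `cores` workers, longest chunk first.

    Each chunk is handed to whichever worker becomes free earliest.
    """
    finish = [0.0] * cores          # time at which each worker becomes free
    heapq.heapify(finish)
    for bases in sorted(chunk_bases, reverse=True):
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + bases * seconds_per_base)
    return max(finish)

# The same 1000 bases, split into 4 or 8 equal chunks.
four_chunks = [250] * 4
eight_chunks = [125] * 8
```

Under these assumptions `total_cpu` is the same for both splits, while `wall_clock` shrinks only as `cores` grows.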

Manu Prestat wrote (9.0 years ago):

I think the best way to split your file is to generate files of the same size, which is roughly equivalent to the same number of residues. GenomeTools (very, very fast) is your friend.

gt splitfasta -targetsize 50 file.fasta
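
The command above works on FASTA input. Since the question is about fastq, here is a minimal Python sketch of the same idea, grouping reads into chunks of roughly equal base count. It assumes simple four-line fastq records (no wrapped sequences); `read_fastq` and `split_by_bases` are hypothetical helper names, not part of GenomeTools or any aligner.

```python
from itertools import islice

def read_fastq(handle):
    """Yield (header, sequence, plus, quality) from a 4-line-per-record fastq stream."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return                  # clean end of file
        if len(record) < 4:
            raise ValueError("truncated fastq record")
        yield tuple(record)

def split_by_bases(records, target_bases):
    """Group records into chunks holding roughly `target_bases` bases each.

    A chunk is closed as soon as it reaches the target, so it may
    overshoot by at most one read length.
    """
    chunk, bases = [], 0
    for record in records:
        chunk.append(record)
        bases += len(record[1])     # record[1] is the sequence line
        if bases >= target_bases:
            yield chunk
            chunk, bases = [], 0
    if chunk:                       # leftover reads form the last chunk
        yield chunk
```

Each chunk could then be written to its own file and aligned as a separate job. Splitting by a fixed number of reads instead would only require counting records rather than summing sequence lengths.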