Question: Efficiently aligning a lot of reads on the same small reference sequence
gravatar for cedric.stroobandt
14 months ago by
cedric.stroobandt0 wrote:

The context: I have a DNA-sequence coding for a protein, about 1500 bp in length. Using NGS, a lot of reads of (mutants of) this same sequence were acquired. All of these reads need to be aligned to the reference. We're talking about a lot of reads (100,000 - 1,000,000) of not too short length (120 - 300 bp). All of these reads can belong to different mutants of the template sequence, so the alignment is necessary to determine the exact sequence of every single mutant.

Currently, I'm just using Smith-Waterman-based local alignment to align every single read to the reference one by one. Yet I can't help but feel there might be a more computationally/time-efficient solution to this specific problem.

Maybe there exists an algorithm that's not very efficient for most alignment problems, but that becomes very worthwile if it has to map a ton of reads to the same place over and over again. For example, it might do some time-consuming operations on the short template that make it fast to align reads to it, but these operations aren't worth it if only aligning a couple of reads. That's just an idea, I don't know all the different techniques that are out there.

So, to recap: I want to align a lot (100,000 - 1,000,000) of 120-300 bp reads to the same short 1500 bp reference sequence. If anyone has any suggestions about an algorithm, or just a specific workflow, that is particularly suited to do this, it would be appreciated. I work in R so I can implement some things myself, it doesn't have to be a ready-to-use software package or anything like that. Thanks in advance!

Note: I'm new at this, but I also posted this question at But this community seems to be more specialized and active. Naturally, I will report it if helpful answers show up there

next-gen alignment • 828 views
ADD COMMENTlink written 14 months ago by cedric.stroobandt0
gravatar for Joe
14 months ago by
United Kingdom
Joe14k wrote:

Which types of reads do you have? Fastqs?

Why not just use a normal aligner like BWA or Bowtie and see what you get? They will be more than capable of handling small references and lots of reads.

(Incidentally, even 100000 reads will be nothing to these aligners ;) )

ADD COMMENTlink modified 14 months ago • written 14 months ago by Joe14k

It is not clear what kind of data OP has, but both BWA and Bowtie / Bowtie2 accept fasta and fastq as input, so this would cover almost all input data.

ADD REPLYlink modified 14 months ago by Nicolas Rosewick8.3k • written 14 months ago by h.mon27k

The data is indeed fastq format.

I was mainly wondering if there existed specific alignment methods that are very efficient if the reference is really short (so it is sequenced extremely deep), while BWA and Bowtie were designed with very long references in mind (whole genomes). But this may not be the case, and maybe these algorithms are the optimal choice for short references as well. Something like "if it can handle a long reference, it will perform even better with a short one".

That would also be an answer to my question! I'll wait just a little longer to see if extra suggestions for this specific problem come in, but otherwise I'll just accept this answer. Thanks!

ADD REPLYlink written 14 months ago by cedric.stroobandt0

There is no difference between using a long or short reference as far as the aligners/mappers are concerned.

High coverage depth is typically only a problem for de Bruijn graph assemblers which can sometimes choke, obscuring sub populations of reads with variants. It shouldn’t matter for an aligner.

ADD REPLYlink written 14 months ago by Joe14k

Some mappers need special care when building indexes for very small or very large genomes. STAR, for example, need some parameters change:

For small genomes, the parameter --genomeSAindexNbases must to be scaled down, with a typical value of min(14, log2(GenomeLength)/2 -1). For example, for 1 megaBase genome, this is equal to 9, for 100 kiloBase genome, this is equal to 7.

ADD REPLYlink written 14 months ago by h.mon27k

I stand corrected in that case, but I’d wager BWA and Bowtie would give adequate results with default parameters.

ADD REPLYlink written 14 months ago by Joe14k

I’d wager BWA and Bowtie would give adequate results with default parameters.

Indeed. I was just being picky.

ADD REPLYlink written 14 months ago by h.mon27k

Well, thanks for your insights, everyone! I'll give BWA and/or Bowtie a shot. Maybe later I'll comment on how it went, but for now I'll consider the question answered.

ADD REPLYlink written 14 months ago by cedric.stroobandt0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2363 users visited in the last hour