Question: Efficiently aligning a lot of reads on the same small reference sequence
Question by cedric.stroobandt (12 weeks ago):

The context: I have a DNA sequence coding for a protein, about 1500 bp in length. Using NGS, a large number of reads of (mutants of) this same sequence were acquired, and all of them need to be aligned to the reference. We're talking about a lot of reads (100,000 - 1,000,000) of moderate length (120 - 300 bp). Each read can belong to a different mutant of the template sequence, so the alignment is necessary to determine the exact sequence of every single mutant.

Currently, I'm just using Smith-Waterman-based local alignment to align every single read to the reference one by one. Yet I can't help but feel there might be a more computationally/time-efficient solution to this specific problem.

Maybe there exists an algorithm that's not very efficient for most alignment problems, but that becomes very worthwhile if it has to map a ton of reads to the same place over and over again. For example, it might do some time-consuming preprocessing on the short template that makes it fast to align reads to it, operations that wouldn't be worth it when aligning only a couple of reads. That's just an idea; I don't know all the different techniques that are out there.
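For illustration, the preprocessing idea described above can be sketched roughly like this (all names here are made up, not from any particular library): a k-mer seed index is built over the template once, then reused to anchor every read. A real mapper would follow the seed step with a banded Smith-Waterman around the best diagonal instead of filling the full alignment matrix.

```python
# Sketch of the "index the reference once" idea: build a k-mer index of the
# short template a single time, then locate each read cheaply via seed hits.
import random
from collections import defaultdict

def build_kmer_index(reference, k=15):
    """Map every k-mer in the reference to its start positions (built once)."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_position(read, index, k=15):
    """Vote for the reference offset implied by each exact k-mer hit in the read."""
    votes = defaultdict(int)
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            votes[pos - j] += 1   # offset at which the read would start
    # Best-supported offset; a real implementation would then run a banded
    # Smith-Waterman only around this diagonal.
    return max(votes, key=votes.get) if votes else None

random.seed(0)
ref = "".join(random.choice("ACGT") for _ in range(1500))  # stand-in template
read = ref[100:250]                                        # a matching 150 bp "read"
print(candidate_position(read, build_kmer_index(ref)))     # → 100
```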

So, to recap: I want to align a lot (100,000 - 1,000,000) of 120-300 bp reads to the same short 1500 bp reference sequence. If anyone has any suggestions about an algorithm, or just a specific workflow, that is particularly suited to do this, it would be appreciated. I work in R so I can implement some things myself; it doesn't have to be a ready-to-use software package or anything like that. Thanks in advance!

Note: I'm new at this, but I also posted this question at https://bioinformatics.stackexchange.com/questions/4913/efficiently-aligning-a-lot-of-reads-on-the-same-small-reference-sequence. But this community seems to be more specialized and active. Naturally, I will report back here if helpful answers show up there.

Tags: next-gen, alignment
Answer by jrj.healey (United Kingdom, 12 weeks ago):

What type of reads do you have? FASTQ files?

Why not just use a normal aligner like BWA or Bowtie and see what you get? They will be more than capable of handling small references and lots of reads.

(Incidentally, even 100,000 reads will be nothing to these aligners ;) )
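As a hedged sketch of the workflow this answer suggests: the bwa command lines below (index once, then mem) follow its standard usage, but the file names are placeholders, and the snippet only actually invokes bwa when the tool is installed and the inputs exist.

```python
import os
import shutil
import subprocess

ref = "reference.fasta"   # the ~1500 bp template (placeholder file name)
reads = "reads.fastq"     # the NGS reads (placeholder file name)

# Index the reference once (trivial for 1.5 kb), then map every read to it.
index_cmd = ["bwa", "index", ref]
map_cmd = ["bwa", "mem", ref, reads]

# Guarded so the sketch is safe to run anywhere: bwa is only called when it
# is installed and the placeholder input files actually exist.
if shutil.which("bwa") and os.path.exists(ref) and os.path.exists(reads):
    subprocess.run(index_cmd, check=True)
    with open("aligned.sam", "w") as sam:
        subprocess.run(map_cmd, check=True, stdout=sam)
```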


It is not clear what kind of data OP has, but both BWA and Bowtie / Bowtie2 accept FASTA and FASTQ as input, so this would cover almost all input data.

— h.mon

The data is indeed in FASTQ format.

I was mainly wondering if there existed specific alignment methods that are very efficient if the reference is really short (so it is sequenced extremely deep), while BWA and Bowtie were designed with very long references in mind (whole genomes). But this may not be the case, and maybe these algorithms are the optimal choice for short references as well. Something like "if it can handle a long reference, it will perform even better with a short one".

That would also be an answer to my question! I'll wait just a little longer to see if extra suggestions for this specific problem come in, but otherwise I'll just accept this answer. Thanks!

— cedric.stroobandt

There is no difference between using a long or short reference as far as the aligners/mappers are concerned.

High coverage depth is typically only a problem for de Bruijn graph assemblers, which can sometimes choke on it, obscuring subpopulations of reads carrying variants. It shouldn't matter for an aligner.

— jrj.healey

Some mappers need special care when building indexes for very small or very large genomes. STAR, for example, needs some parameters changed:

For small genomes, the parameter --genomeSAindexNbases must be scaled down, with a typical value of min(14, log2(GenomeLength)/2 - 1). For example, for a 1 megabase genome this is equal to 9, and for a 100 kilobase genome it is equal to 7.

— h.mon
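The scaling rule quoted in the comment above can be written as a small helper (the function name is made up; the formula is the one from STAR's manual, rounded to the nearest integer to reproduce the quoted examples):

```python
import math

def sa_index_nbases(genome_length):
    """STAR's recommended --genomeSAindexNbases for small genomes:
    min(14, log2(GenomeLength)/2 - 1), rounded to the nearest integer."""
    return round(min(14, math.log2(genome_length) / 2 - 1))

print(sa_index_nbases(1_000_000))  # 9, matching the 1 megabase example
print(sa_index_nbases(100_000))    # 7, matching the 100 kilobase example
print(sa_index_nbases(1500))       # 4, for a 1.5 kb reference like the OP's
```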

I stand corrected in that case, but I’d wager BWA and Bowtie would give adequate results with default parameters.

— jrj.healey

> I’d wager BWA and Bowtie would give adequate results with default parameters.

Indeed. I was just being picky.

— h.mon

Well, thanks for your insights, everyone! I'll give BWA and/or Bowtie a shot. Maybe later I'll comment on how it went, but for now I'll consider the question answered.

— cedric.stroobandt
Powered by Biostar version 2.3.0