The context: I have a DNA sequence coding for a protein, about 1500 bp in length. Using NGS, many reads of (mutants of) this same sequence were acquired, and all of them need to be aligned to the reference. We're talking about a lot of reads (100,000 - 1,000,000) of moderate length (120 - 300 bp). Each read can belong to a different mutant of the template sequence, so the alignment is necessary to determine the exact sequence of every single mutant.
Currently, I'm just using Smith-Waterman-based local alignment to align every single read to the reference one by one. Yet I can't help but feel there might be a more computationally/time-efficient solution to this specific problem.
Maybe there exists an algorithm that isn't very efficient for most alignment problems, but that becomes very worthwhile if it has to map a ton of reads to the same place over and over again. For example, it might do some time-consuming preprocessing on the short template that makes it fast to align reads to it, preprocessing that wouldn't pay off if only a couple of reads were being aligned. That's just an idea; I don't know all the different techniques that are out there.
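The idea described above (preprocess the reference once, then align many reads cheaply) is essentially seed-and-extend indexing, which is what most modern mappers do. A minimal sketch in Python, assuming a simple k-mer position index and exact-match seeds; all names and the seed length are illustrative, not any specific tool's API:

```python
# Sketch of "index the reference once, then anchor many reads against it".
# Build a k-mer -> positions index of the reference (done once), then for
# each read let exact k-mer seeds vote on the read's start offset on the
# reference. A full local alignment would then only be run around that
# anchor instead of against the whole reference.
import random
from collections import defaultdict

K = 11  # seed length; an assumption, tune for the expected error rate

def build_index(reference, k=K):
    """Map every k-mer of the reference to its start positions (built once)."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def anchor_read(read, index, k=K):
    """Vote on the most likely reference offset using exact k-mer seeds."""
    votes = defaultdict(int)
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            votes[pos - j] += 1  # implied start position of the read
    return max(votes, key=votes.get) if votes else None

# Illustration with a random 1500 bp stand-in reference:
random.seed(0)
ref = "".join(random.choice("ACGT") for _ in range(1500))
index = build_index(ref)
read = ref[200:320]  # a perfectly matching 120 bp read
print(anchor_read(read, index))  # → 200
```

Because seeds only need to match exactly between mutations, point mutations in a read leave most seeds intact, so the voting still anchors the read correctly; the exact variant calls would come from the subsequent local alignment around the anchor.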
So, to recap: I want to align a lot (100,000 - 1,000,000) of 120-300 bp reads to the same short 1500 bp reference sequence. If anyone has any suggestions about an algorithm, or just a specific workflow, that is particularly suited to do this, it would be appreciated. I work in R so I can implement some things myself, it doesn't have to be a ready-to-use software package or anything like that. Thanks in advance!
Note: I'm new at this, but I also posted this question at https://bioinformatics.stackexchange.com/questions/4913/efficiently-aligning-a-lot-of-reads-on-the-same-small-reference-sequence. This community seems to be more specialized and active, though. Naturally, I will report back here if helpful answers show up there.
It is not clear what kind of data the OP has, but both BWA and Bowtie / Bowtie2 accept FASTA and FASTQ as input, so this should cover almost all input data.
The data is indeed in FASTQ format.
I was mainly wondering if there existed specific alignment methods that are very efficient if the reference is really short (so it is sequenced extremely deep), while BWA and Bowtie were designed with very long references in mind (whole genomes). But this may not be the case, and maybe these algorithms are the optimal choice for short references as well. Something like "if it can handle a long reference, it will perform even better with a short one".
That would also be an answer to my question! I'll wait just a little longer to see if extra suggestions for this specific problem come in, but otherwise I'll just accept this answer. Thanks!
There is no difference between using a long or short reference as far as the aligners/mappers are concerned.
High coverage depth is typically only a problem for de Bruijn graph assemblers, which can sometimes choke on it and obscure sub-populations of reads carrying variants. It shouldn't matter for an aligner.
Some mappers need special care when building indexes for very small or very large genomes. STAR, for example, needs some parameter changes.
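As I recall from the STAR manual, for small genomes the suffix-array index parameter `--genomeSAindexNbases` should be scaled down from its default of 14, roughly as min(14, log2(GenomeLength)/2 - 1). A quick calculation for a 1500 bp reference:

```python
# STAR's rule of thumb for small genomes (per its manual):
# --genomeSAindexNbases = min(14, log2(GenomeLength)/2 - 1)
import math

def sa_index_nbases(genome_length):
    return min(14, int(math.log2(genome_length) / 2 - 1))

print(sa_index_nbases(1500))  # → 4
```

So for a 1500 bp reference one would pass something like `--genomeSAindexNbases 4` when building the STAR index.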
I stand corrected in that case, but I’d wager BWA and Bowtie would give adequate results with default parameters.
Indeed. I was just being picky.
Well, thanks for your insights, everyone! I'll give BWA and/or Bowtie a shot. Maybe later I'll comment on how it went, but for now I'll consider the question answered.
Hi Cedric, I am undertaking a similar project and I was wondering which software worked best for you, and whether you have any suggestions or recommendations? My reference sequence is around 1500 bp as well.