The context: I have a DNA sequence coding for a protein, about 1500 bp in length. Using NGS, many reads of (mutants of) this same sequence were acquired, and all of them need to be aligned to the reference. We're talking about a lot of reads (100,000 - 1,000,000) of moderate length (120 - 300 bp). Each read can belong to a different mutant of the template sequence, so the alignment is necessary to determine the exact sequence of every single mutant.
Currently, I'm just using Smith-Waterman-based local alignment to align every single read to the reference one by one. Yet I can't help but feel there might be a more computationally/time-efficient solution to this specific problem.
Maybe there exists an algorithm that isn't very efficient for most alignment problems, but that becomes very worthwhile if it has to map a ton of reads to the same place over and over again. For example, it might do some time-consuming preprocessing on the short template that makes it fast to align reads to it, preprocessing that wouldn't pay off if only a couple of reads were being aligned. That's just an idea; I don't know all the different techniques that are out there.
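The idea described above (preprocess the reference once, then align many reads cheaply) is essentially seed-and-extend indexing, which is what most modern mappers do. A minimal sketch in Python, assuming a simple k-mer position index and exact-match seeds; all names and the seed length are illustrative, not any specific tool's API:

```python
# Sketch of "index the reference once, then anchor many reads against it".
# Build a k-mer -> positions index of the reference (done once), then for
# each read let exact k-mer seeds vote on the read's start offset on the
# reference. A full local alignment would then only be run around that
# anchor instead of against the whole reference.
import random
from collections import defaultdict

K = 11  # seed length; an assumption, tune for the expected error rate

def build_index(reference, k=K):
    """Map every k-mer of the reference to its start positions (built once)."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def anchor_read(read, index, k=K):
    """Vote on the most likely reference offset using exact k-mer seeds."""
    votes = defaultdict(int)
    for j in range(len(read) - k + 1):
        for pos in index.get(read[j:j + k], ()):
            votes[pos - j] += 1  # implied start position of the read
    return max(votes, key=votes.get) if votes else None

# Illustration with a random 1500 bp stand-in reference:
random.seed(0)
ref = "".join(random.choice("ACGT") for _ in range(1500))
index = build_index(ref)
read = ref[200:320]  # a perfectly matching 120 bp read
print(anchor_read(read, index))  # → 200
```

Because seeds only need to match exactly between mutations, point mutations in a read leave most seeds intact, so the voting still anchors the read correctly; the exact variant calls would come from the subsequent local alignment around the anchor.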
So, to recap: I want to align a lot (100,000 - 1,000,000) of 120-300 bp reads to the same short 1500 bp reference sequence. If anyone has any suggestions about an algorithm, or just a specific workflow, that is particularly suited to do this, it would be appreciated. I work in R so I can implement some things myself, it doesn't have to be a ready-to-use software package or anything like that. Thanks in advance!
Note: I'm new at this, but I also posted this question at https://bioinformatics.stackexchange.com/questions/4913/efficiently-aligning-a-lot-of-reads-on-the-same-small-reference-sequence. This community seems to be more specialized and active, though. Naturally, I will report back here if helpful answers show up there.
It is not clear what kind of data the OP has, but both BWA and Bowtie / Bowtie2 accept FASTA and FASTQ as input, so this should cover almost all input data.
The data is indeed in FASTQ format.
I was mainly wondering if there existed specific alignment methods that are very efficient if the reference is really short (so it is sequenced extremely deep), while BWA and Bowtie were designed with very long references in mind (whole genomes). But this may not be the case, and maybe these algorithms are the optimal choice for short references as well. Something like "if it can handle a long reference, it will perform even better with a short one".
That would also be an answer to my question! I'll wait just a little longer to see if extra suggestions for this specific problem come in, but otherwise I'll just accept this answer. Thanks!
There is no difference between using a long or short reference as far as the aligners/mappers are concerned.
High coverage depth is typically only a problem for de Bruijn graph assemblers, which can sometimes choke on it and obscure sub-populations of reads carrying variants. It shouldn't matter for an aligner.
Some mappers need special care when building indexes for very small or very large genomes. STAR, for example, needs some parameter changes.
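As I recall from the STAR manual, for small genomes the suffix-array index parameter `--genomeSAindexNbases` should be scaled down from its default of 14, roughly as min(14, log2(GenomeLength)/2 - 1). A quick calculation for a 1500 bp reference:

```python
# STAR's rule of thumb for small genomes (per its manual):
# --genomeSAindexNbases = min(14, log2(GenomeLength)/2 - 1)
import math

def sa_index_nbases(genome_length):
    return min(14, int(math.log2(genome_length) / 2 - 1))

print(sa_index_nbases(1500))  # → 4
```

So for a 1500 bp reference one would pass something like `--genomeSAindexNbases 4` when building the STAR index.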
I stand corrected in that case, but I’d wager BWA and Bowtie would give adequate results with default parameters.
Indeed. I was just being picky.
Well, thanks for your insights, everyone! I'll give BWA and/or Bowtie a shot. Maybe later I'll comment on how it went, but for now I'll consider the question answered.
Hi Cedric, I am undertaking a similar project and I was wondering which software worked best for you, and whether you have any suggestions or recommendations? My reference sequence is around 1500 bp as well.