Question

Best aligner for reporting all exact matches of multi-mapping short reads?

1

Entering edit mode

7.9 years ago

Ryan Thompson ★ 3.6k

I have about 30 GB (gzipped fastq) of small RNA (miRNA, etc.) sequencing data that I've been asked to align to the human genome (hg38), reporting for each read every position in the genome where it matches exactly. Bowtie2 has been my go-to mapper for short reads in the past, so I put together a pipeline something like this:

zcat Sample_1.fastq.gz | \
  cutadapt -a TGGAATTCTCGGGTGCCAAGG -f fastq --overlap 15 \
    --trimmed-only --minimum-length 10 - | \
  bowtie2 -x ~/references/hg38/hg38-bt2/hg38 -U - -q --phred33 \
    --end-to-end --very-sensitive -a --score-min C,0,-1 -L 10 \
  picard-tools SortSam INPUT=/dev/stdin OUTPUT=Sample_1.bam;
picard-tools BuildBamIndex INPUT=Sample_1.bam OUTPUT=Sample_1.bam.bai

However, the Bowtie2 manual carries this warning about the -a option for reporting all alignments for each read:

Some tools are designed with this reporting mode in mind. Bowtie 2 is not! For very large genomes, this mode is very slow.

Nevertheless, because I already had it set up, I decided to give it a try. Apparently, the statement is quite accurate, because one sample (about 300 MB before trimming) has already been running for 168 CPU-hours without finishing, which is at least an order of magnitude or two longer than I'd typically expect a 300 MB fastq file to require when looking for just one or a few alignments for each read. So, is there either another set of options for bowtie2 or, more likely, another aligner better suited for the task of quickly finding all exact matches for each read?

miRNA alignment speed bowtie2 • 4.5k views

ADD COMMENT • link 7.9 years ago by Ryan Thompson ★ 3.6k

1

Entering edit mode

Try BBMap. You would want to use ambig=all option to get all matches. Play with other other parameters (idfilter) for getting precise matches.

ADD REPLY • link 7.9 years ago by GenoMax 142k

1

Entering edit mode

This seems like a good option. Maybe you should convert this comment to an answer?

ADD REPLY • link 7.9 years ago by Ryan Thompson ★ 3.6k

score 4 · Accepted Answer · 2016-06-10

4

Entering edit mode

7.9 years ago

Brian Bushnell 20k

Just to add, if all you care about is all perfect alignments, I recommend these BBMap flags:

ambig=all vslow perfectmode maxsites=1000

It should be very fast in that mode (despite the vslow flag). Vslow mainly removes masking of low-complexity repetitive kmers, which is not usually a problem but can be with extremely short sequences like microRNAs.

ADD COMMENT • link 7.9 years ago by Brian Bushnell 20k

0

Entering edit mode

This worked great! Thanks for the suggestion.

ADD REPLY • link 7.9 years ago by Ryan Thompson ★ 3.6k

0

Entering edit mode

What if I (by which I mean my collaborators) also want to map directly to a database of mature miRNA sequences instead of mapping to the genome? I expect the same options to mostly work, but the one case I'd worry about is when a read is longer than an annotated mature miRNA sequence by one or 2 bases. This would presumably no longer constitute an exact match, even when the overlapping bases match exactly. Is there any way to tell BBMap that overhangs on the end of the reference should be allowed? I see the "semiperfectmode" option, which allows otherwise perfect matches to overhang into N's in the reference, but it's not clear if this option also allows overhangs at the ends of the reference.

ADD REPLY • link 7.9 years ago by Ryan Thompson ★ 3.6k

0

Entering edit mode

Hi Ryan,

Yes, "semiperfectmode" is for that specific case. Internally, the ends of contigs are padded with Ns for alignment, so going off the end by two bases is equivalent score-wise to having 2 no-calls in the reference. Before the aligned read is printed to the same file those bases are soft-clipped as per the sam specification, so you don't see them in the cigar string, but it is still considered a "semi-perfect" alignment.

-Brian

ADD REPLY • link 7.9 years ago by Brian Bushnell 20k