Question: Best aligner for reporting all exact matches of multi-mapping short reads?
gravatar for Ryan Thompson
4.7 years ago by
Ryan Thompson3.5k
TSRI, La Jolla, CA
Ryan Thompson3.5k wrote:

I have about 30 GB (gzipped fastq) of small RNA (miRNA, etc.) sequencing data that I've been asked to align to the human genome (hg38), reporting for each read every position in the genome where it matches exactly. Bowtie2 has been my go-to mapper for short reads in the past, so I put together a pipeline something like this:

zcat Sample_1.fastq.gz | \
  cutadapt -a TGGAATTCTCGGGTGCCAAGG -f fastq --overlap 15 \
    --trimmed-only --minimum-length 10 - | \
  bowtie2 -x ~/references/hg38/hg38-bt2/hg38 -U - -q --phred33 \
    --end-to-end --very-sensitive -a --score-min C,0,-1 -L 10 \
  picard-tools SortSam INPUT=/dev/stdin OUTPUT=Sample_1.bam;
picard-tools BuildBamIndex INPUT=Sample_1.bam OUTPUT=Sample_1.bam.bai

However, the Bowtie2 manual carries this warning about the -a option for reporting all alignments for each read:

Some tools are designed with this reporting mode in mind. Bowtie 2 is not! For very large genomes, this mode is very slow.

Nevertheless, because I already had it set up, I decided to give it a try. Apparently, the statement is quite accurate, because one sample (about 300 MB before trimming) has already been running for 168 CPU-hours without finishing, which is at least an order of magnitude or two longer than I'd typically expect a 300 MB fastq file to require when looking for just one or a few alignments for each read. So, is there either another set of options for bowtie2 or, more likely, another aligner better suited for the task of quickly finding all exact matches for each read?

mirna alignment speed bowtie2 • 2.6k views
ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Ryan Thompson3.5k

Try BBMap. You would want to use ambig=all option to get all matches. Play with other other parameters (idfilter) for getting precise matches.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by GenoMax96k

This seems like a good option. Maybe you should convert this comment to an answer?

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Ryan Thompson3.5k
gravatar for Brian Bushnell
4.7 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

Just to add, if all you care about is all perfect alignments, I recommend these BBMap flags:

ambig=all vslow perfectmode maxsites=1000

It should be very fast in that mode (despite the vslow flag). Vslow mainly removes masking of low-complexity repetitive kmers, which is not usually a problem but can be with extremely short sequences like microRNAs.

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Brian Bushnell17k

This worked great! Thanks for the suggestion.

ADD REPLYlink written 4.7 years ago by Ryan Thompson3.5k

What if I (by which I mean my collaborators) also want to map directly to a database of mature miRNA sequences instead of mapping to the genome? I expect the same options to mostly work, but the one case I'd worry about is when a read is longer than an annotated mature miRNA sequence by one or 2 bases. This would presumably no longer constitute an exact match, even when the overlapping bases match exactly. Is there any way to tell BBMap that overhangs on the end of the reference should be allowed? I see the "semiperfectmode" option, which allows otherwise perfect matches to overhang into N's in the reference, but it's not clear if this option also allows overhangs at the ends of the reference.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by Ryan Thompson3.5k

Hi Ryan,

Yes, "semiperfectmode" is for that specific case. Internally, the ends of contigs are padded with Ns for alignment, so going off the end by two bases is equivalent score-wise to having 2 no-calls in the reference. Before the aligned read is printed to the same file those bases are soft-clipped as per the sam specification, so you don't see them in the cigar string, but it is still considered a "semi-perfect" alignment.


ADD REPLYlink written 4.7 years ago by Brian Bushnell17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1663 users visited in the last hour