Question: mirDeep2 using bowtie vs. bwa - why do more aligned reads yield less miRNA
3
gravatar for Michael Dondrup
3.4 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

A student and me are testing the mirDeep2 pipeline on the Drosophila genome. mirDeep2 is using bowtie internally but we got a relatively low mapping rate and therefore also fed the pipeline with an alignment generated by BWA. BWA yielded a larger fraction of aligned reads, and naively I would assume that this should also lead to a larger number of detected miRNAs, but the opposite seems to be true. Why could this be the case? We have used public data only, supporting information below.

Sample: SRR019717 (Drosophila melanogaster), downloaded from SRA
Reads were trimmed using Trimmomatic, min length 18.

Alignment rate:
Total = 5,265,951 reads after trimming
Aligned (Bowtie) = 3,090,394 (58.69%) reads
Aligned (BWA) = 4,644,269 (88.19%) reads

Result:
In species: 438
Novel (Bowtie): 25 miRNAs
Known (Bowtie): 105 miRNAs
In data (Bowtie, at least one read mapped back): 198
Novel (BWA): 33 miRNAs
Known (BWA): 79 miRNAs 
In data (BWA): 135

Update: we calculated the overlap between bowtie and bwa aln alignments using bx-python:

Note: the observed overlap Bowtie <-> BWA is vastly asymmetric. A large proportion of Bowtie alignments is also covered at least once by BWA, but BWA also seems to cover a large number of locations that Bowtie does not cover. This might be explained by the hypothesis that BWA alignments are more uniformly spread out over the genome, while Bowtie might generate more localized stacks of reads.

bed_intersect_basewise.py SRR019717_trimmed_bowtie.bed SRR019717_trimmed_bwa.bed >~/baseoverlap.bed; wc -l ~/baseoverlap.bed
**215,130**
bed_intersect.py SRR019717_trimmed_bowtie.bed SRR019717_trimmed_bwa.bed >~/overlap.bed
**3,067,089**    
bed_intersect.py SRR019717_trimmed_bwa.bed SRR019717_trimmed_bowtie.bed >~/overlap2.bed
**2,596,771**

A precision-recall plot using the output of mirDeep2 at different cutoffs between -10..10, making somewhat arbitrary assumptions, that all novel miRNAs are false positives, and that there are 438 miRNA's known in mirBase as reported by mirDeep2.

Precision recall plot based on mirDeep output

mirdeep2-bowtie.html mirdeep2-bwa.html

Possible culprits

Edit: We have several candidates, but these need to be checked carefully:

  • Absurd coverage: if there are really only ~500 miRNA of length ~100bp including precursor, even only 1M reads yield already > 500X coverage of the miRNA-ome. Maybe there is an upper limit for local coverage in the pipeline?
  • Multi-mappers: More mapped reads might also yield more multi-mapped reads, maybe there is a filter "upper limit multi-mapping" in the pipeline, especially if multi-mapping is to protein coding transcripts?
  • Pipeline possibly uses a specific feature of the SAM output of Bowtie?
  • Most miRNAs in mirBase from Drosophila where possibly predicted using mirDeep with Bowtie?
mirdeep2 • 2.2k views
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Michael Dondrup47k
2

I find a previous post which may be interesting to look at.

ADD REPLYlink written 3.4 years ago by GZ1995350

Thank you, we will try to boost bowtie parameters to reach a higher alignment rate. Then we will check if the recall increases or drops.

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Michael Dondrup47k

I would wonder about the overlap between the two aligners.

ADD REPLYlink written 3.4 years ago by WouterDeCoster43k

We have calculated the intersections between the two aligners.

ADD REPLYlink written 3.4 years ago by Michael Dondrup47k
4
gravatar for Michael Dondrup
3.4 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

Think we are approaching a first solution to the mirDeep paradox.

Differences in SAM output

Most likely the pipeline is adapted to the peculiarities of the output format of bowtie. Look at the differences for a random multi-mapping read:

Bowtie:
SRR019717.35    16      2R_dna:chromosome_chromosome:BDGP6:2R:1:25286936:1_REF  5049582 255     25M     *       0       0       AGCTTTCCGCTGCCAGGCATTCTTC       ;;;;;;;;;;;;;;;;;;;;;;;;;       XA:i:0  MD:Z:25 NM:i:0
SRR019717.35    16      2L_dna:chromosome_chromosome:BDGP6:2L:1:23513712:1_REF  1302172 255     25M     *       0       0       AGCTTTCCGCTGCCAGGCATTCTTC       ;;;;;;;;;;;;;;;;;;;;;;;;;       XA:i:0  MD:Z:25 NM:i:0
SRR019717.35    16      2R_dna:chromosome_chromosome:BDGP6:2R:1:25286936:1_REF  5056566 255     25M     *       0       0       AGCTTTCCGCTGCCAGGCATTCTTC       ;;;;;;;;;;;;;;;;;;;;;;;;;       XA:i:0  MD:Z:25 NM:i:0

BWA:    
SRR019717.35    16      2R_dna:chromosome_chromosome:BDGP6:2R:1:25286936:1_REF  5056566 0       25M     *       0       0       AGCTTTCCGCTGCCAGGCATTCTTC       ;;;;;;;;;;;;;;;;;;;;;;;;;       XT:A:R  NM:i:0  X0:i:3X1:i:18  XM:i:0  XO:i:0  XG:i:0  MD:Z:25

While Bowtie outputs one line for each hit, BWA chooses one of the mapping positions and uses the X0 tag to report that there are multiple mappings. However the average number of alignments per aligned read in Bowtie is 1.39, maximum alignments per read are capped by 5, otherwise it is reported as unaligned. In our BWA run, there was no cap on the number of alignments, but the pipeline has no way to find alternative alignments, reducing the coverage. If we assume, that most real miRNA most probably come in 2-5 copies, these have their coverage massively reduced by counting only a single random alignment.

Other differences:

  • Mapping quality, in Bowtie it is always 255 per mapped read, in BWA it is 16.48 on average (here 0)
  • Tags: BWA has more tags, but is lacking the XA:i: tag

Conclusion One cannot use BWA as a drop in replacement for Bowtie in mirDeep2! Other aligners might work, but this has to be checked.

ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Michael Dondrup47k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1163 users visited in the last hour