A student and me are testing the mirDeep2 pipeline on the Drosophila genome. mirDeep2 is using bowtie internally but we got a relatively low mapping rate and therefore also fed the pipeline with an alignment generated by BWA. BWA yielded a larger fraction of aligned reads, and naively I would assume that this should also lead to a larger number of detected miRNAs, but the opposite seems to be true. Why could this be the case? We have used public data only, supporting information below.
Sample: SRR019717 (Drosophila melanogaster), downloaded from SRA Reads were trimmed using Trimmomatic, min length 18. Alignment rate: Total = 5,265,951 reads after trimming Aligned (Bowtie) = 3,090,394 (58.69%) reads Aligned (BWA) = 4,644,269 (88.19%) reads Result: In species: 438 Novel (Bowtie): 25 miRNAs Known (Bowtie): 105 miRNAs In data (Bowtie, at least one read mapped back): 198 Novel (BWA): 33 miRNAs Known (BWA): 79 miRNAs In data (BWA): 135
Update: we calculated the overlap between bowtie and bwa aln alignments using bx-python:
Note: the observed overlap Bowtie <-> BWA is vastly asymmetric. A large proportion of Bowtie alignments is also covered at least once by BWA, but BWA also seems to cover a large number of locations that Bowtie does not cover. This might be explained by the hypothesis that BWA alignments are more uniformly spread out over the genome, while Bowtie might generate more localized stacks of reads.
bed_intersect_basewise.py SRR019717_trimmed_bowtie.bed SRR019717_trimmed_bwa.bed >~/baseoverlap.bed; wc -l ~/baseoverlap.bed **215,130** bed_intersect.py SRR019717_trimmed_bowtie.bed SRR019717_trimmed_bwa.bed >~/overlap.bed **3,067,089** bed_intersect.py SRR019717_trimmed_bwa.bed SRR019717_trimmed_bowtie.bed >~/overlap2.bed **2,596,771**
A precision-recall plot using the output of mirDeep2 at different cutoffs between -10..10, making somewhat arbitrary assumptions, that all novel miRNAs are false positives, and that there are 438 miRNA's known in mirBase as reported by mirDeep2.
Edit: We have several candidates, but these need to be checked carefully:
- Absurd coverage: if there are really only ~500 miRNA of length ~100bp including precursor, even only 1M reads yield already > 500X coverage of the miRNA-ome. Maybe there is an upper limit for local coverage in the pipeline?
- Multi-mappers: More mapped reads might also yield more multi-mapped reads, maybe there is a filter "upper limit multi-mapping" in the pipeline, especially if multi-mapping is to protein coding transcripts?
- Pipeline possibly uses a specific feature of the SAM output of Bowtie?
- Most miRNAs in mirBase from Drosophila where possibly predicted using mirDeep with Bowtie?