Question

How to count the number of multimapped reads from a bam file?

0

Entering edit mode

22 months ago

O.rka ▴ 710

Here's my bowtie2 & featurecounts commands:

( bowtie2 -x output/reference.fa.gz -1 cleaned_1.fastq.gz -2 cleaned_2.fastq.gz --threads 2 --seed 0 --met-file bowtie2/metrics.txt --no-unal --un-conc-gz bowtie2/unmapped_%.fastq.gz  ) | ( samtools sort --threads 2 --reference output/reference.fa.gz -T tmp/samtools_sort > bowtie2/mapped.sorted.bam )

featureCounts -a output/reference.gff -o featurecounts/featurecounts.orfs.tsv -F GTF --tmpDir tmp -T 2 -g gene_id -t CDS bowtie2/mapped.sorted.bam

Here's my featurecounts summary:

Assigned    2506757
Unassigned_Unmapped 0
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 0
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   202251
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    189598

It says there's no multimapped reads for all my samples but there's a lot of different strains of the same species in there so I would expect there to be at least 1 read that's multimapped in the entire dataset which makes me think I'm doing something incorrect.

Is there an association between Unassigned_Ambiguity and Unassigned_MultiMapping?

A multi-mapping read is a read that maps to more than one location in the reference genome. There are multiple options for counting such reads.

When assigning reads to genes or exons, most reads can be successfully assigned without ambiguity. However if reads are to be assigned to transcripts, due to the high overlap between transcripts from the same gene, many reads will be found to overlap more than one transcript and therefore cannot be uniquely assigned.

https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf

rnaseq mapping bowtie2 featurecounts • 1.0k views

ADD COMMENT • link updated 22 months ago by Shred ★ 1.4k • written 22 months ago by O.rka ▴ 710

score 0 · Answer 1 · 2022-06-27

Your question title doesn't fit properly the question body: please fix it. You're not counting multimapping reads based on your command syntax. From FeatureCounts manual

By default, featureCounts does not count multi-overlapping reads. Users can specify the ‘-O’ option (set allowMultiOverlap to TRUE in R) to fully count them for each overlapping metafeature/feature (each overlapping meta-feature/feature receives a count of 1 from a read), or specify both ‘-O’ and ‘–fraction’ options (set both allowMultiOverlap and fraction to TRUE in R) to assign a fractional count to each overlapping meta-feature/feature (each overlapping meta-feature/feature receives a count of 1/y from a read where y is the total number of meta-features/features overlapping with the read).

In your case:

Unassigned MultiMapping: alignments reported for multi-mapping reads (indicated by ‘NH’ tag).

Those are reads where the multimapping events is identified by the 'NH' tag. Note that not every aligner appends this tag to the alignment.

Unassigned Ambiguity: alignments that overlap two or more features (feature-level summarization) or meta-features (meta-feature-level summarization).

Two overlapping features. This is a fairly common scenario, especially with pseudogenes or lncRNA. It's somewhat your choice how to account for those reads, and it strongly depends on experimental setup and/or objectives.

Closing, to extract multimapping reads from your bam file (check also this post How to filter/view reads mapped to multiple locations using samtools or pysam) you could filter for anything with MAPQ < x, where x is a value between 0-5. There's no consensus between community, so pick it at your own risk.