Estimating effective genome size when retaining a higher-confidence subset of multi-mappers in alignment data
12 days ago
kalavattam ▴ 310

I have processed Bowtie 2 alignments by filtering for MAPQ greater than or equal to 1, which retains a subset of multi-mapping alignments (those with up to five mismatches) while excluding the rest. This means my data include some multi-mappers.

To estimate effective genome size, I'm choosing between two common methods:

  1. Counting non-N bases in the reference genome—commonly used when all multi-mapping alignments are retained (i.e., when no MAPQ filtering is applied).
  2. Estimating the number of unique k-mers (where k matches the read length), which approximates the number of uniquely alignable bases and is used when multi-mappers are excluded.
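For concreteness, here is a minimal sketch of both calculations on an uncompressed FASTA, in plain Python. The function names (`read_fasta`, `count_non_n`, `kmer_counts`) are made up for illustration and do not come from any tool. It counts exactly rather than estimating, and it ignores reverse complements, so it is only practical for small genomes. Since "unique" can be read as either "distinct" or "occurring exactly once", the sketch reports both counts.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


def count_non_n(seqs):
    """Option 1: total number of bases that are not N."""
    return sum(len(s) - s.count("N") for s in seqs)


def kmer_counts(seqs, k):
    """Option 2 (exact, forward strand only; an illustrative assumption).

    Returns (distinct, singletons): the number of different k-mers and the
    number of k-mers seen exactly once. Windows containing N are skipped.
    """
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            kmer = s[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    distinct = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return distinct, singletons


if __name__ == "__main__":
    toy = [">chr1 toy", "ACGTNN", "ACGT", ">chr2", "acga"]
    seqs = [seq for _, seq in read_fasta(toy)]
    print("non-N bases:", count_non_n(seqs))
    print("distinct, singleton 3-mers:", kmer_counts(seqs, 3))
```

For a real genome with k equal to the read length, an exact counter like this will use a lot of memory. A probabilistic estimator is the usual route there.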

Since my data retain a subset of multi-mappers but exclude the least confident alignments, should I use option 1 because multi-mappers are present? Or does the filtering bias my data toward higher-complexity regions, making option 2 the more appropriate choice?

multi-mapper effective-genome-size complexity mapq • 1.0k views
12 days ago
GenoMax 149k

Estimating effective genome size

Just to confirm that you intend this to mean the "mappable" genome?

This past thread should be useful: How do I compute the effective genome size?

deepTools has a page on this calculation: https://deeptools.readthedocs.io/en/develop/content/feature/effectiveGenomeSize.html


Just to confirm that you intend this to mean the "mappable" genome?

Thanks—yes.

The resources you posted distinguish cases where multi-mappers are included (no MAPQ filtering) versus excluded (MAPQ filtering). However, I’m asking about an edge case where only a subset of multi-mappers is excluded via MAPQ filtering. My default approach has been to approximate the mappable genome using non-N bases (as in option 1 above and in the posted resources), but is that the correct choice for this specific edge case?


...but is that the correct choice for this specific edge case?

It seems like it is, but feedback along with an explanation would be helpful.


I’m asking about an edge case where only a subset of multi-mappers is excluded via MAPQ filtering.

How much data is being excluded by this filter? If it is a small fraction, my instinct would be not to worry about it. If it is a large fraction, going with option 2 may be the better choice.
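One way to put a number on "how much" is to compare mapped alignments before and after the MAPQ cutoff. Below is a rough sketch that reads SAM text lines. The function name `mapq_filter_fraction` is hypothetical, and it assumes standard tab-separated SAM records with FLAG in column 2 and MAPQ in column 5. The same two counts can be obtained with `samtools view -c -F 4` run with and without `-q 1`.

```python
def mapq_filter_fraction(sam_lines, min_mapq=1):
    """Return the fraction of mapped alignments with MAPQ below min_mapq.

    sam_lines: iterable of SAM text lines (header lines are skipped).
    Unmapped records (FLAG bit 0x4 set) are not counted at all.
    """
    mapped = kept = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete alignment record
        flag, mapq = int(fields[1]), int(fields[4])
        if flag & 0x4:
            continue  # read is unmapped
        mapped += 1
        if mapq >= min_mapq:
            kept += 1
    return (mapped - kept) / mapped if mapped else 0.0


if __name__ == "__main__":
    import sys

    # Example: samtools view -h sample.bam | python this_script.py
    print(f"fraction removed: {mapq_filter_fraction(sys.stdin):.4f}")
```

This counts alignment records rather than read pairs or fragments, so for paired-end data the fraction is per mate.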

Are you working with a genome that does not have a good reference available and/or few genome sequences available? Was there anything special done when these libraries were made? Is the reference in use for the exact organism?
