I have processed Bowtie 2 alignments by filtering for MAPQ greater than or equal to 1, which retains a subset of multi-mapping alignments (those with up to five mismatches) while excluding the rest. This means my data include some multi-mappers.
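For reference, the filtering step amounts to something like the minimal pysam sketch below (`input.bam` and `filtered.bam` are placeholder names, not my actual files):

```python
import pysam

# Keep alignments with MAPQ >= 1 and write them to a new BAM.
# File names are placeholders for illustration only.
with pysam.AlignmentFile("input.bam", "rb") as bam_in, \
     pysam.AlignmentFile("filtered.bam", "wb", template=bam_in) as bam_out:
    for read in bam_in:
        if read.mapping_quality >= 1:
            bam_out.write(read)
```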
To estimate effective genome size, I'm considering one of two common methods:
- Counting non-N bases in the reference genome, commonly used when all multi-mapping alignments are retained (i.e., when no MAPQ filtering is applied); see the first sketch after this list.
- Estimating the number of unique k-mers (where k matches the read length), which approximates the number of uniquely alignable bases and is used when multi-mappers are excluded; see the second sketch after this list.
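For concreteness, a minimal sketch of option 1 in plain Python; `genome.fa` is a placeholder path:

```python
# Option 1: effective genome size = number of non-N bases in the reference.
def count_non_n_bases(fasta_path):
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):  # skip FASTA headers
                continue
            total += sum(1 for base in line.strip().upper() if base != "N")
    return total

print(count_non_n_bases("genome.fa"))  # placeholder reference path
```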
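And a sketch of option 2, here counting k-mers that occur exactly once as one way to approximate uniquely alignable positions. This version holds every k-mer in memory, so it only suits small genomes; in practice tools such as khmer's unique-kmers.py or Jellyfish are used, and a rigorous calculation would also collapse reverse-complement duplicates. `genome.fa` and k = 50 are assumptions:

```python
from collections import Counter

def singleton_kmer_count(fasta_path, k):
    """Count k-mers (skipping any containing N) that occur exactly
    once in the genome. Holds every k-mer in memory, so this is a
    small-genome sketch, not a production approach."""
    counts = Counter()
    seq = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tally(counts, "".join(seq), k)  # finish previous record
                seq = []
            else:
                seq.append(line.strip().upper())
    tally(counts, "".join(seq), k)  # finish the last record
    return sum(1 for n in counts.values() if n == 1)

def tally(counts, seq, k):
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            counts[kmer] += 1

print(singleton_kmer_count("genome.fa", k=50))  # k = read length (assumed 50)
```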
Since my data retain a subset of multi-mappers but exclude the least confident alignments, should I use option 1 because multi-mappers are present? Or does the filtering bias my data toward higher-complexity regions, making option 2 the more appropriate choice?
Thanks, yes.
The resources you posted distinguish the case where multi-mappers are included (no MAPQ filtering) from the case where they are excluded (MAPQ filtering). However, I'm asking about an edge case where MAPQ filtering excludes only a subset of the multi-mappers. My default approach has been to approximate the mappable genome using non-N bases (option 1 above, as in the posted resources), but is that the correct choice for this specific edge case?
It seems so, but confirmation along with an explanation would be helpful.
How much data is being excluded because of this? If it is a small fraction, my instinct would be not to worry about it. If it is a large fraction, then switching to option 2 may be the better choice.
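For what it's worth, a quick pysam sketch to check that fraction; `aligned.bam` stands in for your unfiltered BAM:

```python
import pysam

# Fraction of mapped alignments that the MAPQ >= 1 cutoff removes.
kept = dropped = 0
with pysam.AlignmentFile("aligned.bam", "rb") as bam:  # placeholder path
    for read in bam:
        if read.is_unmapped:
            continue
        if read.mapping_quality >= 1:
            kept += 1
        else:
            dropped += 1

print(f"excluded fraction: {dropped / (kept + dropped):.3f}")
```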
Are you working with an organism that lacks a good reference genome, or one for which few genome sequences are available? Was anything special done when these libraries were made? Is the reference you are using from the exact organism?