I have processed Bowtie 2 alignments by filtering for MAPQ greater than or equal to 1, which retains a subset of multi-mapping alignments (those with up to five mismatches) while excluding the rest. This means my data include some multi-mappers.
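For reference, the filtering step amounts to something like the minimal pysam sketch below (`input.bam` and `filtered.bam` are placeholder names, not my actual files):

```python
import pysam

# Keep alignments with MAPQ >= 1 and write them to a new BAM.
# File names are placeholders for illustration only.
with pysam.AlignmentFile("input.bam", "rb") as bam_in, \
     pysam.AlignmentFile("filtered.bam", "wb", template=bam_in) as bam_out:
    for read in bam_in:
        if read.mapping_quality >= 1:
            bam_out.write(read)
```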
To estimate effective genome size, I'm considering one of two common methods:
- Counting non-N bases in the reference genome, commonly used when all multi-mapping alignments are retained (i.e., when no MAPQ filtering is applied); see the first sketch after this list.
- Estimating the number of unique k-mers (where k matches the read length), which approximates the number of uniquely alignable bases and is used when multi-mappers are excluded; see the second sketch after this list.
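For concreteness, a minimal sketch of option 1 in plain Python; `genome.fa` is a placeholder path:

```python
# Option 1: effective genome size = number of non-N bases in the reference.
def count_non_n_bases(fasta_path):
    total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):  # skip FASTA headers
                continue
            total += sum(1 for base in line.strip().upper() if base != "N")
    return total

print(count_non_n_bases("genome.fa"))  # placeholder reference path
```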
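And a sketch of option 2, here counting k-mers that occur exactly once as one way to approximate uniquely alignable positions. This version holds every k-mer in memory, so it only suits small genomes; in practice tools such as khmer's unique-kmers.py or Jellyfish are used, and a rigorous calculation would also collapse reverse-complement duplicates. `genome.fa` and k = 50 are assumptions:

```python
from collections import Counter

def singleton_kmer_count(fasta_path, k):
    """Count k-mers (skipping any containing N) that occur exactly
    once in the genome. Holds every k-mer in memory, so this is a
    small-genome sketch, not a production approach."""
    counts = Counter()
    seq = []
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                tally(counts, "".join(seq), k)  # finish previous record
                seq = []
            else:
                seq.append(line.strip().upper())
    tally(counts, "".join(seq), k)  # finish the last record
    return sum(1 for n in counts.values() if n == 1)

def tally(counts, seq, k):
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" not in kmer:
            counts[kmer] += 1

print(singleton_kmer_count("genome.fa", k=50))  # k = read length (assumed 50)
```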
Since my data retain a subset of multi-mappers but exclude the least confident alignments, should I use option 1 because multi-mappers are present? Or does the filtering bias my data toward higher-complexity regions, making option 2 the more appropriate choice?
Thanks, yes.
The resources you posted distinguish the case where multi-mappers are included (no MAPQ filtering) from the case where they are excluded (MAPQ filtering). However, I'm asking about an edge case where MAPQ filtering excludes only a subset of the multi-mappers. My default approach has been to approximate the mappable genome using non-N bases (option 1 above, as in the posted resources), but is that the correct choice for this specific edge case?
It seems so, but confirmation along with an explanation would be helpful.
How much data is being excluded because of this? If it is a small fraction, my instinct would be not to worry about it. If it is a large fraction, then switching to option 2 may be the better choice.
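For what it's worth, a quick pysam sketch to check that fraction; `aligned.bam` stands in for your unfiltered BAM:

```python
import pysam

# Fraction of mapped alignments that the MAPQ >= 1 cutoff removes.
kept = dropped = 0
with pysam.AlignmentFile("aligned.bam", "rb") as bam:  # placeholder path
    for read in bam:
        if read.is_unmapped:
            continue
        if read.mapping_quality >= 1:
            kept += 1
        else:
            dropped += 1

print(f"excluded fraction: {dropped / (kept + dropped):.3f}")
```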
Are you working with an organism that lacks a good reference genome, or one for which few genome sequences are available? Was anything special done when these libraries were made? Is the reference you are using from the exact organism?