Question

Understanding Re-use of Query Sequences in Long Read Alignments

0

Entering edit mode

7 months ago

Charles-Alexandre Roy ▴ 70

Hi all,

I've been testing the alignment of some PacBio reads using pbmm2 and Minimap2 and observed a phenomenon that I can't make sense of.

With both tools, I'm frequently seeing the same reads producing many supplementary alignments (as many as 93 in one case!) in which it appears that many of the same bases of the original read are being reused in the supplementary alignments. For example, here are the CIGAR strings from some pbmm2 alignments:

m84137_250219_010828_s1/255856922/ccs/rev       0       chrA    3219    60      1040S602=1I595=1I931=1X11=3162S
m84137_250219_010828_s1/255856922/ccs/rev       2064    chrA    3219    60      1041S185=1I417=1I1517=3182S

And here are some from a Minimap2 alignment:

m84137_250219_010828_s1/261165604/ccs/fwd       0       chr14   77485520        60      2047S748M4I976M4D459M
m84137_250219_010828_s1/261165604/ccs/fwd       2064    chr14   77485651        60      2178H617M4I976M4D459M

In both cases, it looks like a significant portion of the original read is being re-used in each alignment. In my application, this makes it difficult to accurately quantify the unique read (and base) counts that align to different parts of the genome. (And no, I can't just ignore the supplementary alignments, unfortunately).

As I was trying to figure this out, I came across a section of the pbmm2 GitHub documentation on "repeated matches trimming" that pbmm2 is supposed to perform by default. The documentation states "we see pbmm2 as a blasr replacement and require that a single base may never be aligned twice". The documentation goes on to describe the pbmm2 repeated matches trimming approach as follows: "[w]e align the read, find overlapping query intervals, and trim non-primary alignments in order as provided by minimap2; trimming here means that pbmm2 rewrites the cigar and the reference coordinates on-the-fly."

To me, this seems at odds with what I am observing.

I then created this issue on the Minimap2 page to try to make sense of things. Heng Li told me that in my example, "the overlap length is tiny on the query", that the alignments "are on different strands", and that this "behavior is expected and desired".

To me, the overlap length seems substantial on the query, and this behavior is neither desired nor expected. I assume I'm misunderstanding something but in my thinking, if the same part of a query sequence aligns to multiple places in the reference genome (on either strand), the best-scoring alignment should be called the primary alignment and all others should be labeled secondary alignments.

Can someone help me understand what I'm missing here?

Alignment PacBio pbmm2 Minimap2 • 1.4k views

ADD COMMENT • link updated 7 months ago by LauferVA 4.8k • written 7 months ago by Charles-Alexandre Roy ▴ 70

score 1 · Answer 1 · 2025-03-29

Charles-Alexandre Roy , You're right to wonder - even a very thoughtful approach, like yours, will generate some confusion. I'll indicate some "clues" below.

Clue 1: Secondary Alignments versus Supplementary Alignments In a supplementary alignment, a chimeric alignments read maps to multiple genomic locations or regions due to structural variations like inversions, large insertions/deletions (Flag 2048). In a secondary alignment, the entire read maps well to multiple genomic locations equally (Flag 256). Here, you seem to be describing supplementary alignments (2048). This should indicate to you several things - for instance that you are likely in a repetitive region, or complex SV, etc. This isn't unexpected because PacBio CCS can resolve complex and repetitive regions - and T2T taught us these aren't all that uncommon. (Bonus question: why do we infer this, rather than wondering if the issue stems from CCS reads themselves?)

Clue 2: Study your Stogies (CIGAR Strings) You've observed that the overlaps in your CIGAR strings are large (e.g. 1040S602=...3162S) and made a reasonable inference that these bases are being repeatedly aligned. However, the bases marked as clipped ('S' or 'H') aren't actually being realigned; they're probably being shown as clipped from the alignment to indicate they're not aligned at that position.

Charles-Alexandre Roy this more than anything else is probably the key point ... In the CIGAR string, 'S' (soft clipping) means those query bases exist but are not part of this specific alignment line. Almost everything else in this post flows from that, really ...

Understanding Heng Li's answers

As a corollary, I suspect this may also relate to your confusion regarding Heng Li's answers. Visually these overlaps might appear substantial, the truly aligned (not clipped) segments usually have very minimal overlaps or none at all. In other words, though visually it seems otherwise, behind-the-scences pbmm2 ensures the same base isn't genuinely aligned more than once (i.e. performs "repeated match trimming" & trims overlapping alignments on-the-fly). When Heng says the overlap is "tiny," he's referring to these small regions of aligned, non-softclipped bases, which may align to the other strand, or a slightly shifted regions due to repetitive sequences or partial adapters.

Way Forward: Isolate primary and supplementary alignments

Use samtools view with flags to isolate primary and supplementary alignments clearly. Parse aligned regions precisely with bioinformatics libraries like pysam. Explicitly remove duplicate bases by tracking already-counted query bases. Consider alignment post-processing utilities (bedtools intersect) to handle overlaps systematically. Verify your results visually using genome browsers like IGV for sanity checks.

Thanks for a thoughtful and timely question - I expect that many folks will benefit from this over time. Welcome additions/improvements from others (e.g. if this isn't current).