Hi all,
I've been testing the alignment of some PacBio reads using pbmm2 and Minimap2 and observed a phenomenon that I can't make sense of.
With both tools, I'm frequently seeing the same reads producing many supplementary alignments (as many as 93 in one case!) in which it appears that many of the same bases of the original read are being reused in the supplementary alignments. For example, here are the CIGAR strings from some pbmm2 alignments:
m84137_250219_010828_s1/255856922/ccs/rev       0       chrA    3219    60      1040S602=1I595=1I931=1X11=3162S
m84137_250219_010828_s1/255856922/ccs/rev       2064    chrA    3219    60      1041S185=1I417=1I1517=3182S
And here are some from a Minimap2 alignment:
m84137_250219_010828_s1/261165604/ccs/fwd       0       chr14   77485520        60      2047S748M4I976M4D459M
m84137_250219_010828_s1/261165604/ccs/fwd       2064    chr14   77485651        60      2178H617M4I976M4D459M
In both cases, it looks like a significant portion of the original read is being re-used in each alignment. In my application, this makes it difficult to accurately quantify the unique read (and base) counts that align to different parts of the genome. (And no, I can't just ignore the supplementary alignments, unfortunately).
As I was trying to figure this out, I came across a section of the pbmm2 GitHub documentation on "repeated matches trimming" that pbmm2 is supposed to perform by default. The documentation states "we see pbmm2 as a blasr replacement and require that a single base may never be aligned twice". The documentation goes on to describe the pbmm2 repeated matches trimming approach as follows: "[w]e align the read, find overlapping query intervals, and trim non-primary alignments in order as provided by minimap2; trimming here means that pbmm2 rewrites the cigar and the reference coordinates on-the-fly."
To me, this seems at odds with what I am observing.
I then created this issue on the Minimap2 page to try to make sense of things. Heng Li told me that in my example, "the overlap length is tiny on the query", that the alignments "are on different strands", and that this "behavior is expected and desired".
To me, the overlap length seems substantial on the query, and this behavior is neither desired nor expected. I assume I'm misunderstanding something but in my thinking, if the same part of a query sequence aligns to multiple places in the reference genome (on either strand), the best-scoring alignment should be called the primary alignment and all others should be labeled secondary alignments.
Can someone help me understand what I'm missing here?