How to determine gap length of aligned segments for a given read?
2
2
Entering edit mode
5.2 years ago
cnvspam ▴ 80

Say I have a read with two supplementary alignments. How do I infer the gap length of the two aligned segments relative to the read?

In the example above, the gap length is the size of the insertion (orange thin line). I'm confused as heck as how to calculate the gap length from reads in a BAM file. Would the CIGAR string be informative, or do I have to go back to the original FASTQ?

sequencing alignment split • 2.2k views
0
Entering edit mode

In this kind of situation, I think unsplit alignments are easier to use, since that information is captured in the cigar string. BBMap will produce unsplit alignments supporting long deletions and medium-length insertions (you can never capture a long insertion in a short read, using split or unsplit alignments).

1
Entering edit mode
5.2 years ago

First, make sure the get the terminology right. Distances between supplementary alignments are not usually called gaps.

The gaps are usually mentioned relative to an aligned sequence. For any individual alignment, the CIGAR string has that information as a count followed by either of the letters D or I. 5 deletions would be 5D , 10 insertions would be written as 10I of the read relative to a reference.

Once you get to supplementary alignments you have to decide what you wish to call the "gap". For both reads the leftmost position will be listed in the POS column. Then for both reads you can compute the alignment length (that is where the end position is). Once you have these you can figure out whatever distance you want to compute (end to end, end to start etc).

That distance is still not a gap though - it is just a distance that you might interpret as a gap based on some other rationale. Usually, gap means missing, so you would need to state why you think anything should be there to being with.

0
Entering edit mode

HI thanks for the response. I'm using the terminology the authors of the figure used above. I realize that "gap" is not the best word to describe.

What you described, taking the left and right most positions of each aligned segment and determining the distance between the two, is described by the authors as "reference length"

Insertions and deletions are discerned based on the relation between the gap size g [the thin orange line in the figure above, which is the size of the insert], and the reference length L = | left position segment 1 - right position segment 2 |

Am I crazy to think that you can figure out the positioning of segments on the original fragment? The authors of the paper seem to have done it.

0
Entering edit mode
5.2 years ago

The image above appears to suggest that the "gap" in this definition is the distance between the rightmost coordinate of one alignment to the leftmost coordinate of the other alignment.

The easiest way to compute that would be to transform your BAM file into the BED format. You can do that with bedtools

The BED file will list the start and end coordinates of each alignment. You can then find the reads with identical names and compute the distance as described above. Depending on how many such reads there are you can do it by hand or you'd probably need to write a simple script to process the data.