Identifying breakpoints of split reads in SAM files
1
0
Entering edit mode
2.2 years ago
William • 0

I've been trying to detect SV signals in each sample by going through the reads at each potential SV location, but I've come across an issue when trying to identify breakpoints for inversions. I was wondering if anyone could tell me how one can use the information for split reads in a SAM file to identify breakpoints, particularly when the two splits map to opposite strands. Put another way, what information is available to take a read that has been split to determine how the two splits could be reassembled (or the orientation of that reassembly)?

For example, let's say the unaligned read looks like this:

AAAATTTT

and is split and mapped like this:

AAAA                     TTTT
----------------------------->

what information is available to determine if the original read was AAAATTTT or TTTTAAAA?

I was under the impression that the leading soft-clipped bases for splits held this information, but this seems to not be the case for splits that are on opposite strands. But, I could be wrong and hoping I'm missing something here.

As a practical case, below are two reads that are split at a potential inversion breakpoint, and I'm trying to figure out if the breakpoint is on the rightmost or leftmost position of the two alignments (excuse the emoji code in the query name):

A00434💯HM75NDMXX:2:1265:1497:27445 147 AaegL5_1 97039982 60 105M45S = 97039883 -204 AAACAAACGCCGATAAGACCCTGATCGACTCGGAACTACAATCTGTTGCGCTTTCTTCACAAACAATGGACCCACAACAGTTGGTGAGGCGCACTGGGAGGGAGCAAGTGCAACACGCTAAGAACTGGAGTCCTCCTAGCTAGTAGGAGG FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFF:FFFFFF SA:Z:AaegL5_1,97041467,+,47M103S,60,0; MC:Z:150M MD:Z:105 RG:Z:HM75NDMXX.2.1101 NM:i:0 AS:i:105 XS:i:0

A00434💯HM75NDMXX:2:1265:1497:27445 2179 AaegL5_1 97041467 60 47M103H = 97039883 -1585 CCTCCTACTAGCTAGGAGGACTCCAGTTCTTAGCGTGTTGCACTTGC FFFFFF:FFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF SA:Z:AaegL5_1,97039982,-,105M45S,60,0; MC:Z:150M MD:Z:47 RG:Z:HM75NDMXX.2.1101 NM:i:0 AS:i:47 XS:i:0

Any and all information would be greatly appreciated!

-Will

SV alignment SAM DNA • 842 views
ADD COMMENT
1
Entering edit mode
2.1 years ago
d-cameron ★ 2.9k

I've been trying to detect SV signals in each sample by going through the reads at each potential SV location

Is there a reason you're using using a SV caller (e.g. GRIDSS, manta, delly, lumpy, etc) to do this?

Put another way, what information is available to take a read that has been split to determine how the two splits

That information is encoded in the SAM flags and CIGAR fields. I strongly recommend reading the SAM file format specifications https://samtools.github.io/hts-specs/SAMv1.pdf. Split reads are called 'chimeric' read in the specifications.

what information is available to determine if the original read was AAAATTTT or TTTTAAAA?

If the 0x10 negative strand bit is set in the flags field for that record, then the actual read sequence is the reverse complement of the sequence of the SAM record.

But, I could be wrong and hoping I'm missing something here.

You are indeed. You need to look at both the flags field and the CIGAR.

As a practical case, below are two reads

That looks like you used bwa for alignment. It uses the convention that the primary read alignment contains the full read sequence, and the supplementary read sequence only contains the sequence of the mapped based (H = Hard clipped).

So, the breakpoint position implied by that split read is:

First read: 97039982+105-1 with the breakpoint after the aligned bases

Second read: 97041467+47-1 with the breakpoint after the aligned bases

But the CIGARs don't match (it's not 105M45S and 45S105M) so it looks like you have a 2bp microhomology at your breakpoint position and bwa has over-aligned the splits. You have to adjust one of those positions 2bp to the left (your choice - you'll get the same sequence either way - see Figure 9 of the VCF specifications for an example of what I'm talking about https://samtools.github.io/hts-specs/VCFv4.3.pdf).

A SV caller such as GRIDSS will handle all of this complexity for you and will report these microhomologies in it's output VCF.

a potential inversion breakpoint

If it's actually an inversion, you'll expect to see two breakpoints: one in +/+ orientation, and one in -/- orientation. If you only see one breakpoint, your rearrangement is probably more complicated than a simple inversion.

ADD COMMENT

Login before adding your answer.

Traffic: 2257 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6