Calling TE sequences from VCF files
2.5 years ago
elcortegano ▴ 140

I have run Sniffles to call structural variants from PacBio reads. In the resulting VCF file, most of the entries in the ALT field look like this:

N[chromosome_3:7199420[


In the example above, the record itself is on a different chromosome (CHROM=chromosome_1, POS=270281), so I guess this is a transposable element from chromosome 3 that is present at that location.

I am not familiar with this format for the ALT field, and was wondering if there is a straightforward way to get the sequence of that element (or of any other structural variant found in the VCF). Any ideas?

next-gen sniffles variant_calling
13 months ago
Shunhua ▴ 20

What you saw is a BND (breakend) record, which represents an arbitrary rearrangement event with two breakends. Here t is the replacement sequence and p is the position of the mate breakend, written as chromosome:position. The t[p[ format means “piece extending to the right of p is joined after t” (see details in https://samtools.github.io/hts-specs/VCFv4.2.pdf).
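To make the notation concrete, here is a minimal Python sketch that splits a breakend ALT string into the mate chromosome, the mate position, and the join type, following the four forms listed in the VCF 4.2 specification. The names parse_bnd_alt, BND_ALT and MEANING are made up for this illustration and are not part of Sniffles or of any VCF library.

```python
import re

# One pattern for the four breakend ALT forms in the VCF 4.2 spec:
#   t[p[   t]p]   ]p]t   [p[t
# where t is the replacement sequence and p is the mate position "chrom:pos".
BND_ALT = re.compile(
    r"^(?P<lead>[^\[\]]*)"         # t, when it comes before the mate position
    r"(?P<bracket>[\[\]])"         # opening bracket
    r"(?P<chrom>.+):(?P<pos>\d+)"  # mate position p
    r"(?P=bracket)"                # the same bracket again
    r"(?P<trail>[^\[\]]*)$"        # t, when it comes after the mate position
)

# (bracket, t comes first) -> wording used in the specification
MEANING = {
    ("[", True): "piece extending to the right of p is joined after t",
    ("]", True): "reverse comp piece extending left of p is joined after t",
    ("]", False): "piece extending to the left of p is joined before t",
    ("[", False): "reverse comp piece extending right of p is joined before t",
}


def parse_bnd_alt(alt):
    """Return a dict describing a breakend ALT, or None if it is not one."""
    m = BND_ALT.match(alt)
    # Exactly one of lead/trail must hold the replacement sequence t.
    if m is None or bool(m.group("lead")) == bool(m.group("trail")):
        return None
    t_first = bool(m.group("lead"))
    return {
        "mate_chrom": m.group("chrom"),
        "mate_pos": int(m.group("pos")),
        "replacement": m.group("lead") or m.group("trail"),
        "meaning": MEANING[(m.group("bracket"), t_first)],
    }


print(parse_bnd_alt("N[chromosome_3:7199420["))
```

For the ALT value in the question, this reports chromosome_3 as the mate chromosome and 7199420 as the mate position. Note that the record only describes the junction between the two loci, so the sequence of the rearranged piece is not stored in the ALT field itself.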

In Sniffles output, this is most likely a translocation event, which may or may not involve a transposable element. To extract the SV sequence, you can run Sniffles with the -n -1 option so that it reports all SV-supporting reads for each SV entry in the VCF; the read IDs are then listed under RNAMES= in the INFO field. The first ID usually represents the "primary SV read", which contains a representative sequence.
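As a rough sketch of that last step, the Python snippet below pulls the read IDs out of the RNAMES= entry of a single VCF data line. It assumes that RNAMES holds a comma-separated list of read names; the function name rnames_from_vcf_line and the example record (including the read names and the other INFO values) are invented for illustration and are not real Sniffles output.

```python
def rnames_from_vcf_line(line):
    """Return the read IDs listed under RNAMES= in a VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return []          # not a complete VCF data line
    info = fields[7]       # INFO is the 8th column of a VCF record
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        if key == "RNAMES" and sep:
            # Assumption: read names are separated by commas.
            return [name for name in value.split(",") if name]
    return []


# Hypothetical record modelled on the variant in the question.
example = "\t".join([
    "chromosome_1", "270281", "42", "N", "N[chromosome_3:7199420[",
    ".", "PASS", "SVTYPE=BND;RNAMES=read_a,read_b;RE=2", "GT", "0/1",
])

reads = rnames_from_vcf_line(example)
primary = reads[0] if reads else None   # first ID: the "primary SV read"
print(primary, reads)
```

With the read IDs in hand, the corresponding sequences can then be looked up in the original FASTQ or in the alignment file used as input to Sniffles.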

If you are using long-read data and are interested in getting all non-reference transposable element sequences based on Sniffles output, you can use TELR (https://github.com/bergmanlab/TELR), which runs Sniffles, finds candidate TE loci, and reports their sequences using a local assembly strategy.