Question

Estimate large insertions size in WGS data.

0

Entering edit mode

16 months ago

meetmet ▴ 10

Hello,

I'm currently doing my master’s internship in a genomic diagnostic lab. I was tasked with building a Snakemake workflow to detect large insertions in whole-genome sequencing (WGS) data within fragile sites.

My question is regarding the final output of my Snakemake workflow, which is a BED file containing all putative insertions (INS) larger than 50 base pairs. I would like to add an additional annotation to this BED file for the inserted sequence size at each INS breakpoint, but I’m unsure if my algorithm makes sense.

Here’s my idea:

For each breakpoint, use samtools to extract discordant reads with mates mapping to another chromosome.
Check if there is a consensus chromosome origin for these discordant mate reads.
If a consensus chromosome is identified, extract the forward and reverse out-mapping discordant mate reads on this chromosome.
Estimate the insertion length by subtracting the position of the rightmost forward read from the position of the leftmost reverse read.

Does this approach make sense?

                                                        ILLUSTATION

Reference genome:

-------------------------------------| bp |---------------------------------------

r1______________--------

r2 _________________________ --------

Actual inserted sequence:

_start

r1-2 __ Rev |---------------

r2-2 __________________________--------------| Forv

_stop

Fomula: size_ins = stop - start

Of course the insertes sequence should come from the same genome.

Cheers.

NGS Diagnostic SV WGS • 586 views

ADD COMMENT • link updated 16 months ago by GenoMax 154k • written 16 months ago by meetmet ▴ 10

1

Entering edit mode

Can you format the Illustration a lot better than what is there now? This may be one instance where uploading an image may be significantly better than trying to format this as code in Biostars edit window.

ADD REPLY • link 16 months ago by GenoMax 154k