Hello,
I'm currently doing my master’s internship in a genomic diagnostic lab. I was tasked with building a Snakemake workflow to detect large insertions in whole-genome sequencing (WGS) data within fragile sites.
My question is regarding the final output of my Snakemake workflow, which is a BED file containing all putative insertions (INS) larger than 50 base pairs. I would like to add an additional annotation to this BED file for the inserted sequence size at each INS breakpoint, but I’m unsure if my algorithm makes sense.
Here’s my idea:
- For each breakpoint, use samtools to extract discordant reads with mates mapping to another chromosome.
- Check if there is a consensus chromosome origin for these discordant mate reads.
- If a consensus chromosome is identified, extract the forward and reverse out-mapping discordant mate reads on this chromosome.
- Estimate the insertion length by subtracting the position of the rightmost forward read from the position of the leftmost reverse read.
Does this approach make sense?
ILLUSTATION
Reference genome:
-------------------------------------| bp |---------------------------------------
r1______________--------
r2 _________________________ --------
Actual inserted sequence:
_start
r1-2 __ Rev |---------------
r2-2 __________________________--------------| Forv
_stop
Fomula: size_ins = stop - start
Of course the insertes sequence should come from the same genome.
Cheers.
Can you format the Illustration a lot better than what is there now? This may be one instance where uploading an image may be significantly better than trying to format this as
code
in Biostars edit window.