1
0
Entering edit mode
14 months ago
ncxhit • 0

Hi everyone, I am new to bioinformatics and I was reading a review article on how to find structure variation using pair-end sequencing. While I understand what it means fine (if there is a deletion the distance between two read pair would decline and so on) I am struggling to write a program to find these "SV indicating" reads. Here's what I have done.

1. align my reads to the reference genome using BWA-mem
2. Using the following python code to find read pairs with either distance too big/small or hanging pair or alignment quality too small or bwa thinks it's not a proper match.
if (np.abs(read1_pos - read_mate_pos) > read_distance_big_threshold):
# print("Distance Too Big")
# print("Hanging Pair")
# print("Orientation!")
# print("Distance Too Small")
# print("Not Proper Pair")
# print("Quality Too Low")


The normal distance of my pair end reads is 500bp and the read_distance_big_threshold is 600 while read_distance_small_threshold is 400.

My problem is I am having tons of false positives meaning I have found a lot of "non SV indicating reads" in my way. So my question is "Is there any existing tools that can do this kind of thing for me? "

Specificly I am looking for a tool to find all the reads indicated in this graph

Structure Variation pair-end sequencing • 260 views
0
Entering edit mode

Is there any existing tools that can do this kind of thing for me?

i'm pretty sure that the article cites those tools....

0
Entering edit mode

I wouldn't be using 10 year old bioinformatics software unless it's been constantly updated over that period. BreakDancer is a perfect example of such a highly cited unmaintained SV caller. It performs terribly on read length is over 50bp and pretty much every tool is benchmarked against it and, unsurprisingly, everyone outperforms BreakDancer.

1
Entering edit mode
14 months ago
d-cameron ★ 2.3k

Is there any existing tools that can do this kind of thing for me?

There are many such tools. When I did a lit review for my thesis on SV calling 5 years ago, there were around ~40 general purpose SV callers. There's at least 70 now. I recommend chosing one of the callers that perform well (in data similar to yours, on the types of variants you want to call) in one of the recent benchmarking studies:

Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software.

Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing.

0
Entering edit mode

FYI: you can't trust that flag to be set correctly by the aligner. Some aligners set it to true if the chromosome and orientation is correct, regardless of the distance between the read pairs.