Question: Extract SV indicating pair-end reads
gravatar for ncxhit
3 months ago by
ncxhit0 wrote:

Hi everyone, I am new to bioinformatics and I was reading a review article on how to find structure variation using pair-end sequencing. While I understand what it means fine (if there is a deletion the distance between two read pair would decline and so on) I am struggling to write a program to find these "SV indicating" reads. Here's what I have done.

  1. align my reads to the reference genome using BWA-mem
  2. Using the following python code to find read pairs with either distance too big/small or hanging pair or alignment quality too small or bwa thinks it's not a proper match.
if (np.abs(read1_pos - read_mate_pos) > read_distance_big_threshold):
    output_pair_end_fastq(read, read_mate)
    # print("Distance Too Big")
elif read.mate_is_unmapped:
    output_pair_end_fastq(read, read_mate)
    # print("Hanging Pair")
elif (read.is_reverse == read.mate_is_reverse):
    output_pair_end_fastq(read, read_mate)
    # print("Orientation!")
elif np.abs(read_mate_pos - read1_pos) < read_distance_small_threshold:
    output_pair_end_fastq(read, read_mate)
    # print("Distance Too Small")
elif not read.is_proper_pair:
    output_pair_end_fastq(read, read_mate)
    # print("Not Proper Pair")
elif read.mapping_quality < read_mapping_quality_threshold:
    output_pair_end_fastq(read, read_mate)
    # print("Quality Too Low")

The normal distance of my pair end reads is 500bp and the read_distance_big_threshold is 600 while read_distance_small_threshold is 400.

My problem is I am having tons of false positives meaning I have found a lot of "non SV indicating reads" in my way. So my question is "Is there any existing tools that can do this kind of thing for me? "

Specificly I am looking for a tool to find all the reads indicated in this graph

ADD COMMENTlink modified 12 weeks ago by d-cameron2.2k • written 3 months ago by ncxhit0

Is there any existing tools that can do this kind of thing for me?

i'm pretty sure that the article cites those tools....

ADD REPLYlink written 3 months ago by Pierre Lindenbaum129k

I wouldn't be using 10 year old bioinformatics software unless it's been constantly updated over that period. BreakDancer is a perfect example of such a highly cited unmaintained SV caller. It performs terribly on read length is over 50bp and pretty much every tool is benchmarked against it and, unsurprisingly, everyone outperforms BreakDancer.

ADD REPLYlink written 12 weeks ago by d-cameron2.2k
gravatar for d-cameron
12 weeks ago by
d-cameron2.2k wrote:

Is there any existing tools that can do this kind of thing for me?

There are many such tools. When I did a lit review for my thesis on SV calling 5 years ago, there were around ~40 general purpose SV callers. There's at least 70 now. I recommend chosing one of the callers that perform well (in data similar to yours, on the types of variants you want to call) in one of the recent benchmarking studies:

Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software.

Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing.

ADD COMMENTlink modified 12 weeks ago • written 12 weeks ago by d-cameron2.2k

elif not read.is_proper_pair:

FYI: you can't trust that flag to be set correctly by the aligner. Some aligners set it to true if the chromosome and orientation is correct, regardless of the distance between the read pairs.

ADD REPLYlink written 12 weeks ago by d-cameron2.2k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 955 users visited in the last hour