Extract SV indicating pair-end reads
1
0
Entering edit mode
4.2 years ago
ncxhit • 0

Hi everyone, I am new to bioinformatics and I was reading a review article on how to find structure variation using pair-end sequencing. While I understand what it means fine (if there is a deletion the distance between two read pair would decline and so on) I am struggling to write a program to find these "SV indicating" reads. Here's what I have done.

  1. align my reads to the reference genome using BWA-mem
  2. Using the following python code to find read pairs with either distance too big/small or hanging pair or alignment quality too small or bwa thinks it's not a proper match.
if (np.abs(read1_pos - read_mate_pos) > read_distance_big_threshold):
    output_pair_end_fastq(read, read_mate)
    # print("Distance Too Big")
elif read.mate_is_unmapped:
    output_pair_end_fastq(read, read_mate)
    # print("Hanging Pair")
elif (read.is_reverse == read.mate_is_reverse):
    output_pair_end_fastq(read, read_mate)
    # print("Orientation!")
elif np.abs(read_mate_pos - read1_pos) < read_distance_small_threshold:
    output_pair_end_fastq(read, read_mate)
    # print("Distance Too Small")
elif not read.is_proper_pair:
    output_pair_end_fastq(read, read_mate)
    # print("Not Proper Pair")
elif read.mapping_quality < read_mapping_quality_threshold:
    output_pair_end_fastq(read, read_mate)
    # print("Quality Too Low")
  

The normal distance of my pair end reads is 500bp and the read_distance_big_threshold is 600 while read_distance_small_threshold is 400.

My problem is I am having tons of false positives meaning I have found a lot of "non SV indicating reads" in my way. So my question is "Is there any existing tools that can do this kind of thing for me? "

Specificly I am looking for a tool to find all the reads indicated in this graph

Structure Variation pair-end sequencing • 950 views
ADD COMMENT
0
Entering edit mode

Is there any existing tools that can do this kind of thing for me?

i'm pretty sure that the article cites those tools....

ADD REPLY
0
Entering edit mode

I wouldn't be using 10 year old bioinformatics software unless it's been constantly updated over that period. BreakDancer is a perfect example of such a highly cited unmaintained SV caller. It performs terribly on read length is over 50bp and pretty much every tool is benchmarked against it and, unsurprisingly, everyone outperforms BreakDancer.

ADD REPLY
1
Entering edit mode
4.1 years ago
d-cameron ★ 2.9k

Is there any existing tools that can do this kind of thing for me?

There are many such tools. When I did a lit review for my thesis on SV calling 5 years ago, there were around ~40 general purpose SV callers. There's at least 70 now. I recommend chosing one of the callers that perform well (in data similar to yours, on the types of variants you want to call) in one of the recent benchmarking studies:

Comprehensive evaluation and characterisation of short read general-purpose structural variant calling software.

Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing.

ADD COMMENT
0
Entering edit mode

elif not read.is_proper_pair:

FYI: you can't trust that flag to be set correctly by the aligner. Some aligners set it to true if the chromosome and orientation is correct, regardless of the distance between the read pairs.

ADD REPLY

Login before adding your answer.

Traffic: 1810 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6