Question

How to combine multiple tools to detect SVs in WES data

1

6.1 years ago

vivekruhela ▴ 20

Hi,

I want to use multiple tools (e.g. GATK, splitread, etc.) to detect structural variation in WES data. I can run each of them individually, but I would like to combine their results to improve accuracy, and I don't know how to combine them. I need suggestions for better results in SV detection.

Thanks.

SNP R next-gen sequence snp • 3.5k views

ADD COMMENT • link 6.1 years ago by vivekruhela ▴ 20

0

From my experience, the best combination is pindel + CNVkit + ONCOcnv ;-)

ADD REPLY • link 6.1 years ago by Korsocius ▴ 250

0

@Korsocius: Thanks for the reply. I was planning to use the combination GATK + Splitread + Sprites because I want a low false positive rate, a good F-score, and more novel SVs. Maybe I am wrong. Can you explain why pindel + CNVkit + ONCOcnv is a good combination? And what do SV caller merging apps actually do? Do they just merge VCF files, which we could also do ourselves? Please enlighten me.

ADD REPLY • link 6.1 years ago by vivekruhela ▴ 20

score 4 · Answer 1 · 2018-03-09

And what do SV caller merging apps actually do? Do they just merge VCF files, which we could also do ourselves? Please enlighten me.

SV merging is non-trivial due to the notational and detection differences between the various tools. Even getting their output into a standard format is a challenge in itself. E.g. BreakDancer, Socrates, HYDRA, and GRIDSS (my tool, I highly recommend it ;-) ) report all events in VCF breakend notation. Other tools use the alternate SVTYPE=INS/DEL/INV/DUP notation, and still others report the REF and ALT base sequences directly. Determining that the BND pair of records from one caller, the DUP call from another, and the ALT sequence that is longer than the REF in a third caller are actually the same call is a non-trivial task. On top of this, CNV callers are fundamentally different in that they report (changes in) the abundance of DNA segments instead of the novel DNA sequence adjacencies that breakpoint callers report. Add inexact calling and sequence homology on top of that and you have quite the task ahead.
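
As a rough illustration of the first hurdle, here is a minimal Python sketch that only tells the three notations apart from a record's REF and ALT columns. The function name `sv_notation` and the regular expression are my own for illustration, not part of any caller or library, and single breakends (ALT such as `A.`) are deliberately ignored.

```python
import re

# Paired breakend ALT alleles embed the mate locus in square brackets,
# e.g. "A[chr1:100[" or "]chr1:100]A". Single breakends are not handled.
BND_ALT = re.compile(r"[\[\]][^\[\]]+:\d+[\[\]]")


def sv_notation(ref: str, alt: str) -> str:
    """Classify which VCF notation a single REF/ALT pair uses (hypothetical helper)."""
    if BND_ALT.search(alt):
        return "breakend"   # SVTYPE=BND style records
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"   # <INS>, <DEL>, <DUP>, <INV>, ...
    return "sequence"       # explicit REF and ALT bases


for ref, alt in [("A", "ACTCAG[contig:4["), ("A", "<INS>"), ("A", "ACTCAG")]:
    print(ref, alt, sv_notation(ref, alt))
```

Knowing the notation is only the start: the records still have to be converted to a common representation before they can be compared.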

I have an R package (https://github.com/PapenfussLab/StructuralVariantAnnotation) that addresses the matching of calls from breakpoint-based callers but it doesn't convert that into a consensus call set, nor does it handle CNV calls.

I need suggestions for better results in SV detection.

Running multiple callers to ensure coverage of the range of SVs you're interested in is a good approach (e.g. a general purpose SV breakpoint caller, a specialised microsatellite caller, and a CNV caller). Generating a consensus call set based on multiple callers of the same type (e.g. pindel+delly+lumpy+manta+gridss) does not necessarily give you better results. There is considerable overlap in FPs between callers using the same methods, and in many cases you're better off just using the results of the best-in-class caller.

As you only have WES: what classes of SVs are you hoping to detect?

score 0 · Answer 2 · 2018-03-08

0

6.1 years ago

Rohit ★ 1.5k

SV-Merge and MetaSV already perform merging with Illumina paired-end data. If you have long reads, then give NextSV a try.

ADD COMMENT • link 6.1 years ago by Rohit ★ 1.5k

0

@Rohit: Mean coverage is around 100x and the read length is 75 bp. Is NextSV suitable for my data? Also, MetaSV is a Python package and I am working in R. Is there an equivalent package in Bioconductor or elsewhere in R? Thanks.

ADD REPLY • link 6.1 years ago by vivekruhela ▴ 20

0

NextSV is based on long reads, so I don't think you can apply it to your data. intansv seems good, though I have never tested it.

ADD REPLY • link 6.1 years ago by Rohit ★ 1.5k

1

If you want a standardised format to compare and annotate SVs in R, my StructuralVariantAnnotation package works with a wide range of callers, as well as any VCF file that correctly follows the standard, but it doesn't actually do the merging (this is non-trivial since SV matches are not necessarily transitive).
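
To make the non-transitivity point concrete, here is a toy Python sketch. The caller names, breakpoint positions and the 10 bp matching margin are all made up for illustration and do not come from any real tool.

```python
def matches(pos_a: int, pos_b: int, margin: int = 10) -> bool:
    """Two breakpoint calls 'match' if they lie within `margin` bp of each other."""
    return abs(pos_a - pos_b) <= margin


# Hypothetical calls of the same breakpoint from three callers.
calls = {"caller_A": 100, "caller_B": 108, "caller_C": 116}

a_b = matches(calls["caller_A"], calls["caller_B"])  # 8 bp apart
b_c = matches(calls["caller_B"], calls["caller_C"])  # 8 bp apart
a_c = matches(calls["caller_A"], calls["caller_C"])  # 16 bp apart

print(a_b, b_c, a_c)
```

A matches B and B matches C, but A does not match C, so pairwise matching alone does not say whether these are one, two or three events. A merging tool has to impose its own rule to decide, and different rules give different consensus sets.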

ADD REPLY • link 6.1 years ago by d-cameron ★ 2.9k

0

Sorry for the late reply. I have checked intansv. The package does not produce a VCF file as output, and I am not sure whether it is suitable for GATK variant callers.

ADD REPLY • link 6.1 years ago by vivekruhela ▴ 20

score 0 · Answer 3 · 2018-03-15

0

6.1 years ago

vivekruhela ▴ 20

I am using CombineVariants to merge the VCF files obtained from GATK (HaplotypeCaller), samtools and pindel. With this I can either extract their intersection, i.e. the variants that are common to all VCF files, or use the union of all variants found by any of the callers.

ADD COMMENT • link 6.1 years ago by vivekruhela ▴ 20

0

Based on my reading of the documentation, it looks like CombineVariants is not SV-aware and will only work for SNVs and small indels. Variant calls using SVTYPE notation are likely to be incorrectly merged by that tool.

This may or may not be acceptable for your use case.

ADD REPLY • link 6.1 years ago by d-cameron ★ 2.9k

0

Sorry for the late response. I have checked the documentation of CombineVariants, and it does not mention whether the tool is SV-aware or limited to SNPs and indels. What does your experience say about this, and what is the reason? I'm using the genotype option PRIORITIZE to merge the VCF files of the same sample. What errors could this introduce? Thanks for your reply. Let me know the reasons ASAP.

ADD REPLY • link 6.1 years ago by vivekruhela ▴ 20

1

There does not exist any tool that performs the haplotype sequence reconstruction required to correctly combine SV variants in all cases.

ADD REPLY • link 6.1 years ago by d-cameron ★ 2.9k

0

It's easiest to show an example. The following variants are all just different representations of the same variant. If a tool doesn't explicitly handle all of these representations then it won't merge them correctly, and that's even before CIPOS has to be considered.

Insertion
123-----4567890
ATA-----GGTTCGC
ATACTCAGGGTTCGC
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
contig    3    ins_indel_representation1    A    ACTCAG    .    .    
contig    4    ins_indel_representation2    G    CTCAGG    .    .    
contig    3    ins_svtype_representation1    A    <INS>    .    .    SVTYPE=INS;SVLEN=5;END=3
contig    4    ins_svtype_representation2    G    <INS>    .    .    SVTYPE=INS;SVLEN=5;END=4
contig    3    ins_bnd_1    A    ACTCAG[contig:4[    .    .    SVTYPE=BND;PARID=ins_bnd_2;EVENT=example_ins
contig    4    ins_bnd_2    G    ]contig:3]CTCAGG    .    .    SVTYPE=BND;PARID=ins_bnd_1;EVENT=example_ins
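
The example above can be sketched in Python to show why a naive merge fails. This is only an illustration under strong assumptions: it handles simple insertions only, it ignores CIPOS and homology, the breakend branch assumes the mate is on the same contig, and the names `insertion_summary` and `same_event` as well as the 1 bp margin are mine rather than from any existing tool.

```python
import re

# Paired breakend ALT: optional leading bases, bracketed mate locus, optional trailing bases.
BND = re.compile(
    r"^(?P<lead>[ACGTN]*)[\[\]](?P<chrom>[^:\[\]]+):(?P<pos>\d+)[\[\]](?P<trail>[ACGTN]*)$"
)


def parse_info(info: str) -> dict:
    """Turn 'K1=V1;K2=V2' into a dict; flag fields without '=' are skipped."""
    return dict(f.split("=", 1) for f in info.split(";") if "=" in f)


def insertion_summary(chrom, pos, ref, alt, info=""):
    """Return (chrom, anchor position, inserted length) for a simple insertion."""
    m = BND.match(alt)
    if m:  # breakend pair: inserted bases sit next to the REF base in the ALT
        seq = m.group("lead") or m.group("trail")
        return (chrom, min(pos, int(m.group("pos"))), len(seq) - len(ref))
    if alt.startswith("<"):  # symbolic allele: length only available from SVLEN
        return (chrom, pos, abs(int(parse_info(info)["SVLEN"])))
    if alt.startswith(ref):  # REF base is the anchor before the insertion
        return (chrom, pos + len(ref) - 1, len(alt) - len(ref))
    if alt.endswith(ref):    # REF base is the base after the insertion
        return (chrom, pos - 1, len(alt) - len(ref))
    raise ValueError("not a simple insertion")


def same_event(a, b, margin=1):
    """Same contig and inserted length, anchors within `margin` bp (assumed tolerance)."""
    return a[0] == b[0] and a[2] == b[2] and abs(a[1] - b[1]) <= margin


records = [
    ("contig", 3, "A", "ACTCAG", ""),
    ("contig", 4, "G", "CTCAGG", ""),
    ("contig", 3, "A", "<INS>", "SVTYPE=INS;SVLEN=5;END=3"),
    ("contig", 4, "G", "<INS>", "SVTYPE=INS;SVLEN=5;END=4"),
    ("contig", 3, "A", "ACTCAG[contig:4[", "SVTYPE=BND;PARID=ins_bnd_2;EVENT=example_ins"),
    ("contig", 4, "G", "]contig:3]CTCAGG", "SVTYPE=BND;PARID=ins_bnd_1;EVENT=example_ins"),
]

naive_keys = {(c, p, r, a) for c, p, r, a, _ in records}  # exact CHROM/POS/REF/ALT keys
summaries = [insertion_summary(*rec) for rec in records]
all_same = all(same_event(summaries[0], s) for s in summaries)
print(len(naive_keys), all_same)
```

Keying on CHROM/POS/REF/ALT treats the six records as six different variants, whereas the summaries agree on the inserted length and on the anchor to within 1 bp. Note that the second `<INS>` record still ends up anchored one base away from the others, so even this toy case needs a position tolerance.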

ADD REPLY • link 6.1 years ago by d-cameron ★ 2.9k

0

Sorry, I have one more question: is it OK to combine two gVCF files, one from HaplotypeCaller and another from UnifiedGenotyper? I think it may be OK because both are complete files (i.e. consensus calls) that contain a record for each position. Can we do this? Thanks.

ADD REPLY • link 6.1 years ago by vivekruhela ▴ 20