Question: How to combine multiple tools to detect SVs in WES data
1
9 months ago by
vivekruhela10
vivekruhela10 wrote:

Hi,

I want to use multiple tools (e.g. GATK, Splitread, etc.) to detect structural variation in WES data. I can run each of them individually, but I want to combine them to get better results, and I don't know how to do that. I need suggestions for better results in SV detection.

Thanks.

snp sequence next-gen R • 569 views
ADD COMMENTlink modified 8 months ago • written 9 months ago by vivekruhela10

In my experience the best combination is pindel+CNVkit+ONCOcnv ;-)

ADD REPLYlink written 9 months ago by Korsocius90

@Korsocius: Thanks for the reply. I was planning to use the combination GATK+Splitread+Sprites because I want a lower false-positive rate, a good F-score, and more novel SVs. Maybe I am wrong... can you explain why pindel+CNVkit+ONCOcnv is good? And what do SV caller merging apps actually do? Do they just merge VCF files, which we could also do ourselves? Enlighten me...

ADD REPLYlink written 9 months ago by vivekruhela10
4
9 months ago by
d-cameron1.9k
Australia
d-cameron1.9k wrote:

And what do SV caller merging apps actually do? Do they just merge VCF files, which we could also do ourselves?

SV merging is non-trivial due to the notational and detection differences of the various detection tools. Even getting them in a standard format is a challenge in itself. E.g. BreakDancer, Socrates, HYDRA, and GRIDSS (my tool, I highly recommend it ;) report all events in VCF breakend notation. Other tools use the alternate SVTYPE=INS/DEL/INV/DUP notation, others report the REF and ALT base sequences directly. Determining that the BND pair of records from one caller, the DUP call for another, and the ALT sequence that is longer than the REF in the third caller are actually the same call is a non-trivial task. On top of this, CNV callers are fundamentally different in that they report (changes in) abundance of DNA segments instead of novel DNA sequence adjacencies that the breakpoint callers report. Add inexact calling and sequence homology on top of that and you have quite the task ahead.

I have an R package (https://github.com/PapenfussLab/StructuralVariantAnnotation) that addresses the matching of calls from breakpoint-based callers but it doesn't convert that into a consensus call set, nor does it handle CNV calls.

I need suggestions for better results in SV detection.

Running multiple callers to ensure coverage of the range of SVs you're interested in is a good approach (e.g. a general purpose SV breakpoint caller, a specialised microsatellite caller, and a CNV caller). Generating a consensus call set based on multiple callers of the same type (e.g. pindel+delly+lumpy+manta+gridss) does not necessarily give you better results. There is considerable overlap in FPs between callers using the same methods, and in many cases you're better off just using the results of the best-in-class caller.

As you only have WES: what classes of SVs are you hoping to detect?

ADD COMMENTlink written 9 months ago by d-cameron1.9k
0
9 months ago by
Rohit1.3k
European union
Rohit1.3k wrote:

SV-Merge and MetaSV already perform merging with Illumina paired-end data. If you have long reads, then give NextSV a try.

ADD COMMENTlink written 9 months ago by Rohit1.3k

@Rohit: Mean coverage is around 100x and read length is 75 bp. Is NextSV suitable for my data? Also, MetaSV is a Python package and I am working in R. Is there an equivalent package in Bioconductor or R? Thanks.

ADD REPLYlink written 9 months ago by vivekruhela10

NextSV is based on long reads, so I don't think you can apply it to your data. intansv seems good, though I have never tested it.

ADD REPLYlink written 9 months ago by Rohit1.3k
1

If you want a standardised format to compare and annotate SVs in R, my StructuralVariantAnnotation package works with a wide range of callers, as well as any VCF file that correctly follows the standard, but it doesn't actually do the merging (this is non-trivial since SV matches are not necessarily transitive).
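
To make the point about transitivity concrete, here is a minimal Python sketch. It is not how StructuralVariantAnnotation or any caller matches calls; the three positions and the 10 bp tolerance are made-up values, and `matches` is a hypothetical helper. It only shows that "within N bp of each other" is not a transitive relation, so pairwise matches cannot simply be chained into clusters.

```python
# Hypothetical breakpoint calls from three callers: (chromosome, position).
# Positions and tolerance are invented for illustration only.
calls = {
    "A": ("chr1", 1000),
    "B": ("chr1", 1008),
    "C": ("chr1", 1016),
}
TOLERANCE = 10  # assumed matching window in bp


def matches(x, y, tolerance=TOLERANCE):
    """Two calls match if they share a chromosome and lie within `tolerance` bp."""
    return x[0] == y[0] and abs(x[1] - y[1]) <= tolerance


print(matches(calls["A"], calls["B"]))  # True  (8 bp apart)
print(matches(calls["B"], calls["C"]))  # True  (8 bp apart)
print(matches(calls["A"], calls["C"]))  # False (16 bp apart)
```

A matches B and B matches C, but A does not match C, so any consensus step has to decide whether these are one event, two events, or three.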

ADD REPLYlink written 8 months ago by d-cameron1.9k

Sorry for the late reply. I have checked intansv. This package does not produce a VCF file as output, and I am not sure whether it is suitable for GATK variant callers.

ADD REPLYlink written 8 months ago by vivekruhela10
0
8 months ago by
vivekruhela10
vivekruhela10 wrote:

I am using CombineVariants to merge the VCF files obtained from GATK (HaplotypeCaller), samtools and pindel. With this I can either extract their intersection (i.e. variants common to all VCF files) or keep the union of the variants found by all variant callers.

ADD COMMENTlink written 8 months ago by vivekruhela10

Based on my reading of the documentation, it looks like CombineVariants is not SV-aware and will only work for SNVs and small indels. Variant calls using SVTYPE notation are likely to be incorrectly merged by that tool.

This may or may not be acceptable for your use case.

ADD REPLYlink modified 8 months ago • written 8 months ago by d-cameron1.9k

Sorry for the late response. I checked the documentation of CombineVariants, and it says nothing about being SV-aware or about being limited to SNPs and indels. What does your experience say about this, and what is the reason? I'm using the genotype merge option PRIORITIZE to merge the VCF files of the same sample. What errors could this cause? Thanks for your reply. Let me know the reasons ASAP.

ADD REPLYlink written 8 months ago by vivekruhela10
1

There does not exist any tool that performs the haplotype sequence reconstruction required to correctly combine SV variants in all cases.

ADD REPLYlink written 8 months ago by d-cameron1.9k

It's easiest to show an example. The following records are all just different representations of the same variant. If a tool doesn't explicitly handle every representation, it won't merge them correctly, and that's before CIPOS is even considered.

Insertion
123-----4567890
ATA-----GGTTCGC
ATACTCAGGGTTCGC
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO
contig    3    ins_indel_representation1    A    ACTCAG    .    .    
contig    4    ins_indel_representation2    G    CTCAGG    .    .    
contig    3    ins_svtype_representation1    A    <INS>    .    .    SVTYPE=INS;SVLEN=5;END=3
contig    4    ins_svtype_representation2    G    <INS>    .    .    SVTYPE=INS;SVLEN=5;END=4
contig    3    ins_bnd_1    A    ACTCAG[contig:4[    .    .    SVTYPE=BND;PARID=ins_bnd_2;EVENT=example_ins
contig    4    ins_bnd_2    G    ]contig:3]CTCAGG    .    .    SVTYPE=BND;PARID=ins_bnd_1;EVENT=example_ins
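
As a rough illustration of why a position-and-allele keyed merge misses these, the Python sketch below reduces four of the records above to a common (left flanking base, right flanking base, inserted sequence) form. It is not the algorithm of any tool mentioned in this thread, and `normalise` is a hypothetical helper. It assumes everything is on one contig, handles only the `t[p[` and `]p]t` breakend forms, and ignores CIPOS and homology. The two `<INS>` records are left out because they carry no sequence, and which base POS anchors depends on the caller.

```python
import re

# Explicit-sequence and breakend records from the example: (POS, ID, REF, ALT).
RECORDS = [
    (3, "ins_indel_representation1", "A", "ACTCAG"),
    (4, "ins_indel_representation2", "G", "CTCAGG"),
    (3, "ins_bnd_1", "A", "ACTCAG[contig:4["),
    (4, "ins_bnd_2", "G", "]contig:3]CTCAGG"),
]

BND_FWD = re.compile(r"^([ACGTN]+)\[[^:]+:(\d+)\[$")  # t[p[ form
BND_REV = re.compile(r"^\][^:]+:(\d+)\]([ACGTN]+)$")  # ]p]t form


def normalise(pos, ref, alt):
    """Return (left_base, right_base, inserted_sequence) for one record."""
    m = BND_FWD.match(alt)
    if m:
        seq, mate = m.group(1), int(m.group(2))
        return (pos, mate, seq[1:])       # drop the leading reference base
    m = BND_REV.match(alt)
    if m:
        mate, seq = int(m.group(1)), m.group(2)
        return (mate, pos, seq[:-1])      # drop the trailing reference base
    if not re.fullmatch(r"[ACGTN]+", alt):
        raise ValueError(f"unsupported ALT: {alt}")
    # Explicit REF/ALT: trim the shared prefix, then the shared suffix.
    prefix = 0
    while prefix < min(len(ref), len(alt)) and ref[prefix] == alt[prefix]:
        prefix += 1
    ref, alt, pos = ref[prefix:], alt[prefix:], pos + prefix
    while ref and alt and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # pos now points at the first reference base after the left flank.
    return (pos - 1, pos + len(ref), alt)


naive_keys = {(pos, ref, alt) for pos, _id, ref, alt in RECORDS}
normalised = {normalise(pos, ref, alt) for pos, _id, ref, alt in RECORDS}

print(len(naive_keys))  # 4 distinct keys, so a naive merge sees 4 variants
print(normalised)       # {(3, 4, 'CTCAG')}, one event
```

Even this toy version needs separate code paths per notation, which is the core of the problem with tools that are not SV-aware.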
ADD REPLYlink modified 8 months ago • written 8 months ago by d-cameron1.9k

Sorry, I have one more question: is it OK to combine two gVCF files, one from HaplotypeCaller and another from UnifiedGenotyper? I think it may be OK because both are complete files (i.e. consensus calls) that contain a record for each position. Can we do this? Thanks.

ADD REPLYlink written 8 months ago by vivekruhela10