Approach to structural variation analysis - case vs control
3.9 years ago
tacrolimus ▴ 140

Dear Biostars community,

A rather broad, theoretical question, so apologies in advance. I am a PhD student looking at the genetic architecture of a rare disease using WGS data. As part of this I am looking at structural variation in my cohort of 1,500 patients with the disease and roughly 17,000 controls (all Europeans, as selected by PCA).

We have called the structural variants using Manta and Canvas, and for each patient there is a "structural SV" vcf.gz file that merges all the Manta and Canvas calls. These calls were made by the central consortium, although we also have access to the BAM files.

From my reading it looks like one ultimately "calls" SVs by pruning and filtering down to a set small enough to visualize in a genome viewer at a case-control level (I appreciate functional assays would then be needed to prove any suspicions). This seems both likely to miss potential biology and pretty tedious.

Question 1: what are acceptable filtering criteria for "rare" SVs? I was thinking of applying a <0.001% allele frequency threshold, keeping calls that pass basic QC, and taking it from there. In terms of merging "similar" calls, I was going to merge those that overlap by >50%.

Question 2: are there non-visual methods to annotate and call SVs at a case-control level? SV-Int has been mooted, but it mainly focuses on non-coding regions.

The sheer volume of SVs called at these patient numbers is vast, and a visual method seems a rather terrifying prospect.

Things I've tried so far:

  • SURVIVOR (to merge VCFs on nearby breakpoints) - doesn't work with zipped files, unfortunately

  • SVtools - the merged VCFs with Manta and Canvas calls seem to upset it when using lmerge and lsort - will try to sort this out

Things I've looked into:

  • SVE - would need to run the BAMs from scratch, so I'm keen to avoid it

  • MAVIS - seems promising, but I'm not sure whether VCFs can be input into it

  • This pipeline from the Hall group - again seems promising, but a) I'd need to start from scratch with BAM calls, and b) the outputs would then be visualised at a case-control level

An approach that uses the existing VCFs (in zipped format) would be ideal.
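In the meantime, since bgzipped VCFs are valid gzip, I can at least pull basic SV records straight out of the zipped files with Python's standard gzip module. A rough sketch (field handling deliberately simplified; assumes the usual SVTYPE/END INFO keys that Manta and Canvas emit):

```python
import gzip

def read_sv_records(path):
    """Yield (chrom, pos, svtype, end) tuples from a bgzipped VCF.

    Deliberately minimal: assumes SVTYPE and END live in the INFO
    column; anything more robust should use a proper VCF parser.
    """
    with gzip.open(path, "rt") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip headers
            fields = line.rstrip("\n").split("\t")
            chrom, pos, info = fields[0], int(fields[1]), fields[7]
            tags = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
            yield chrom, pos, tags.get("SVTYPE", "NA"), int(tags.get("END", pos))
```

Not a replacement for a real merging tool, but enough to tally SV types and sizes across the cohort without decompressing 18,500 files to disk first.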

Once again, if you've got this far, thank you for reading and apologies for the long, rather theoretical question!

All the best


Structural variation
3.9 years ago
d-cameron ★ 2.9k

1,500 patients is an awful lot of patients to have the same "rare" disease. Are these 1,500 different rare diseases, or all the same disease?

The rare-disease analyses I've seen have used family trees of affected/unaffected individuals to prune the massive noise you'll get by comparing populations.

If you indeed have 1500 patients with a disease you suspect is genetic (do you know this?), and 17000 unrelated controls, then:

  • If your cases are unrelated they may not have the same causal variant. There are many ways to break a gene. Some might have SNVs in the gene, others SVs. You might be looking for multiple SVs in a gene or even a pathway. This makes analysis really messy as we don't currently have any high quality variant effect predictors for SVs.

  • Your toolset may be inappropriate for your variants of interest. For example, if you're looking into ataxia then your causal variant is highly likely to be a pathogenic microsatellite repeat expansion. AFAIK, none of the tools you listed will detect such variants.

  • Similarly, if your causal variant is an SVA or LINE insertion (e.g. one that breaks BRCA, giving familial cancer susceptibility, but isn't currently picked up by any BRCA tests), then your variant will look like a translocation to the donor SVA or LINE, and you're likely to either a) filter it as noise (under the assumption that interchromosomal events are false positives), or b) have it not be called at all (as would be the case with manta if the donor sequence was repeated in the reference). I'd recommend my tool (GRIDSS) as it'd be capable of picking up such events, but I think you're better off solving your bigger, more general issues first.

  • Calculating SV overlap is decidedly non-trivial. Sequencing errors can adjust call position bounds, and micro-homology at the breakpoint can shift events. Furthermore, since your SV and CNV callers aren't integrated, the bounds for the CN and SV calls for the same event can be very different. Luckily, you don't have to reconcile calls from short- and long-read data, as they are particularly problematic (e.g. there's currently no tool that will correctly interpret calls made by short-read and long-read callers for events such as a SINE expansion from 2 repeats to 3 repeats).
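One workable compromise for position-based breakend matching is to treat two breakends as compatible when their uncertainty windows intersect: the caller's confidence interval (e.g. a CIPOS-style offset pair) widened by some slop on each side. A hypothetical sketch, with all numbers purely illustrative:

```python
def breakends_match(pos_a, pos_b, ci_a=(0, 0), ci_b=(0, 0), slop=300):
    """True if two breakend positions plausibly describe the same event.

    ci_* are (lower, upper) confidence-interval offsets around each
    position; slop is extra tolerance (here, very roughly one library
    fragment length - tune for your data).
    """
    lo_a, hi_a = pos_a + ci_a[0] - slop, pos_a + ci_a[1] + slop
    lo_b, hi_b = pos_b + ci_b[0] - slop, pos_b + ci_b[1] + slop
    # Two intervals intersect iff each starts before the other ends.
    return lo_a <= hi_b and lo_b <= hi_a
```

This sidesteps reciprocal-overlap thresholds entirely for breakpoint-style calls, though it still doesn't reconcile SV breakends with CNV segment bounds.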

overlap >50%

  • This is problematic for two reasons. 1) 50% is a huge margin. Are you sure you want to match a 5 kb deletion with the 10 kb deletion that fully encloses it? If you're matching manta against canvas then maybe you do; if you're matching manta against manta then you definitely don't. The entire signal for an SV (i.e. breakpoint) caller such as manta is within one library fragment length of the two breakend positions (i.e. ~300 bp).

  • No matter what your threshold is, your call set is going to be either too conservative to determine a population background, or too liberal for a manageable number of events. You mention a "<0.001% allele frequency"; that is both excessively conservative (that corresponds to 2 individuals in your 17,000 controls - are you that sure you don't have 2 individuals with your disease in your controls?) and also problematic, given you likely want to use a lower support threshold for inclusion in the controls than you do for calling variants in your cases. When I did an analysis like this, I used multiple thresholds: liberally matching to controls, but conservatively calling in cases.

  • You're going to need to come up with a better merging strategy than 50% overlap. Take the following: A=[100,200], B=[150,250], C=[200,300], D=[250,350]. Each overlaps the next by 50%, but A does not overlap D at all. One way to avoid this is to base everything on pair-wise comparisons (e.g. this call overlaps X calls in the control set), but usually people just do something like greedy clustering.

  • Try not to write your own caller/meta-caller. There are already 100+ tools out there. Most of them are terrible and, statistically speaking, yours is likely to be as well. If you want to solve your biological problem, don't accidentally turn your PhD into a methods PhD. Focus on what is achievable with the tools and data currently available to you. In this case that's the results from manta/canvas. Get as far as you can with those VCFs before turning elsewhere. Even if you don't find anything, you'll be in a much better position to know what to do next, and you'll have a much better idea of what your data looks like.
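To make two of the points above concrete - asymmetric thresholds (liberal matching to controls, conservative calling in cases) and seed-based greedy clustering that avoids chaining A through to D - here is a hypothetical sketch; every threshold is illustrative, not a recommendation:

```python
def reciprocal_overlap(a, b):
    """Shared length of two (start, end) intervals as a fraction of the
    LONGER interval (the stricter of the two fractions); 0.0 if disjoint."""
    ov = min(a[1], b[1]) - max(a[0], b[0])
    if ov <= 0:
        return 0.0
    return min(ov / (a[1] - a[0]), ov / (b[1] - b[0]))

def is_rare_case_call(call, call_qual, control_calls,
                      min_case_qual=100,    # strict: only trust well-supported case calls
                      liberal_ro=0.3,       # loose: match controls generously
                      max_control_hits=2):
    """Asymmetric filter: conservative evidence required for the case
    call itself, liberal overlap used when counting control matches."""
    if call_qual < min_case_qual:
        return False
    hits = sum(reciprocal_overlap(call, c) >= liberal_ro for c in control_calls)
    return hits <= max_control_hits

def greedy_cluster(calls, min_ro=0.5):
    """Cluster calls by comparing each to the cluster SEED (first member),
    not the most recently added member - so A=[100,200] can never end up
    merged with D=[250,350] via the B/C chain."""
    clusters = []
    for call in sorted(calls):
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], call) >= min_ro:
                cluster.append(call)
                break
        else:
            clusters.append([call])
    return clusters

# A, B, C, D from the chaining example above: seed-based clustering
# yields two clusters, [[A, B], [C, D]], rather than one chained blob.
calls = [(100, 200), (150, 250), (200, 300), (250, 350)]
```

Greedy clustering is order-dependent and crude, but it is transparent, which matters when you later have to explain why a case call was or wasn't counted as present in controls.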

TLDR: SVs are hard. Good luck with your PhD! :)


Dear Cameron,

Many thanks for the swift, honest and helpful reply. To one of your questions: 800 of the 1,500 definitely have a homogeneous genetic disease; this represents the largest such cohort ever assembled. The others potentially have a variant of it, or a phenotypically similar disease with a different genetic cause.

In regard to your other points: agreed! It already feels like one can spend a LOT of time exploring tools, glint in one's eye, seeking the code that will solve the SV issue. It just feels very different coming from SNP data, where a p-value of <10^-"large number" is the way forward, and now it's: get it into BEDPE and have a look - maybe that's a valid SV?

I think I'll start by coming up with a semi-sensible merge strategy, a way of pruning away noise (we have around 200 trios) and see what we find.

Many thanks again for taking the time to answer my questions.

All the best


I'll see what I can do with the VCFs I have (I see your lab has come up with StructuralVariantAnnotation, which I will play around with).

