Dear Biostars community,
This is a rather broad, theoretical question, so apologies in advance. I am a PhD student looking at the genetic architecture of a rare disease using WGS data. As part of this I am looking at structural variation in my cohort of 1500 patients with the disease and roughly 17000 controls (all Europeans as selected by PCA).
The structural variants have been called using Manta and Canvas, and for each patient there is a "structural SV" vcf.gz file which is a merger of all the Manta and Canvas calls. The calling was done by the central consortium, although we also have access to the BAM files.
From my reading it looks like ultimately one "calls" SVs by pruning and filtering to the point of being able to visualize potential changes in a genome viewer on a case control level (I appreciate functional assays would then be needed to prove any suspicions). To me this seems as though one would miss potential biology, as well as being pretty tedious.
Question 1: what are acceptable filtering criteria for "rare" SVs? I was thinking of keeping SVs with <0.001% allele frequency that pass basic QC, and taking it from there. In terms of merging "similar" calls, I was going to merge those that overlap by >50%.
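For concreteness, the filtering and merging proposed in Question 1 could be sketched as below. This is only an illustration under stated assumptions: the `SV` record, its `af` field (a precomputed allele frequency) and the function names are hypothetical, "overlap >50%" is interpreted as reciprocal overlap between calls of the same type on the same chromosome, and the clustering is a simple greedy pass rather than what any particular tool does.

```python
from dataclasses import dataclass

@dataclass
class SV:
    chrom: str
    start: int
    end: int
    svtype: str   # e.g. "DEL", "DUP"
    af: float     # allele frequency as a fraction (assumed precomputed)

def reciprocal_overlap(a, b):
    """Overlap length divided by the longer call; 0.0 if not comparable."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return 0.0
    overlap = min(a.end, b.end) - max(a.start, b.start)
    if overlap <= 0:
        return 0.0
    # Dividing by the longer interval means BOTH calls are covered by at
    # least this fraction, i.e. the overlap is reciprocal.
    return overlap / max(a.end - a.start, b.end - b.start)

def merge_similar(svs, min_overlap=0.5):
    """Greedy clustering: a call joins the first cluster whose
    representative (first member) it overlaps by more than min_overlap."""
    clusters = []
    for sv in sorted(svs, key=lambda s: (s.chrom, s.start, s.end)):
        for cluster in clusters:
            if reciprocal_overlap(cluster[0], sv) > min_overlap:
                cluster.append(sv)
                break
        else:
            clusters.append([sv])
    return clusters

def filter_rare(svs, max_af=1e-5):
    """Keep calls with allele frequency below max_af (1e-5 == 0.001%)."""
    return [sv for sv in svs if sv.af < max_af]
```

One point the sketch makes visible: with 18,500 diploid samples there are 37,000 alleles, so a single observed allele already corresponds to a frequency of about 0.0027%. A 0.001% threshold would therefore only be meaningful against a larger external reference, not within the cohort itself.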
Question 2: Are there non-visual methods to annotate and call SVs on a case control level? SV-Int has been mooted but mainly focuses on non-coding regions (http://compbio.berkeley.edu/proj/svint/).
The sheer volume of SVs called at these patient numbers is vast, and a visual method seems a rather terrifying prospect.
Things I've tried so far:
- SURVIVOR (to merge VCFs on nearby BPs, https://github.com/fritzsedlazeck/SURVIVOR): doesn't work with zipped files, unfortunately
- SVtools: the merged VCFs with Manta and Canvas calls seem to upset it when using lmerge and lsort; I will try to sort this out
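As a possible workaround for tools that only read uncompressed VCFs, the vcf.gz files could be decompressed into a scratch directory first. The following is a minimal sketch using only the Python standard library; the function names and directory layout are hypothetical, and writing a plain-text list of paths (one per line) is an assumption about what a downstream merge step would want as input, so check the tool's README.

```python
import gzip
import shutil
from pathlib import Path

def decompress_vcf(gz_path, out_dir):
    """Write an uncompressed copy of a .vcf.gz file into out_dir."""
    gz_path = Path(gz_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # "sample.vcf.gz" -> "sample.vcf"
    out_path = out_dir / gz_path.with_suffix("").name
    # gzip.open also reads bgzip-compressed files, which are
    # concatenated gzip blocks.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path

def write_path_list(vcf_paths, list_path):
    """Write one VCF path per line to a plain-text file."""
    list_path = Path(list_path)
    list_path.write_text("".join(f"{p}\n" for p in vcf_paths))
    return list_path
```

With roughly 18,500 samples the uncompressed copies may take considerable disk space, so it may be worth processing in batches and deleting the scratch files afterwards.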
Things I've looked into:
- SVE (https://github.com/TheJacksonLaboratory/SVE): would need to run from the BAMs from scratch, so I am keen to avoid it
- MAVIS (https://github.com/bcgsc/mavis): seems promising, but I'm not sure whether VCFs can be input into it
- This pipeline from the Hall group (https://github.com/hall-lab/sv-pipeline): again seems promising, but a) I would need to start from scratch with the BAMs and b) the outputs would then be visualised at a case control level
Ideally, I would like an approach that uses the existing VCFs (in zipped format).
Once again, if you've got this far, thank you for reading, and apologies for the long, rather theoretical question!
All the best