Question: Choosing controls for structural variation detection
michealsmith wrote, 4.4 years ago:

I'm trying to detect structural variation (SV) from NGS data, specifically novel or rare SVs in disease samples. My research targets are common neurological diseases, not cancer, so there is no perfectly matched control to subtract background. Because read mapping and SV calling are noisy, I decided to include controls from the 1000 Genomes Project in my SV-calling runs to remove as much background noise as possible before looking for rare or novel SVs.

I know the 1000 Genomes Project provides a list of high-confidence SVs, but that list went through stringent filtering and misses many complex SVs; if I use it to filter for rare/novel variants, many false positives will remain.


  1. How many controls should I use? Is more always better? I have 20 patient whole-genome sequences to run, but given the BAM file sizes I started with only 10 low-coverage CEU WGS samples from 1000 Genomes.
  2. Many programs, such as BreakDancer and Pindel, can run on multiple files. Do these programs compute their statistics across all the files jointly, or do they analyze each file separately and then merge the per-file results?
  3. The 1000 Genomes control BAMs may have different insert sizes and be mapped to a different version of hg19/g1k_37. Does that matter when I call SVs on these BAMs together with my patient BAMs?
Zev.Kronenberg (United States) wrote, 4.4 years ago:
  1. More is better. Check if you can find the Human Genome Diversity Project data.

    "Global diversity, population stratification, and selection of human copy number variation"

    It is also a good idea to run CHM1 or other genomes that have PacBio SV calls (this lets you calibrate accuracy thresholds).

  2. I can't speak for all programs, but calling SVs across many people usually increases both sensitivity and the false discovery rate. I like to call individuals separately, merge the calls, and then joint genotype. This ensures that each call has enough support within a single diploid genome, while joint genotyping mitigates missed calls. See my workflow.

  3. Most tools model insert size on a per-library basis, so you don't need to worry about it.
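To illustrate point 3: the per-library insert-size model that read-pair callers typically build amounts to estimating the mean and standard deviation of each library's insert sizes, then flagging pairs outside mean ± k·SD as discordant. This is a minimal sketch of the idea; the insert sizes and the cutoff k = 3 are illustrative, not taken from any specific caller:

```python
import statistics

def discordant_cutoffs(insert_sizes, k=3):
    """Per-library insert-size model: mean +/- k standard deviations."""
    mu = statistics.mean(insert_sizes)
    sd = statistics.stdev(insert_sizes)
    return mu - k * sd, mu + k * sd

def is_discordant(insert_size, lo, hi):
    """A read pair is discordant if its insert size falls outside the cutoffs."""
    return not (lo <= insert_size <= hi)

# Two libraries with different insert-size distributions each get their own
# cutoffs, which is why mixing 1000 Genomes BAMs with your own BAMs is
# generally safe for tools that model insert size per library.
lib_a = [300, 310, 290, 305, 295, 300, 310, 290]   # ~300 bp inserts
lib_b = [500, 520, 480, 510, 490, 500, 515, 485]   # ~500 bp inserts
lo_a, hi_a = discordant_cutoffs(lib_a)
lo_b, hi_b = discordant_cutoffs(lib_b)
```

A 500 bp pair is perfectly ordinary in library B but would be flagged as discordant under library A's cutoffs, which is exactly why the model must be per-library.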

Now for some unsolicited advice: I really recommend calling with multiple tools. We use LUMPY, WHAM, Genome STRiP, and Delly. Using multiple callers and merging the callsets will really help.


Many thanks Zev. I'm writing another post; my question there is not about controls but about the overall design of running multiple tools on multiple samples.


"Using multiple caller and merging callsets will really help" I'm wondering here by "merging" you mean overlap/intersect or do union? I guess do a union? But each caller alone will achieve high false positive, not to mention union...

Powered by Biostar version 2.3.0