Choosing controls for structural variation detection
michealsmith ▴ 790 • 8.2 years ago

I'm trying to detect structural variation (SV) from NGS data, specifically to find novel or rare SVs in disease samples. My research targets are common neurological diseases, not cancer, so there is no perfectly matched control to remove background. Because read mapping and SV calling are noisy, I decided to include controls from the 1000 Genomes Project when running SV callers, to remove as much background noise as possible while looking for rare or novel SVs.

I know the 1000 Genomes Project provides a list of high-confidence SVs, but that list went through stringent filtering and leaves many complex SVs undetected; if I use it to filter for rare/novel variants, there will be many false positives.

Questions:

  1. How many controls should I use? Ideally, the more the better? I have 20 patient whole-genome sequences to run. Given the BAM file sizes, I first tried only 10 CEU low-coverage WGS samples from 1000 Genomes.
  2. Many programs, like BreakDancer or Pindel, support running multiple input files. Do these programs apply their statistics jointly across all files, or do they still process each file separately and merge the per-file results?
  3. The control BAM files from 1000 Genomes may have different insert sizes and may be mapped to a different version of the reference (hg19 vs. g1k_37). Would that matter when I include these BAMs together with my patient BAMs for SV calling? (A rough per-BAM check is sketched below.)
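
For question 3, here is a minimal sketch of how I could compare insert-size distributions across the BAMs before calling. It assumes pysam is installed, and the file names are placeholders, not my actual samples:

```python
# Rough sketch: compare insert-size distributions between patient
# and 1000 Genomes control BAMs by sampling template lengths.
# File names below are placeholders.
import statistics
import pysam

def insert_size_summary(bam_path, max_reads=100_000):
    """Sample absolute template lengths from properly paired first reads."""
    sizes = []
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            if read.is_proper_pair and read.is_read1 and read.template_length:
                sizes.append(abs(read.template_length))
                if len(sizes) >= max_reads:
                    break
    return statistics.mean(sizes), statistics.stdev(sizes)

for path in ["patient1.bam", "NA12878.low_coverage.bam"]:
    mean, sd = insert_size_summary(path)
    print(f"{path}: mean insert {mean:.0f} bp, sd {sd:.0f} bp")
```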
NGS • SV • 1000 genome project
8.2 years ago
  1. More is better. Check if you can find the Human Genome Diversity Project data.

    "Global diversity, population stratification, and selection of human copy number variation"

    It is also a good idea to run CHM1 or other genomes that have PacBio SV calls (this lets you threshold for accuracy).

  2. I can't speak for all programs, but calling SVs across many people jointly usually increases both sensitivity and the false discovery rate. I like to call individuals separately, merge the calls, and then joint-genotype. This ensures a single call has enough support within a single diploid genome, while joint genotyping mitigates missed calls (a toy sketch of this call-merge-genotype pattern follows this list). See my workflow.

  3. Most tools model insert size on a per-library basis, so you don't need to worry about it.
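
To make the call-merge-genotype pattern in (2) concrete, here is a toy sketch of the merge step: cluster per-sample calls whose breakpoints agree within a window, then carry the merged sites into joint genotyping. The 50 bp window and the tuple format are illustrative assumptions, not what any particular tool does; in practice you'd use something like svtools or SURVIVOR for this step.

```python
# Toy sketch of the "call separately, then merge" step: cluster
# per-sample deletion calls whose breakpoints agree within a window.
# The 50 bp tolerance and the (chrom, start, end) tuples are
# illustrative assumptions, not any specific tool's behavior.
WINDOW = 50  # max breakpoint distance (bp) to treat calls as the same SV

def same_sv(a, b):
    """Two calls match if the chromosome agrees and both breakpoints are close."""
    return (a[0] == b[0]
            and abs(a[1] - b[1]) <= WINDOW
            and abs(a[2] - b[2]) <= WINDOW)

def merge_calls(per_sample_calls):
    """Union per-sample calls into merged sites, tracking supporting samples."""
    merged = []  # list of (representative_call, {supporting samples})
    for sample, calls in per_sample_calls.items():
        for call in calls:
            for rep, supporters in merged:
                if same_sv(rep, call):
                    supporters.add(sample)
                    break
            else:
                merged.append((call, {sample}))
    return merged

per_sample = {
    "patient1": [("chr2", 100_000, 105_000)],
    "patient2": [("chr2", 100_020, 104_990)],  # same event, jittered breakpoints
}
for site, supporters in merge_calls(per_sample):
    # Merged sites would then be re-genotyped jointly across all samples.
    print(site, "supported by", sorted(supporters))
```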

Now for some unsolicited advice: I really recommend calling with multiple tools. We use LUMPY, WHAM, Genome STRiP, and Delly. Using multiple callers and merging the callsets will really help.
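
As a sketch of what "merging" can mean here: keep the union of all callsets, but annotate each site with how many callers support it, then filter on that support. The coordinates and the 50 bp matching window below are made up for illustration:

```python
# Sketch: union of callsets from several tools, annotated with
# caller support, then filtered to sites seen by >= 2 callers.
# Callsets and the 50 bp matching window are illustrative.
WINDOW = 50

def same_sv(a, b):
    return a[0] == b[0] and abs(a[1] - b[1]) <= WINDOW and abs(a[2] - b[2]) <= WINDOW

callsets = {
    "lumpy": [("chr5", 1_200_000, 1_204_000), ("chr7", 300_000, 300_800)],
    "delly": [("chr5", 1_200_030, 1_203_960)],
    "wham":  [("chr5", 1_199_990, 1_204_040), ("chrX", 50_000, 51_000)],
}

support = []  # (representative_site, {callers that found it})
for caller, calls in callsets.items():
    for call in calls:
        for site, callers in support:
            if same_sv(site, call):
                callers.add(caller)
                break
        else:
            support.append((call, {caller}))

# "Merging" here means keeping the union but requiring corroboration:
for site, callers in support:
    if len(callers) >= 2:
        print(site, "called by", sorted(callers))
```

Requiring two or more supporting callers is a common middle ground between a strict intersection (too conservative) and a raw union (too noisy).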


Many thanks Zev. I'm writing another post; actually my question is not about controls but about the whole design of running multiple tools on multiple samples.


"Using multiple caller and merging callsets will really help" I'm wondering here by "merging" you mean overlap/intersect or do union? I guess do a union? But each caller alone will achieve high false positive, not to mention union...

