Variants called differ dramatically between runs
1
1
Entering edit mode
7.8 years ago
Ram 44k

Hello everyone,

I'm comparing 3 sets of variants from 3 different NGS pipeline runs, where all 3 operate on a common set of 150 samples. I see a lot of difference in the variants found in each run.

The three runs are structured as follows:

  1. The 150 samples are run as a single batch
  2. The 150 samples are run as part of a larger cohort (total cohort size = 250 samples, say)
  3. The 150 samples are split into 5 batches of 30 samples each

For #2, GATK SelectVariants is used to extract variants found in the 150 samples. For #3, CombineVariants is used to combine the VCF files.

When I look at a venn diagram of these 3 sets, I see only around a 40% overlap in variants. For reference, 100% is the set of all unique variants discovered across all 3 runs.

To exclude pipeline quirks (AKA "this is the default behavior"), I compared 2 runs that were run 6 months apart on the exact same 25+ samples, and the variants discovered were identical. So, we can confidently say that only the cohort size difference could have caused this gap in variant discovery.

Can we discuss what could be the case here please? I wish to understand why I see 3/5 of the dataset not being called in at least one of the runs.

EDIT: These 3 runs are only computational NGS pipeline runs, they're not sequencing runs. In other words, I'm using the same BAM files across the board.

vcf variants • 2.1k views
ADD COMMENT
1
Entering edit mode

Are these samples barcoded somehow so you totally completely 100% don't have an extra 100 samples worth of variants in #2? That would be might first guess, if the QC from the samples otherwise looks normal (sequencing depth, coverage, etc)

ADD REPLY
1
Entering edit mode

I'm sorry, I should've mentioned - these 3 runs are only on the NGS pipeline. I'm using the same BAM files.

ADD REPLY
0
Entering edit mode

Are these exomes with tumor/normal samples ? There are multiple papers where they have compared analysis from different tools, such as GTAK, varscan, somaticSnipper, and so on. At lower threshold of filter, overlap is infact small, and oveelap number varies depending upon the filtering threshold. If you take top 50 variants (p-value based) from each tools, what is the overlap.

ADD REPLY
0
Entering edit mode

These are Whole Exome. The tool used for all runs is GATK's HaplotypeCaller, so I don't understand what you mean by from each tools.

ADD REPLY
0
Entering edit mode

Are the positions not called as variant in set A in set B/C reference call or no-call, mostly?

ADD REPLY
0
Entering edit mode

I'll check that and get back to you, @WouterDeCoster.

ADD REPLY
5
Entering edit mode
7.8 years ago
Steven Lakin ★ 1.8k

Correct me if I misunderstood your question, but it seems like you are splitting or combining samples into various sized groups and then running those through GATK.

From the supplemental material of GATK on haplotype calling:

Whereas the initial phase of the algorithm is run per sample, the second stage combines the genotype likelihoods over all samples in order to determine the most likely alternate allele frequency in the cohort. The likelihood for a given set of genotype assignments at a given frequency is simply the product of the genotype likelihoods for each sample given that sample’s assigned genotype (Box 1). We then apply a population genetic prior to the allele frequency likelihoods based on θ, the population specific heterozygosity, and choose the most likely allele frequency and associated genotype assignments. The variant quality score for a polymorphic call is given as -10*log10(probability that the site is actually monomorphic).

Most variant callers use some kind of Bayesian modeling for within-group variation, since being able to model within-group variation significantly increases the likelihood that the remaining variation is due to inter-group variation (usually treatment vs control, tumor vs normal in most experimental designs). Splitting the samples into artificial cohorts of sizes that are not true to the actual data will affect the results of this calculation substantially.

ADD COMMENT
0
Entering edit mode

Thank you. Just so I can learn more, could you point me to where I could find the supplemental material of GATK on haplotype calling please? I tried searching for a HC paper, but there doesn't seem to be one out yet.

EDIT: Is this the paper? http://www.nature.com/ng/journal/v43/n5/full/ng.806.html

ADD REPLY
1
Entering edit mode

Yep, that's the one. There is a free PDF version here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3083463/

The supplemental material is at the bottom; usually authors publish their math in the supplements.

ADD REPLY
0
Entering edit mode

Thank you! I do have institutional access, but I'm sure the free PDF will be useful for people who don't have the access.

ADD REPLY

Login before adding your answer.

Traffic: 850 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6