Question

Variants called differ dramatically between runs

1

Entering edit mode

7.3 years ago

Ram 43k

Hello everyone,

I'm comparing 3 sets of variants from 3 different NGS pipeline runs, where all 3 operate on a common set of 150 samples. I see a lot of difference in the variants found in each run.

The three runs are structured as follows:

The 150 samples are run as a single batch
The 150 samples are run as part of a larger cohort (total cohort size = 250 samples, say)
The 150 samples are split into 5 batches of 30 samples each

For #2, GATK SelectVariants is used to extract variants found in the 150 samples. For #3, CombineVariants is used to combine the VCF files.

When I look at a venn diagram of these 3 sets, I see only around a 40% overlap in variants. For reference, 100% is the set of all unique variants discovered across all 3 runs.

To exclude pipeline quirks (AKA "this is the default behavior"), I compared 2 runs that were run 6 months apart on the exact same 25+ samples, and the variants discovered were identical. So, we can confidently say that only the cohort size difference could have caused this gap in variant discovery.

Can we discuss what could be the case here please? I wish to understand why I see 3/5 of the dataset not being called in at least one of the runs.

EDIT: These 3 runs are only computational NGS pipeline runs, they're not sequencing runs. In other words, I'm using the same BAM files across the board.

vcf variants • 1.9k views

ADD COMMENT • link updated 7.3 years ago by Steven Lakin ★ 1.8k • written 7.3 years ago by Ram 43k

1

Entering edit mode

Are these samples barcoded somehow so you totally completely 100% don't have an extra 100 samples worth of variants in #2? That would be might first guess, if the QC from the samples otherwise looks normal (sequencing depth, coverage, etc)

ADD REPLY • link 7.3 years ago by John 13k

1

Entering edit mode

I'm sorry, I should've mentioned - these 3 runs are only on the NGS pipeline. I'm using the same BAM files.

ADD REPLY • link 7.3 years ago by Ram 43k

0

Entering edit mode

Are these exomes with tumor/normal samples ? There are multiple papers where they have compared analysis from different tools, such as GTAK, varscan, somaticSnipper, and so on. At lower threshold of filter, overlap is infact small, and oveelap number varies depending upon the filtering threshold. If you take top 50 variants (p-value based) from each tools, what is the overlap.

ADD REPLY • link 7.3 years ago by Chirag Nepal ★ 2.4k

0

Entering edit mode

These are Whole Exome. The tool used for all runs is GATK's HaplotypeCaller, so I don't understand what you mean by from each tools.

ADD REPLY • link 7.3 years ago by Ram 43k

0

Entering edit mode

Are the positions not called as variant in set A in set B/C reference call or no-call, mostly?

ADD REPLY • link 7.3 years ago by WouterDeCoster 47k

0

Entering edit mode

I'll check that and get back to you, @WouterDeCoster.

ADD REPLY • link 7.3 years ago by Ram 43k

score 5 · Accepted Answer · 2017-01-09

Correct me if I misunderstood your question, but it seems like you are splitting or combining samples into various sized groups and then running those through GATK.

From the supplemental material of GATK on haplotype calling:

Whereas the initial phase of the algorithm is run per sample, the second stage combines the genotype likelihoods over all samples in order to determine the most likely alternate allele frequency in the cohort. The likelihood for a given set of genotype assignments at a given frequency is simply the product of the genotype likelihoods for each sample given that sample’s assigned genotype (Box 1). We then apply a population genetic prior to the allele frequency likelihoods based on θ, the population specific heterozygosity, and choose the most likely allele frequency and associated genotype assignments. The variant quality score for a polymorphic call is given as -10*log10(probability that the site is actually monomorphic).

Most variant callers use some kind of Bayesian modeling for within-group variation, since being able to model within-group variation significantly increases the likelihood that the remaining variation is due to inter-group variation (usually treatment vs control, tumor vs normal in most experimental designs). Splitting the samples into artificial cohorts of sizes that are not true to the actual data will affect the results of this calculation substantially.