Question: Why am I getting fewer variants with more samples?
Ace wrote (3 months ago):

I'm working on some SNP analyses using GATK HaplotypeCaller. In my initial test, with just 35 samples, I was getting over 350,000 SNPs. However, when I added more samples, the number dropped. With my entire set of over 200 samples it's barely over 1,000, though if I cut about half of those out it's closer to 10,000. The samples are a conglomeration from 4 different data sets, but my original test set already included two of the most different collection practices. I can't find anything in the documentation that explains why this would happen, and I got similar results when I tried GVCF mode. I imagine it's some outcome of the variant calling method, but I want to make sure. Could anyone explain what could be causing this?

Tags: snp, gatk, variant calling

You should start a thread on the GATK Forums and link to it here. I recall exploring this topic a few years ago, when we discovered that the batch size affected the number of variants called, which made us doubt somewhat the n+1 logic that GVCF files enable.

Remove all the filters (if you have any) from your pipeline. If the results still differ, start a thread on the GATK Forums.

(reply by RamRS, written 3 months ago)
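
One quick way to check whether filters play any part, along the lines of the suggestion above, is to tally the FILTER column of each callset. This is only a minimal sketch and not part of GATK: it assumes a plain-text or gzip-compressed (`.gz`) VCF, and the function name `tally_filter_column` is made up for illustration.

```python
import gzip
from collections import Counter


def tally_filter_column(path):
    """Count VCF records by the value of their FILTER column.

    Assumes a tab-delimited VCF; files ending in .gz are read with gzip.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    tally = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            # Skip header lines (## meta lines and the #CHROM line) and blanks.
            if line.startswith("#") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            # Fixed VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO
            if len(fields) < 8:
                continue  # malformed record, ignored in this sketch
            tally[fields[6]] += 1
    return tally
```

In the VCF format a FILTER value of `.` means no filters were applied to that record, `PASS` means it passed all of them, and anything else names the filters it failed. If every record shows `.`, the drop in variants is happening before any filtering step.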

Sounds like a good idea. This is before filtration, but I have a couple more things I'm trying this evening, and if they don't pan out I'll post on the GATK forum tomorrow and report back. In the meantime, do we still have an archive of the other discussion? I'd love to see what logic people came up with.

(reply by Ace, written 3 months ago)

I don't recall initiating a conversation on GATK Forums, unfortunately. We were collaborating with Broad and it was easier to speak to my colleague who worked for both Broad and my team. There were a few emails exchanged but I don't have a record of them now. I'm sorry!

(reply by RamRS, written 3 months ago)

As an update, I did post this on the GATK forum: https://gatkforums.broadinstitute.org/gatk/discussion/24163/larger-sample-sizes-are-reducing-snps-dramatically

However, while looking through some of the VCF files, I noticed that the caller is only registering chromosome 1, so my new challenge is tracking that down.

(reply by Ace, written 3 months ago)
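
To confirm the chromosome 1 observation across many files, it can help to compare the contigs declared in each VCF header with the contigs that actually have records. The following is a rough sketch under a few assumptions: the files are plain-text or gzip-compressed (`.gz`) VCFs, the header carries `##contig` lines, and the function name `count_records_per_contig` is hypothetical rather than anything from GATK.

```python
import gzip
import re
from collections import Counter

# Pulls the ID out of a header line such as ##contig=<ID=1,length=1000>
CONTIG_ID = re.compile(r"[<,]ID=([^,>]+)")


def count_records_per_contig(path):
    """Return (contigs declared in the header, Counter of records per CHROM)."""
    opener = gzip.open if str(path).endswith(".gz") else open
    declared = []
    counts = Counter()
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("##contig="):
                match = CONTIG_ID.search(line)
                if match:
                    declared.append(match.group(1))
            elif line.startswith("#") or not line.strip():
                continue  # other header lines and blank lines
            else:
                # CHROM is the first tab-delimited column of a record.
                counts[line.split("\t", 1)[0]] += 1
    return declared, counts
```

Calling `declared, counts = count_records_per_contig("cohort.vcf.gz")` (the file name is a placeholder) and listing the entries of `declared` that are missing from `counts` shows which contigs have no calls at all. If only chromosome 1 has records in the per-sample files, the cause is more likely upstream of joint genotyping, for example in the intervals passed to the caller or in the input BAMs, though that is a guess to verify rather than a diagnosis.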