Question: Why am I getting fewer variants with more samples?
Ace wrote:

I'm working on some SNP analyses using GATK HaplotypeCaller. In my initial test, with just 35 samples, I was getting over 350,000 SNPs. However, when I added more samples, the count dropped. With my entire set of over 200 samples, it's barely over 1,000, though if I cut about half of those out it's closer to 10,000. The samples are a conglomeration from 4 different data sets, but my original data set already included the two most different collection practices. I can't find anything in the documentation that seems to explain why this would happen, and I've tried GVCF mode and got similar results. I'm imagining it's some outcome of the variant calling method, but I want to make sure. Could anyone explain what could be causing this?

snp gatk variant calling • 279 views
written 12 months ago by Ace

You should start a thread on the GATK Forums and link to it here. I recall exploring this topic a few years ago, when we discovered that the batch size affected the number of variants called, and it made us doubt the n+1 logic that GVCF files enable a little.

Eliminate all the filters (if you have any) in your pipeline and if the results differ even then, start a thread on GATK Forums
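
To compare runs with and without filters, it can help to tally SNP records by their FILTER value. Below is a minimal sketch, assuming a standard tab-delimited VCF; the function names are made up for illustration, and "SNP" here just means a single-base REF with at least one single-base ALT allele.

```python
from collections import Counter

BASES = {"A", "C", "G", "T"}

def is_snp(ref, alt):
    """True if REF is one base and at least one ALT allele is one base."""
    return ref.upper() in BASES and any(a.upper() in BASES for a in alt.split(","))

def count_snps_by_filter(lines):
    """Count SNP records per FILTER value from an iterable of VCF lines."""
    counts = Counter()
    for line in lines:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # VCF columns: CHROM POS ID REF ALT QUAL FILTER INFO ...
        ref, alt, flt = fields[3], fields[4], fields[6]
        if is_snp(ref, alt):
            counts[flt] += 1
    return counts
```

For a real file you would pass an open text handle, e.g. `count_snps_by_filter(open("cohort.vcf"))`, where `cohort.vcf` is a placeholder name. Running it on the 35-sample and 200-sample outputs shows whether the drop is already there in unfiltered records.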

written 12 months ago by RamRS

Sounds like a good idea. This is before filtration but I have a couple more things I'm trying this evening and if they don't pan out I'll post on GATK tomorrow and report back. In the meantime, do we still have an archive of the other discussion? I'd love to see what logic people came up with.

written 12 months ago by Ace

I don't recall initiating a conversation on GATK Forums, unfortunately. We were collaborating with Broad and it was easier to speak to my colleague who worked for both Broad and my team. There were a few emails exchanged but I don't have a record of them now. I'm sorry!

written 12 months ago by RamRS

As an update, I did post this on the GATK forum: https://gatkforums.broadinstitute.org/gatk/discussion/24163/larger-sample-sizes-are-reducing-snps-dramatically

However, looking through some of the VCF files, I noticed that the caller is only registering chromosome 1, so my new challenge is tracking that down.
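
One quick way to confirm which chromosomes actually have records is to compare the contigs declared in the VCF header against the ones that appear in the body. This is only a rough sketch with a made-up function name, and it assumes the header carries `##contig=<ID=...>` lines. It shows where records are missing, not why.

```python
from collections import Counter

def contig_summary(lines):
    """Return (records per contig, declared contigs with no records)."""
    declared = []
    seen = Counter()
    for line in lines:
        if line.startswith("##contig=<"):
            # Header line such as: ##contig=<ID=1,length=248956422>
            body = line.strip()[len("##contig=<"):-1]
            for item in body.split(","):
                if item.startswith("ID="):
                    declared.append(item[3:])
        elif line.startswith("#") or not line.strip():
            # Other header lines and blank lines carry no records.
            continue
        else:
            # First tab-delimited column of a record is CHROM.
            seen[line.split("\t", 1)[0]] += 1
    missing = [c for c in declared if c not in seen]
    return seen, missing
```

If `missing` lists every contig except chromosome 1, the records were never written for the other contigs, which points at the calling step or its inputs rather than at later filtering. Note that in a GVCF the per-contig counts include reference blocks as well as variant sites.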

written 12 months ago by Ace