Periodicity in SNP calling quality - is this normal?
Entering edit mode
8.0 years ago
John 13k

Hello all!

After calling SNPs on some RNA data to produce a gVCF file, I ran a little python script to see the distribution of calling quality across the different genotypes:

  • x-axis is quality score rounded to the nearest integer
  • y-axis is the number of variants at that quality score

As you can see, its mostly heterozygous variants, which is what I expect since this data comes from highly inbred mice. What I didn't expect however is the periodicity. Is that normal?

Now I presumably I need to filter these variants on some number of quality score, and from this I really dont know where. 0? 50? 75?

Code to generate this data:

#!/usr/bin/env python2.7
import collections
with open('/home/john/overnight/outputs/ctrl_all_FVB.gvcf', 'rb') as f:
    data = {}
    for line in f:
        if line[0] == '#': continue
        line = line.split('\t')
        if line[5] == '.': continue
        gt = line[9][:3]
        try: data[gt][int(float(line[5]))] += 1
        except KeyError: data[gt] = collections.defaultdict(int)
for gt,qualities in data.items():
    print '\n',gt
    for qual,count in sorted(qualities.items()):
        print qual,count


The Dev team at GATK say this is normal and not something to worry about :)

Apparently one should not filter a gvcf directly, but rather call variants first using GenotypeGVCFs, then filter in the more standard way on that output. GATK forum post:

RNA-Seq VCF GATK SNP • 1.8k views
Entering edit mode

Interesting observation, I've actually seen this exact thing before...also with GATK's haplotype caller, though I don't recall if I was looking a variants from RNAseq or DNAseq reads. I've sort of assumed that this is some oddity of the algorithm.

Entering edit mode

Yeah I guess so - I'll go ask over at the GATK forums since this is probably something to do with their caller and not, hopefully, my data :P ;)

I will update this thread if I hear anything from them about this, and update the other thread (about gvcf calling for RNA data) once I've got a working pipeline!


Login before adding your answer.

Traffic: 1226 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6