Question: Periodicity in SNP calling quality - is this normal?
gravatar for John
5.1 years ago by
John12k wrote:

Hello all!

After calling SNPs on some RNA data to produce a gVCF file, I ran a little python script to see the distribution of calling quality across the different genotypes:

- x-axis is quality score rounded to the nearest integer
- y-axis is the number of variants at that quality score

As you can see, its mostly heterozygous variants, which is what I expect since this data comes from highly inbred mice.
What i didn't expect however is the periodicity. Is that normal?
Now I presumably I need to filter these variants on some number of quality score, and from this I really dont know where. 0? 50? 75?

Code to generate this data:

#!/usr/bin/env python2.7
import collections
with open('/home/john/overnight/outputs/ctrl_all_FVB.gvcf', 'rb') as f:
    data = {}
    for line in f:
        if line[0] == '#': continue
        line = line.split('\t')
        if line[5] == '.': continue
        gt = line[9][:3]
        try: data[gt][int(float(line[5]))] += 1
        except KeyError: data[gt] = collections.defaultdict(int)
for gt,qualities in data.items():
    print '\n',gt
    for qual,count in sorted(qualities.items()):
        print qual,count



The Dev team at GATK say this is normal and not something to worry about :)
Apparently one should not filter a gvcf directly, but rather call variants first using GenotypeGVCFs, then filter in the more standard way on that output. GATK forum post

rna-seq snp gatk vcf • 1.4k views
ADD COMMENTlink modified 5.1 years ago • written 5.1 years ago by John12k

Interesting observation, I've actually seen this exact thing before...also with GATK's haplotype caller, though I don't recall if I was looking a variants from RNAseq or DNAseq reads. I've sort of assumed that this is some oddity of the algorithm.

ADD REPLYlink written 5.1 years ago by Devon Ryan97k

Yeah I guess so - i'll go ask over at the GATK forums since this is probably something to do with their caller and not, hopefully, my data :P ;)
I will update this thread if I hear anything from them about this, and update the other thread (about gvcf calling for RNA data) once i've got a working pipeline! 

ADD REPLYlink written 5.1 years ago by John12k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1080 users visited in the last hour