Question

HaplotypeCaller output - All records are non-variant - Have I gone wrong somewhere?

0

Entering edit mode

5.9 years ago

BioinfGuru ★ 1.7k

Hello everyone,

Following the GATK germline short variant discovery instructions, I have written a script that runs the pipeline from fastq pre-processing through to creating a gvcf with HaplotypeCaller. On inspection I have noticed that my GVCF:

1) Does not have a ##reference entry in the header even though i did pass a reference in the command

--reference /NGS/musRefs_10/gatk/ref/ensembl92/Mus_musculus.GRCm38.dna.toplevel.fa

2) Every record except 1 contains "END=" for example:

9 36758659 . C NON_REF . . END=36758689 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,38

The 1 record does not contain "END=":

"9 36758714 rs45725377 T C,<non_ref> 22.58 . DB;DP=2;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=7200.00 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:49,6,0,49,6,49:0,0,1,1"

I have read What is a GVCF and how is it different from a 'regular' VCF? which states:

The second thing to look for is the END tag in the INFO field of non-variant block records.

Is it normal that every entry except for 1 is a non-variant block record? No errors are being returned. The only thing I can think of is that I ran the script on too small a test subset of the whole data. The test data is 2 fastq files of 10,000 lines.

I'm really not sure where I went wrong here. I appreciate any guidance.

Regards, Kenneth

SNP haplotypecaller NON_REF gatk • 2.4k views

ADD COMMENT • link 5.9 years ago by BioinfGuru ★ 1.7k

score 2 · Accepted Answer · 2018-06-07

2

Entering edit mode

5.9 years ago

finswimmer 16k

Hello YaGalbi,

10,000 lines in a fastq file means you have 2500 reads per file. That is a very small dataset. Depending on the library prep this means you have a lot of non covered regions or just a very small region of interest. For both variants it is not realy surprising that you just found one position with a variant.

fin swimmer

ADD COMMENT • link 5.9 years ago by finswimmer 16k

0

Entering edit mode

Ok finswimmer...thanks for that. It is a whole exome dataset and I use gene coords as the intervals option.

I just created a larger subset of 1,000,000 lines and that should be done in a few hours so hopefully. Ill see a proportional increase.

EDIT: That didnt take as long as I thought - 30min (nice optimizing!). 6142 variants returned from 250000 reads. I think there is no problem.

ADD REPLY • link 5.9 years ago by BioinfGuru ★ 1.7k