Question: HaplotypeCaller output - All records are non-variant - Have I gone wrong somewhere?
gravatar for YaGalbi
22 months ago by
Biocomputing, MRC Harwell Institute, Oxford, UK
YaGalbi1.5k wrote:

Hello everyone,

Following the GATK germline short variant discovery instructions, I have written a script that runs the pipeline from fastq pre-processing through to creating a gvcf with HaplotypeCaller. On inspection I have noticed that my GVCF:

1) Does not have a ##reference entry in the header even though i did pass a reference in the command

--reference /NGS/musRefs_10/gatk/ref/ensembl92/Mus_musculus.GRCm38.dna.toplevel.fa

2) Every record except 1 contains "END=" for example:

9 36758659 . C NON_REF . . END=36758689 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,38

The 1 record does not contain "END=":

"9 36758714 rs45725377 T C,<non_ref> 22.58 . DB;DP=2;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=7200.00 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:49,6,0,49,6,49:0,0,1,1"

I have read What is a GVCF and how is it different from a 'regular' VCF? which states:

The second thing to look for is the END tag in the INFO field of non-variant block records.

Is it normal that every entry except for 1 is a non-variant block record? No errors are being returned. The only thing I can think of is that I ran the script on too small a test subset of the whole data. The test data is 2 fastq files of 10,000 lines.

I'm really not sure where I went wrong here. I appreciate any guidance.

Regards, Kenneth

ADD COMMENTlink modified 22 months ago • written 22 months ago by YaGalbi1.5k
gravatar for finswimmer
22 months ago by
finswimmer13k wrote:

Hello YaGalbi,

10,000 lines in a fastq file means you have 2500 reads per file. That is a very small dataset. Depending on the library prep this means you have a lot of non covered regions or just a very small region of interest. For both variants it is not realy surprising that you just found one position with a variant.

fin swimmer

ADD COMMENTlink written 22 months ago by finswimmer13k

Ok finswimmer...thanks for that. It is a whole exome dataset and I use gene coords as the intervals option.

I just created a larger subset of 1,000,000 lines and that should be done in a few hours so hopefully. Ill see a proportional increase.

EDIT: That didnt take as long as I thought - 30min (nice optimizing!). 6142 variants returned from 250000 reads. I think there is no problem.

ADD REPLYlink modified 22 months ago • written 22 months ago by YaGalbi1.5k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 743 users visited in the last hour