Question: HaplotypeCaller output - All records are non-variant - Have I gone wrong somewhere?
gravatar for YaGalbi
9 months ago by
Biocomputing, MRC Harwell Institute, Oxford, UK
YaGalbi1.4k wrote:

Hello everyone,

Following the GATK germline short variant discovery instructions, I have written a script that runs the pipeline from fastq pre-processing through to creating a gvcf with HaplotypeCaller. On inspection I have noticed that my GVCF:

1) Does not have a ##reference entry in the header even though i did pass a reference in the command

--reference /NGS/musRefs_10/gatk/ref/ensembl92/Mus_musculus.GRCm38.dna.toplevel.fa

2) Every record except 1 contains "END=" for example:

9 36758659 . C NON_REF . . END=36758689 GT:DP:GQ:MIN_DP:PL 0/0:1:3:1:0,3,38

The 1 record does not contain "END=":

"9 36758714 rs45725377 T C,<non_ref> 22.58 . DB;DP=2;ExcessHet=3.0103;MLEAC=2,0;MLEAF=1.00,0.00;RAW_MQ=7200.00 GT:AD:DP:GQ:PL:SB 1/1:0,2,0:2:6:49,6,0,49,6,49:0,0,1,1"

I have read What is a GVCF and how is it different from a 'regular' VCF? which states:

The second thing to look for is the END tag in the INFO field of non-variant block records.

Is it normal that every entry except for 1 is a non-variant block record? No errors are being returned. The only thing I can think of is that I ran the script on too small a test subset of the whole data. The test data is 2 fastq files of 10,000 lines.

I'm really not sure where I went wrong here. I appreciate any guidance.

Regards, Kenneth

ADD COMMENTlink modified 9 months ago • written 9 months ago by YaGalbi1.4k
gravatar for finswimmer
9 months ago by
finswimmer11k wrote:

Hello YaGalbi,

10,000 lines in a fastq file means you have 2500 reads per file. That is a very small dataset. Depending on the library prep this means you have a lot of non covered regions or just a very small region of interest. For both variants it is not realy surprising that you just found one position with a variant.

fin swimmer

ADD COMMENTlink written 9 months ago by finswimmer11k

Ok finswimmer...thanks for that. It is a whole exome dataset and I use gene coords as the intervals option.

I just created a larger subset of 1,000,000 lines and that should be done in a few hours so hopefully. Ill see a proportional increase.

EDIT: That didnt take as long as I thought - 30min (nice optimizing!). 6142 variants returned from 250000 reads. I think there is no problem.

ADD REPLYlink modified 9 months ago • written 9 months ago by YaGalbi1.4k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1191 users visited in the last hour