Question: Fastq to VCF pipeline configuration for low coverage data
16 months ago by
igor0 wrote:


I have a low-coverage (1.0x) FASTQ sample.

I am following the GATK Best Practices pipeline (bwa mem > MarkDuplicates > BQSR > ApplyBQSR > HaplotypeCaller) to produce a VCF from the FASTQ.
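For readers who want to see the stages spelled out, here is a minimal sketch that only builds the command strings for each step and does not run anything. The file names, the sample name, the single-end FASTQ input, and the `PL:ILLUMINA` read-group tag are all assumptions made for illustration, not details from the post; indexing steps and any extra options are omitted.

```python
from typing import List


def build_pipeline_commands(
    fastq: str, reference: str, known_sites: str, sample: str
) -> List[str]:
    """Return one shell command string per pipeline stage (nothing is executed).

    All paths and the sample name are hypothetical placeholders.
    """
    # Read group passed to bwa; the platform tag is an assumption.
    read_group = f"@RG\\tID:{sample}\\tSM:{sample}\\tPL:ILLUMINA"
    return [
        # 1. Align reads and coordinate-sort the output.
        f"bwa mem -R '{read_group}' {reference} {fastq} "
        f"| samtools sort -o {sample}.sorted.bam -",
        # 2. Mark duplicate reads.
        f"gatk MarkDuplicates -I {sample}.sorted.bam "
        f"-O {sample}.dedup.bam -M {sample}.dup_metrics.txt",
        # 3. BQSR: build the recalibration table from known variant sites.
        f"gatk BaseRecalibrator -R {reference} -I {sample}.dedup.bam "
        f"--known-sites {known_sites} -O {sample}.recal.table",
        # 4. Apply the recalibration.
        f"gatk ApplyBQSR -R {reference} -I {sample}.dedup.bam "
        f"--bqsr-recal-file {sample}.recal.table -O {sample}.recal.bam",
        # 5. Call variants.
        f"gatk HaplotypeCaller -R {reference} -I {sample}.recal.bam "
        f"-O {sample}.vcf.gz",
    ]


if __name__ == "__main__":
    for cmd in build_pipeline_commands(
        "sample.fastq", "ref.fa", "known_sites.vcf.gz", "sample1"
    ):
        print(cmd)
```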

When I compare the resulting VCF with the ground truth, it performs well on homozygous variants but terribly on heterozygous ones (which account for 99% of the mismatches).
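One way such a breakdown could be produced is sketched below. It assumes the genotypes have already been extracted from both VCFs into dictionaries keyed by (chromosome, position); the function names and the toy data are hypothetical, and this is not necessarily how the comparison in the post was done.

```python
from collections import Counter
from typing import Dict, List, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def alleles(gt: str) -> List[str]:
    """Split a GT string such as '0/1' or '1|0' into sorted allele codes."""
    return sorted(gt.replace("|", "/").split("/"))


def zygosity(gt: str) -> str:
    """Classify a diploid genotype as hom_ref, het, hom_alt, or missing."""
    codes = alleles(gt)
    if "." in codes:
        return "missing"
    if all(code == "0" for code in codes):
        return "hom_ref"
    return "hom_alt" if len(set(codes)) == 1 else "het"


def mismatches_by_truth_class(
    truth: Dict[Site, str], calls: Dict[Site, str]
) -> Counter:
    """Count truth sites whose called genotype differs, grouped by truth zygosity.

    Sites absent from the call set are treated as missing ('./.').
    Phasing is ignored, so '0|1' and '1/0' count as the same genotype.
    """
    counts: Counter = Counter()
    for site, true_gt in truth.items():
        called_gt = calls.get(site, "./.")
        if alleles(true_gt) != alleles(called_gt):
            counts[zygosity(true_gt)] += 1
    return counts


if __name__ == "__main__":
    # Toy data for illustration only.
    truth = {("chr1", 100): "0/1", ("chr1", 200): "1/1", ("chr1", 300): "0|1"}
    calls = {("chr1", 100): "1/1", ("chr1", 200): "1/1", ("chr1", 300): "1/0"}
    print(mismatches_by_truth_class(truth, calls))
```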

(I also tried DeepVariant and obtained similar results.)

How should I modify the pipeline for a low-coverage sample? (Is extra work required on the .bam file, or some special configuration of HaplotypeCaller?)


written 16 months ago by igor0

Well, how would you identify a heterozygous variant with a 1x sample? In theory you need at least two reads at a site (one per allele) to see both alleles, and even that is not sufficient for a confident call. This is an inherent problem with ultra-low coverage, which is not a good choice for this kind of analysis.

written 16 months ago by ATpoint46k
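To put rough numbers on the comment above, here is a small sketch under a simplifying assumption that is not from the thread: per-site read depth is modeled as Poisson with mean equal to the average coverage, and each read at a heterozygous site samples either allele with probability 0.5. Real depth distributions are usually more dispersed, so treat these figures as optimistic.

```python
import math


def prob_depth_at_least(k: int, mean_cov: float) -> float:
    """P(depth >= k) at a site when depth ~ Poisson(mean_cov)."""
    below = sum(
        math.exp(-mean_cov) * mean_cov**i / math.factorial(i) for i in range(k)
    )
    return 1.0 - below


def prob_both_alleles_seen(mean_cov: float) -> float:
    """P(each allele of a het site is covered by at least one read).

    With 50/50 allele sampling, the read count per allele is an
    independent Poisson(mean_cov / 2), so the probability is the square
    of P(at least one read) for a single allele.
    """
    return (1.0 - math.exp(-mean_cov / 2.0)) ** 2


if __name__ == "__main__":
    for cov in (1.0, 5.0, 30.0):
        print(
            f"{cov:>4.0f}x  P(depth>=2)={prob_depth_at_least(2, cov):.3f}  "
            f"P(both alleles seen)={prob_both_alleles_seen(cov):.3f}"
        )
```

Under this model, at 1x only about 26% of sites have two or more reads, and only about 15% of heterozygous sites have both alleles observed even once, before sequencing errors are considered.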

So I should just stick with the homozygous calls and cut out the heterozygous ones? I am aware that reads need to overlap at each region, especially if you want to identify heterozygous variants. But I assumed/hoped that HaplotypeCaller (or a similar tool) might be able to infer a variant even when no read supports it, based on similar haplotypes contained in a pretrained model or the like. Or is low-coverage data only useful for identifying homozygous variants, after which we infer what we can by imputation?

written 16 months ago by igor0

I have to make clear right away that I am no expert on variant calling, and others might correct me, but 1) I do not see how you could make reliable calls in the absence of information, and 2) I would not trust any calls from 1x samples. I guess you could take whatever calls you get, but you would then need to verify them by resequencing or Sanger sequencing anyway.

written 16 months ago by ATpoint46k

Your heterozygous calls are only valid for loci where reads overlap, and with single-pass sequencing depth you have no reliable evidence for judging heterozygous loci. If you're working with the human genome, you can use the data for imputation, but I am not sure whether 1x depth is sufficient for imputing haplotypes!

written 16 months ago by reza.jabal440
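The point about single-pass depth can be illustrated with a textbook-style diploid genotype likelihood. This is a simplified sketch, not the actual model used by HaplotypeCaller or DeepVariant; the flat prior over genotypes and the 1% per-read error rate are assumptions chosen for illustration.

```python
from typing import Dict, Sequence


def genotype_posteriors(
    reads: Sequence[str], error_rate: float = 0.01
) -> Dict[str, float]:
    """Posterior over diploid genotypes given reads labeled 'ref' or 'alt'.

    Assumes independent reads, a flat prior over the three genotypes,
    and a symmetric per-read error rate (all illustrative assumptions).
    """
    # Probability that a single read shows the alt allele under each genotype.
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1.0 - error_rate}
    likelihoods = {}
    for genotype, p in p_alt.items():
        lik = 1.0
        for allele in reads:
            lik *= p if allele == "alt" else 1.0 - p
        likelihoods[genotype] = lik
    total = sum(likelihoods.values())
    return {genotype: lik / total for genotype, lik in likelihoods.items()}


if __name__ == "__main__":
    print("one alt read:    ", genotype_posteriors(["alt"]))
    print("one alt, one ref:", genotype_posteriors(["alt", "ref"]))
```

With a single alt read, this model favors homozygous-alt over heterozygous by roughly 2 to 1, so a true heterozygous site covered by one read tends to be called homozygous (or missed entirely when the read carries the reference allele). Only when both alleles are observed does the heterozygous genotype dominate. Imputation methods for low-coverage data typically start from per-site likelihoods of this kind rather than from hard calls.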