3.4 years ago
igor • 0
I have a low-coverage (1.0x) FASTQ sample.
I am following the GATK Best Practices pipeline (bwa mem > MarkDuplicates > BQSR > ApplyBQSR > HaplotypeCaller) to produce a VCF from the FASTQ.
When I compare the resulting VCF with the ground truth, it performs well on homozygous variants but terribly on heterozygous ones (which account for 99% of the mismatches).
(I also tried DeepVariant and obtained similar results.)
How should I modify the pipeline for a low-coverage sample? (Is extra work required on the .bam file, or some special configuration of HaplotypeCaller?)
Well, how would you identify a heterozygous variant with a 1x sample? In theory you need at least two reads covering the site (one per allele), and even that is not necessarily sufficient. This is a fundamental problem with ultra-low coverage, which is not a good choice for this kind of analysis.
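To put rough numbers on this, here is a minimal sketch of how often both alleles of a heterozygous site would even be observed at a given mean depth. It assumes an idealized model that is not part of the original post: per-site depth is Poisson-distributed, each read samples either allele with probability 0.5, and there are no sequencing or mapping errors. The function names are made up for illustration.

```python
import math

def prob_depth_at_least(k: int, mean_depth: float) -> float:
    """P(depth >= k) at a site, assuming depth ~ Poisson(mean_depth)."""
    below = sum(
        math.exp(-mean_depth) * mean_depth ** i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - below

def prob_both_alleles_seen(mean_depth: float) -> float:
    """P(at least one ref read AND at least one alt read) at a het site.

    With Poisson depth and 50/50 allele sampling, the ref and alt read
    counts are independent Poisson(mean_depth / 2) variables.
    """
    p_allele_seen = 1.0 - math.exp(-mean_depth / 2.0)
    return p_allele_seen ** 2

if __name__ == "__main__":
    for depth in (1, 5, 15, 30):
        print(
            f"mean depth {depth:>2}x: "
            f"P(depth >= 2) = {prob_depth_at_least(2, depth):.3f}, "
            f"P(both alleles seen) = {prob_both_alleles_seen(depth):.3f}"
        )
```

Under these assumptions, at 1x only about a quarter of sites have two or more reads, and both alleles of a heterozygous site are observed at roughly 15% of such sites. Real data tends to be less uniform than Poisson, so the practical numbers are likely worse.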
So I should just stick with the homozygous variants and cut out the heterozygous ones? I am aware that reads need to overlap in each region, especially if you want to identify heterozygous variants. But I assumed/hoped that HaplotypeCaller (or a similar tool) might be able to infer the variant even when the read is not present, based on similar haplotypes contained in a pretrained model or the like. Or is low-coverage data only useful for identifying homozygous variants, with the rest inferred as far as possible by imputation?
I have to make clear right away that I am no expert on variant calling, and others might correct me, but 1) I do not see how you could make reliable calls in the absence of information, and 2) I would not trust any calls from 1x samples. I guess you could take whatever calls you get, but you would then need to verify them by resequencing or Sanger sequencing anyway.
Heterozygous calls are only valid for loci where reads overlap, and at single-pass sequencing depth you have no reliable basis for judging heterozygous loci. If you're working with the human genome, you can use https://gencove.com/ for imputation, but I am not sure whether 1x depth is sufficient for imputing haplotypes!
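As a rough illustration of why single-read sites come out homozygous, here is a toy genotype-likelihood calculation. It is a deliberately simplified model (flat genotype prior, a fixed per-base error rate of 0.01 chosen arbitrarily, independent reads) and is not what HaplotypeCaller or DeepVariant actually compute; the function name is hypothetical.

```python
def genotype_likelihoods(n_ref: int, n_alt: int, error_rate: float = 0.01) -> dict:
    """Likelihood of the observed reads under each diploid genotype.

    Toy model: each read independently shows the alt allele with
    probability error_rate (0/0), 0.5 (0/1), or 1 - error_rate (1/1).
    The binomial coefficient is omitted because it is the same for
    every genotype and does not affect the comparison.
    """
    p_alt_given_gt = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    return {
        gt: (p ** n_alt) * ((1.0 - p) ** n_ref)
        for gt, p in p_alt_given_gt.items()
    }

def best_genotype(n_ref: int, n_alt: int) -> str:
    """Genotype with the highest likelihood under the toy model."""
    lk = genotype_likelihoods(n_ref, n_alt)
    return max(lk, key=lk.get)

if __name__ == "__main__":
    # One alt read, one ref read, or one of each.
    for n_ref, n_alt in [(0, 1), (1, 0), (1, 1)]:
        print(n_ref, n_alt, genotype_likelihoods(n_ref, n_alt), best_genotype(n_ref, n_alt))
```

In this model a single alt-supporting read makes 1/1 about twice as likely as 0/1, and a single ref-supporting read makes 0/0 about twice as likely as 0/1, so a heterozygous genotype is never the top call from one read. Only once both alleles are observed does 0/1 win, which is consistent with the homozygous-versus-heterozygous gap described in the question.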