We are analyzing WGS data from 60 samples (6 groups, 10 samples/group) sequenced on a HiSeq 4000. The mean coverage per sample is 25x (the lowest sample is at 15x).
We have now realized that we need to sequence more samples in order to better estimate the allele frequencies. Due to budget and technical constraints, we settled on sequencing 90 additional samples (6 groups, 15 samples/group) at a target coverage of 5x, this time on a NovaSeq platform.
Each group now has 25 samples (10 from the HiSeq 4000 and 15 from the NovaSeq).
Our aim is to do population analysis using SNP allele frequencies after combining the HiSeq 4000 (25x coverage) data and the NovaSeq (5x coverage) data.
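As a rough illustration of why we want the extra samples, here is a back-of-the-envelope sketch of the binomial sampling error of an allele frequency estimate for 10 versus 25 diploid samples per group. It assumes genotypes are known without error, which will certainly not hold for the 5x samples, so it is only a best-case figure; the function name is just for illustration.

```python
import math

def allele_freq_se(p: float, n_diploid: int) -> float:
    """Binomial standard error of an allele frequency estimate.

    Assumes 2 * n_diploid independently sampled chromosomes and
    error-free genotypes (an idealization, especially at 5x coverage).
    """
    n_chromosomes = 2 * n_diploid
    return math.sqrt(p * (1.0 - p) / n_chromosomes)

# Worst case for sampling error is p = 0.5.
se_10 = allele_freq_se(0.5, 10)  # original design: 10 samples/group
se_25 = allele_freq_se(0.5, 25)  # combined design: 25 samples/group

print(f"SE with 10 samples/group: {se_10:.3f}")  # about 0.112
print(f"SE with 25 samples/group: {se_25:.3f}")  # about 0.071
```

In practice the gain will be smaller than this, because genotype uncertainty in the low-coverage batch adds to the sampling error.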
My plan for the new batch (NovaSeq, 5x) is to run it through the steps of GATK's Best Practices up to HaplotypeCaller, then combine it with the original batch (HiSeq 4000, 25x) using CombineGVCFs and do joint calling with GenotypeGVCFs.
I am working with mouse samples, so I will do VQSR afterwards.
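To make the plan concrete, here is a minimal sketch of the commands I have in mind for the steps up to joint genotyping, written as a small Python helper that only builds the GATK4 command lines (it does not run them). The reference path, sample names, and output file names are all hypothetical placeholders, and VQSR is left out because it depends on the training resources used.

```python
from typing import List

REFERENCE = "mouse_ref.fa"  # hypothetical reference path

def haplotypecaller_cmd(bam: str, gvcf: str, reference: str = REFERENCE) -> List[str]:
    """Per-sample calling in GVCF mode."""
    return ["gatk", "HaplotypeCaller", "-R", reference,
            "-I", bam, "-O", gvcf, "-ERC", "GVCF"]

def combine_gvcfs_cmd(gvcfs: List[str], out: str, reference: str = REFERENCE) -> List[str]:
    """Merge per-sample GVCFs from both batches into one multi-sample GVCF."""
    cmd = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    return cmd + ["-O", out]

def genotype_gvcfs_cmd(combined: str, out: str, reference: str = REFERENCE) -> List[str]:
    """Joint genotyping across all samples."""
    return ["gatk", "GenotypeGVCFs", "-R", reference, "-V", combined, "-O", out]

# Hypothetical sample names: 60 HiSeq 4000 samples and 90 NovaSeq samples.
hiseq = [f"hiseq_{i:02d}" for i in range(1, 61)]
novaseq = [f"novaseq_{i:02d}" for i in range(1, 91)]

# Only the new batch still needs HaplotypeCaller; the old batch already has GVCFs.
hc_cmds = [haplotypecaller_cmd(f"{s}.bam", f"{s}.g.vcf.gz") for s in novaseq]

all_gvcfs = [f"{s}.g.vcf.gz" for s in hiseq + novaseq]
combine_cmd = combine_gvcfs_cmd(all_gvcfs, "combined.g.vcf.gz")
genotype_cmd = genotype_gvcfs_cmd("combined.g.vcf.gz", "joint.vcf.gz")

print(" ".join(hc_cmds[0]))
print(" ".join(genotype_cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` to actually execute the step.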
I have basically two questions:
- Is there an issue with doing joint variant calling and VQSR using data from different technologies?
- Would it be better to produce one VCF per batch and then merge them into one final VCF?
A similar thread can be found here, but there the data were produced with the same technology. Nonetheless, it mentions that different patterns of coverage could potentially confuse model building during VQSR.
I know this question does not have a simple "do this, do that" answer, so I would appreciate any comments and suggestions.
DISCLAIMER: I posted this question on the GATK forum a while ago (~2 months), but they haven't had time to address my concerns. EDIT: I added a second question to the post.