Are there any issues with using sample reads coming from different technologies and having different depths to do joint variant calling (gatk)?
1
0
Entering edit mode
4.2 years ago
serpalma.v ▴ 80

Hello!

We are analyzing a WGS data of 60 samples (6 groups, 10 samples/group) produced by HiSeq4000. The mean coverage per sample is 25x (lowest sample is 15x).

Now we realized we need to sequence more samples in order to better estimate the allele frequencies. Due to budget and technical constrains we came down to sequence 90 samples (6 groups, 15 samples/group) at a target coverage of 5x. This time on a NovaSeq platform.

Now each group has 25 samples (10 from Hiseq4000 and 15 from NovaSeq).

Our aim is to do population analysis using SNP allele frequencies after combining the Hiseq4000 (25x coverage) data and the NovaSeq (5x coverage) data.

My plan for the new batch (NovaSeq - 5x) is to run it through the steps of GATK's best practices until HaplotypeCaller and then combine it with the original batch (Hiseq4000 - 25x) using CombineGVCFs and do joint calling with GenotypeGVCF.

I am working with mice samples, so I will do VQSR afterwards.

I have basically two questions:

  • Is there an issue with doing joint variant calling and VQSR using information from different thechnologies?
  • Would it be better to produce one VCF per batch and then merge them into one final VCF?

A similar thread is found here but data was produced with the same thechnology. Nonetheless, it is mentioned that different patterns of coverage could potentially create confusion in model building during VQSR.

I know this is not a "do this, do that" answer. I would appreciate comments and suggestions.

DISCLAIMER: I have posted this question on the gatk forum a while ago (~2mo), but they haven't had time to address my concerns. EDIT: I added a second question to the post.

gatk SNP variant calling hiseq novaseq • 1.5k views
ADD COMMENT
1
Entering edit mode

Illumina's guidance used to be "sequence produced should always be considered equivalent irrespective of type of sequencer used". This may be more or less true even now when we have expanded to 1-, 2- and 4 color chemistries from just 4-color. We had done some benchmarking for RNAseq data and results were identical when the same set of libraries were run on HiSeq 4000 and NovaSeq.

Have you done the sequencing for second round? If not I suggest that you add a few samples from last batch of HiSeq 4000 to get equivalent NovaSeq results. That should allow you to do direct comparisons on that subset of samples. You would want to try and keep other parameters (coverage etc) as equivalent as you can to prevent additional batch effects (if any).

ADD REPLY
0
Entering edit mode

But this is variant calling. There might be letter calling biases between the platforms, which would not affect RNASeq so much.

ADD REPLY
0
Entering edit mode

Possibly. That is the reason of my recommendation that OP include some past samples in new NovaSeq run to test that possibility directly. Sequencing being done is shallow (5x) so the results are not going to identify rare SNP.

ADD REPLY
0
Entering edit mode

We did not add samples from the last batch into the new batch

ADD REPLY
0
Entering edit mode

So you already did the sequencing?

ADD REPLY
0
Entering edit mode

yes, we just got the new data

ADD REPLY
0
Entering edit mode
4.2 years ago
bari.ballew ▴ 460

The answer will depend on your experimental design and the hypotheses you are testing.

If, for example, you have cases and controls in both batches, and you want to run a pairwise burden test to determine whether cases are enriched/depleted for variation in certain genes, then you ought to run the test independently on each batch, and then do a meta-analysis to combine. On the other hand, if you are trying to perform the same experiment but cases were sequenced in one batch and controls in the other, then there isn't much you can do to establish that any difference detected by your burden test is due to biological reality and not batch effect.

I would be wary of combining data that was not sequenced concurrently (though it happens all the time). Instead of calling everything jointly, you may want to call each set independently and set QC filtering criteria that are specific to each dataset, then combine calls. You might want to force call all positions (or call in gVCF mode) to avoid the problem of missing data upon combining the VCFs. Alternatively, you might consider using your deeper coverage dataset as the "discovery" set, and the lower coverage dataset as the "validation" set. Whatever you do, you should definitely perform targeted validation of anything you discover.

ADD COMMENT

Login before adding your answer.

Traffic: 2301 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6