Base recalibration for data cleanup
6.7 years ago

Hi all! I am following the GATK Best Practices tutorial to clean up a DNA-seq dataset from a non-model organism (whole genome of a single individual). Everything was going fine until I reached the base quality score recalibration (BQSR) step.

If there isn't a trustworthy SNP database available yet (which is my case), this is what GATK recommends: you can bootstrap a database of known SNPs. Here's how it works:

1. First do an initial round of SNP calling on your original, unrecalibrated data.
2. Then take the SNPs that you have the highest confidence in and use that set as the database of known SNPs by feeding it as a VCF file to the base quality score recalibrator.
3. Finally, do a real round of SNP calling with the recalibrated data.

These steps could be repeated several times until convergence.
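For reference, one bootstrap round could be sketched roughly as below. This is only an illustration under several assumptions that are not part of the GATK text quoted above: it assumes GATK4 command-line syntax, it uses GATK's generic SNP hard-filtering thresholds as a stand-in for "highest confidence" (you would likely want stricter, organism-specific cutoffs), it recalibrates the original BAM in every round instead of stacking recalibrations, and the function name `bootstrap_round` and the `roundN.*` file names are hypothetical. The function only builds the command lines; it does not run anything.

```python
from typing import List

# Assumption: GATK's generic hard-filter suggestions for SNPs, used only as a
# starting point. Sites matching this expression are treated as low confidence.
SNP_FILTER = "QD < 2.0 || FS > 60.0 || MQ < 40.0 || SOR > 3.0"


def bootstrap_round(ref: str, original_bam: str, calling_bam: str, n: int) -> List[List[str]]:
    """Build the GATK4 commands for bootstrap round n (nothing is executed).

    calling_bam is the BAM used for SNP calling in this round (the original
    BAM in round 1, the previous round's recalibrated BAM afterwards).
    original_bam is the BAM that gets recalibrated.
    """
    raw = f"round{n}.raw.vcf.gz"
    snps = f"round{n}.snps.vcf.gz"
    flagged = f"round{n}.flagged.vcf.gz"
    known = f"round{n}.known.vcf.gz"
    table = f"round{n}.recal.table"
    recal_bam = f"round{n}.recal.bam"
    return [
        # Step 1: call variants on the current data.
        ["gatk", "HaplotypeCaller", "-R", ref, "-I", calling_bam, "-O", raw],
        # Step 2a: keep SNPs only.
        ["gatk", "SelectVariants", "-R", ref, "-V", raw,
         "--select-type-to-include", "SNP", "-O", snps],
        # Step 2b: flag low-confidence SNPs in the FILTER column.
        ["gatk", "VariantFiltration", "-R", ref, "-V", snps,
         "--filter-expression", SNP_FILTER,
         "--filter-name", "low_confidence", "-O", flagged],
        # Step 2c: drop the flagged sites; what remains is the "known" set.
        ["gatk", "SelectVariants", "-R", ref, "-V", flagged,
         "--exclude-filtered", "-O", known],
        # Step 2d: feed the known set to the recalibrator.
        ["gatk", "BaseRecalibrator", "-R", ref, "-I", original_bam,
         "--known-sites", known, "-O", table],
        # Apply the model; the next round (or the final calling) uses this BAM.
        ["gatk", "ApplyBQSR", "-R", ref, "-I", original_bam,
         "--bqsr-recal-file", table, "-O", recal_bam],
    ]


if __name__ == "__main__":
    for cmd in bootstrap_round("ref.fa", "sample.bam", "sample.bam", 1):
        print(" ".join(cmd))
```

Each command could be run with `subprocess.run(cmd, check=True)`. Step 3 (the "real" round of calling) is then just the HaplotypeCaller step of the next round, with `calling_bam` set to the previous round's recalibrated BAM. One rough way to judge convergence is to compare the known-sites set or the recalibration tables between consecutive rounds and stop when they barely change.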

Can anyone provide further details on these steps? (the second step in particular).

Kind regards. Luciano

DNAseq data clean up • 2.1k views

On what platform did the sequencing take place? There were some publications in the last years (and comments here at biostars) that state that BQSR has negligible or no beneficial effects at all, as today's sequencing platforms create very trustworthy base quality scores.


"There were some publications in the last years (and comments here at biostars) that state that BQSR has negligible or no beneficial effects at all,"

That does not surprise me; I have also found recalibration has little effect on variant calling in most cases.

"as today's sequencing platforms create very trustworthy base quality scores."

I'd disagree with this, though. Illumina, in particular, tends to have extremely inaccurate quality scores on some platforms.


OK, that was new to me. Are there any evaluations of base quality accuracy available for the different Illumina platforms?


Here's an example for an early run on our NextSeq, compared to data from one of our HiSeq 2500s:

http://seqanswers.com/forums/showpost.php?p=156399&postcount=18

I've generated similar data for MiSeq, newer NextSeq runs (which are better than older ones), and NovaSeq, but they are kind of scattered around and I don't remember where they all are.

This is a link to our first NovaSeq run results, which has absurdly bad quality accuracy. That run also had an illumination failure, which excuses its low quality, but NOT the quality accuracy. However, a subsequent run did not have an illumination failure and the quality accuracy was extremely good (aside from the fact that it still only has 3 quality scores).
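To make "quality accuracy" concrete: a common way to check it is to bin aligned bases by their reported quality score, count mismatches against the reference in each bin, and compare the reported score with the Phred-scaled observed error rate, Q = -10 * log10(error rate). The sketch below shows only that arithmetic. The function name `empirical_phred`, the add-one smoothing, and the counts in the usage line are illustrative assumptions, not data from any of the runs mentioned here.

```python
import math


def empirical_phred(mismatches: int, bases: int) -> float:
    """Phred-scaled observed error rate for one reported-quality bin."""
    if bases <= 0 or mismatches < 0 or mismatches > bases:
        raise ValueError("need 0 <= mismatches <= bases and bases > 0")
    # Add-one smoothing (an assumption of this sketch) avoids log10(0)
    # when a bin happens to contain no mismatches.
    error_rate = (mismatches + 1) / (bases + 2)
    return -10.0 * math.log10(error_rate)


if __name__ == "__main__":
    # Hypothetical bin: 1,000,000 bases reported as Q37 with 1,000 mismatches.
    reported_q = 37
    observed_q = empirical_phred(mismatches=1_000, bases=1_000_000)
    print(f"reported Q{reported_q}, observed about Q{observed_q:.1f}")
```

With these made-up counts the observed quality comes out near Q30, so bases labelled Q37 would be overstated by about 7 Phred units. In practice, true variants also show up as mismatches, which is exactly why BQSR wants a known-sites file to mask them.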


Thanks for your reply, ATPoint. Sequencing was performed on a HiSeq 2500 with v4 sequencing chemistry. I was aware of the discussion about how much the base recalibration (BQSR) step really improves a dataset. However, I am still not sure how to generate a trustworthy VCF file that would help me distinguish real SNPs from sequencing errors. I would appreciate a citation for any paper discussing this issue. Thanks again!
