Population / Reference bias in GATK BQSR BaseRecalibrator?
Entering edit mode
20 months ago

Hi, I work in sunflower (H. annuus) transcriptomics. I'm trying to genotype a big amount of RNAseqs samples from wild accessions (coming from distant places).

I have used the reference genome for alignment and resulted a not-so-terrible alignment rate (mean=92%,min=81%,max=95%). Now, for the BQSR step I have a reference SNP database (a big .vcf file) made from commercial cultivars (like the reference genome). Considering I expect important differences between the genomes and the reference, Is it OK to use the "reference" SNPs? Should I ignore it and use the VariantCalling->BQSR->VariantCalling approach? Or maybe a mixed approach where I feed the reference .vcf AND the new .vcf to the BQSR?

What I have googled about reference bias in variant calling, is that the bias starts in the alignment step (of course) and trikcles down from there; but not mention of this.

Now, I did a little test of both alternatives in a single sample: I called variants, hard filtered and bqsr ('self_snpdb'), parallel to just bqsr with the reference snpdb.

Firstly, only 37% of variant positions in the 'self_snpdb' where present in the 'reference_snpdb'; which make sense given the population divergence.

Then, I used GATK AnalyzeCovariates to compare the bqsr_tables. Resulting in these graphs: ('before' corresponds to the reference_db bqsr table; 'after' corresponds to the self_snpdb bqsr table)

There seems to be a clear difference, but I really don't know what to interpret from here.

1° set of graphs

2° set of graphs

Thanks for your time! Andrés

GATK BQSR population bias non model organism • 560 views

Login before adding your answer.

Traffic: 2004 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6