Question: Population / Reference bias in GATK BQSR BaseRecalibrator?
gravatar for Andrés Ribone
12 months ago by
Andrés Ribone0 wrote:

Hi, I work in sunflower (H. annuus) transcriptomics. I'm trying to genotype a big amount of RNAseqs samples from wild accessions (coming from distant places).

I have used the reference genome for alignment and resulted a not-so-terrible alignment rate (mean=92%,min=81%,max=95%). Now, for the BQSR step I have a reference SNP database (a big .vcf file) made from commercial cultivars (like the reference genome). Considering I expect important differences between the genomes and the reference, Is it OK to use the "reference" SNPs? Should I ignore it and use the VariantCalling->BQSR->VariantCalling approach? Or maybe a mixed approach where I feed the reference .vcf AND the new .vcf to the BQSR?

What I have googled about reference bias in variant calling, is that the bias starts in the alignment step (of course) and trikcles down from there; but not mention of this.

Now, I did a little test of both alternatives in a single sample: I called variants, hard filtered and bqsr ('self_snpdb'), parallel to just bqsr with the reference snpdb.

Firstly, only 37% of variant positions in the 'self_snpdb' where present in the 'reference_snpdb'; which make sense given the population divergence.

Then, I used GATK AnalyzeCovariates to compare the bqsr_tables. Resulting in these graphs: ('before' corresponds to the reference_db bqsr table; 'after' corresponds to the self_snpdb bqsr table)

There seems to be a clear difference, but I really don't know what to interpret from here.

1° set of graphs

2° set of graphs

Thanks for your time! Andrés

ADD COMMENTlink written 12 months ago by Andrés Ribone0
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1805 users visited in the last hour