Question

Population / Reference bias in GATK BQSR BaseRecalibrator?

0

Entering edit mode

4.4 years ago

Andrés Ribone ▴ 60

Hi, I work in sunflower (H. annuus) transcriptomics. I'm trying to genotype a big amount of RNAseqs samples from wild accessions (coming from distant places).

I have used the reference genome for alignment and resulted a not-so-terrible alignment rate (mean=92%,min=81%,max=95%). Now, for the BQSR step I have a reference SNP database (a big .vcf file) made from commercial cultivars (like the reference genome). Considering I expect important differences between the genomes and the reference, Is it OK to use the "reference" SNPs? Should I ignore it and use the VariantCalling->BQSR->VariantCalling approach? Or maybe a mixed approach where I feed the reference .vcf AND the new .vcf to the BQSR?

What I have googled about reference bias in variant calling, is that the bias starts in the alignment step (of course) and trikcles down from there; but not mention of this.

Now, I did a little test of both alternatives in a single sample: I called variants, hard filtered and bqsr ('self_snpdb'), parallel to just bqsr with the reference snpdb.

Firstly, only 37% of variant positions in the 'self_snpdb' where present in the 'reference_snpdb'; which make sense given the population divergence.

Then, I used GATK AnalyzeCovariates to compare the bqsr_tables. Resulting in these graphs: ('before' corresponds to the reference_db bqsr table; 'after' corresponds to the self_snpdb bqsr table)

There seems to be a clear difference, but I really don't know what to interpret from here.

1° set of graphs

2° set of graphs

Thanks for your time! Andrés

GATK BQSR population bias non model organism • 1.1k views

ADD COMMENT • link 4.4 years ago by Andrés Ribone ▴ 60