Question: No reference bias?!
Asked 5.8 years ago by zhuozhu132 (United States):

Hey Biostars!

It's generally accepted that reads carrying the reference allele are more likely to align to the reference genome, and that this bias produces artifacts when estimating allelic expression. Surprisingly, I found no reference bias at all in my data: the average reference ratio (ref counts / total counts) across all heterozygous loci is 49.5%. My RNA-Seq reads were aligned to the reference with STAR, and variants were then identified by following the GATK best practices. To calculate the reference ratio at each locus, I used the read counts directly from the VCF output.
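For readers who want to reproduce this number, here is a minimal sketch of the per-locus calculation. It assumes the VCF carries a per-sample `AD` (allelic depth) FORMAT field, as GATK's callers normally write, and that only biallelic sites genotyped 0/1 are of interest. The function names `ref_ratios` and `mean_ref_ratio` are made up for illustration and are not part of any tool mentioned in this thread.

```python
def ref_ratios(vcf_lines, sample_index=0):
    """Yield (chrom, pos, ref_count, alt_count, ref_ratio) for biallelic het sites.

    Assumes tab-separated VCF records with GT and AD in the FORMAT column.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no sample columns
        chrom, pos, alt = fields[0], int(fields[1]), fields[4]
        if "," in alt:
            continue  # skip multi-allelic sites to keep the ratio well defined
        keys = fields[8].split(":")
        values = fields[9 + sample_index].split(":")
        sample = dict(zip(keys, values))
        genotype = sample.get("GT", "").replace("|", "/")
        if genotype not in ("0/1", "1/0"):
            continue  # keep heterozygous calls only
        try:
            ref_count, alt_count = (int(x) for x in sample.get("AD", "").split(","))
        except ValueError:
            continue  # AD missing or malformed
        total = ref_count + alt_count
        if total == 0:
            continue
        yield chrom, pos, ref_count, alt_count, ref_count / total


def mean_ref_ratio(vcf_lines, sample_index=0):
    """Average the per-site reference ratio over all heterozygous sites."""
    ratios = [r[4] for r in ref_ratios(vcf_lines, sample_index)]
    return sum(ratios) / len(ratios) if ratios else None
```

Typical use would be `mean_ref_ratio(open("calls.vcf"))`, where `calls.vcf` is a placeholder for an uncompressed VCF. Note that this averages per-site ratios; pooling all reads before dividing weights high-coverage sites more heavily and can give a different number.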

Does this make sense to you? Maybe I did something wrong? My thought is that variant calling favors loci with the alternative allele, which may reduce the reference bias.

Hope to hear your insight! Thank you!

Tags: snp, rna-seq

Answer by karl.stamm (United States), 5.8 years ago:

"My thought is that variant calling favors loci with the alternative allele, which may reduce the reference bias."

That's exactly it. You used the same sequencing experiment to measure expression and to call variants. Consider the extreme situations and how they would be interpreted:

If a heterozygous site expresses only the REF allele, GATK best practices (GATK-BP) would miss it: no variant is seen, so the site is called hom-ref.

If a heterozygous site expresses only the ALT allele, GATK-BP would call it homozygous ALT, which looks like normal allelic expression.

If a het site is expressed at 90%/10%, GATK-BP might assign a low quality score, and the site could be filtered out as an uncertain variant.
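The three cases above can be written down as a toy decision rule. This is only a sketch of the reasoning, not how GATK actually genotypes (GATK uses genotype likelihoods and quality scores); the function name `call_from_rna` and the `min_minor_fraction` cutoff are hypothetical.

```python
def call_from_rna(ref_count, alt_count, min_minor_fraction=0.2):
    """Toy genotype call made from RNA-Seq read counts alone.

    min_minor_fraction is a made-up stand-in for whatever quality
    filtering would discard strongly imbalanced sites.
    """
    total = ref_count + alt_count
    if total == 0:
        return "no-call"    # site not expressed, nothing to call
    if alt_count == 0:
        return "hom-ref"    # true het expressing only REF is missed
    if ref_count == 0:
        return "hom-alt"    # true het expressing only ALT looks homozygous
    minor_fraction = min(ref_count, alt_count) / total
    if minor_fraction < min_minor_fraction:
        return "filtered"   # e.g. 90%/10% treated as an uncertain variant
    return "het"            # only fairly balanced sites survive as hets
```

Under this rule, only sites that already look balanced are labeled "het", so any statistic averaged over those sites is pulled toward 50%.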

What this means is that you need an independent determination of the sample genotypes, by genome sequencing, exome sequencing, or microarray. Using the RNA-Seq data alone will inherently miss the most severe allelic bias. You could probably re-tune GATK-BP to report more 90%/10% situations, but you would get a lot of false positives that way too.
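To see how much this ascertainment effect could matter, here is a small simulation sketch. Everything in it is an assumption chosen for illustration: the Beta(2.2, 1.8) distribution of true reference fractions (mean 0.55, i.e. a built-in bias toward REF), the fixed depth of 30 reads, and the 20% minor-allele cutoff standing in for the caller's filtering. It compares the mean reference ratio over all true het sites (what independent genotypes would give) with the mean over sites that would still be called het from the RNA reads.

```python
import random


def simulate(n_sites=5000, depth=30, min_minor_fraction=0.2, seed=0):
    """Return (mean ratio over all het sites,
               mean ratio over RNA-called het sites,
               fraction of sites still called het)."""
    rng = random.Random(seed)
    all_ratios, called_ratios = [], []
    for _ in range(n_sites):
        # Hypothetical true REF fraction at this het site (mean 0.55).
        p_ref = rng.betavariate(2.2, 1.8)
        # Draw the REF read count as a binomial(depth, p_ref) sample.
        ref_count = sum(rng.random() < p_ref for _ in range(depth))
        ratio = ref_count / depth
        all_ratios.append(ratio)
        # Toy caller: the site is kept as a het only if both alleles
        # are reasonably well represented in the RNA reads.
        minor_fraction = min(ref_count, depth - ref_count) / depth
        if minor_fraction >= min_minor_fraction:
            called_ratios.append(ratio)

    def mean(xs):
        return sum(xs) / len(xs)

    return mean(all_ratios), mean(called_ratios), len(called_ratios) / n_sites


if __name__ == "__main__":
    all_mean, called_mean, kept = simulate()
    print(f"all true het sites:   mean ref ratio = {all_mean:.3f}")
    print(f"RNA-called het sites: mean ref ratio = {called_mean:.3f}")
    print(f"fraction of sites kept as het:        {kept:.3f}")
```

With these assumptions the RNA-called subset drops more REF-heavy sites than ALT-heavy ones, so its mean sits closer to 0.5 than the mean over all sites. The size of the shift depends entirely on the assumed distribution and cutoff.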

 


Reply by zhuozhu132, 5.8 years ago:

Thank you very much for the reply, Karl! Actually, the heterozygous sites I used were identified by genome sequencing. It's really strange to me.

Reply by karl.stamm, 5.8 years ago:

The original question says "My RNA-Seq reads were aligned to reference by using STAR, and then variants were identified by following GATK best practices." If that isn't what happened, you'll have to be clearer. So there is also genome sequencing? Won't most het sites be non-coding or non-expressed? Maybe a simple filtering mistake leads us back to the RNA expression bias.