Question

No reference bias?!

1

9.1 years ago

zhuozhu132 ▴ 30

Hey Biostars!

It's generally accepted that reads carrying the reference allele are more likely to align to the reference genome, and that this bias causes artifacts when estimating allelic expression. Surprisingly, I found no reference bias at all in my data: the average reference ratio (= ref counts / total counts) across all heterozygous loci is 49.5%. My RNA-Seq reads were aligned to reference by using STAR, and then variants were identified by following GATK best practices. When I calculated the reference ratio at each locus, I used the read counts directly from the VCF output.
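For concreteness, the calculation described above might look like the minimal sketch below. It assumes a VCF whose per-sample FORMAT column carries `GT` and `AD` (allelic depths, ref then alt), as GATK output typically does; the function names `ref_ratios` and `summarize` are illustrative only, and this is not necessarily the exact procedure used for the 49.5% figure.

```python
def ref_ratios(vcf_lines, sample_index=0):
    """Yield (chrom, pos, ref_count, alt_count, ratio) for biallelic het sites.

    Assumes FORMAT contains GT and AD; sites without usable AD are skipped.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 or "," in fields[4]:
            continue  # no sample column, or multi-allelic site
        keys = fields[8].split(":")
        values = fields[9 + sample_index].split(":")
        sample = dict(zip(keys, values))
        genotype = sample.get("GT", "").replace("|", "/")
        if genotype not in ("0/1", "1/0"):
            continue  # keep heterozygous calls only
        try:
            ref_n, alt_n = (int(x) for x in sample.get("AD", "").split(","))
        except ValueError:
            continue  # AD missing or malformed
        total = ref_n + alt_n
        if total == 0:
            continue
        yield fields[0], int(fields[1]), ref_n, alt_n, ref_n / total


def summarize(sites):
    """Return (mean of per-site ratios, pooled ratio) for a list of sites."""
    sites = list(sites)
    mean_ratio = sum(s[4] for s in sites) / len(sites)
    pooled = sum(s[2] for s in sites) / sum(s[2] + s[3] for s in sites)
    return mean_ratio, pooled
```

Note that the mean of per-site ratios and the pooled ratio (total ref reads over total reads) can differ, because the pooled value is dominated by highly expressed sites.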

Does this make any sense to you? Maybe I did something wrong? My thought is, variant calling favors the loci with alternative allele, maybe this will cause reference bias reduced.

Hope to hear your insight! Thank you!

SNP RNA-Seq • 2.2k views

updated 23 months ago by Ram 43k • written 9.1 years ago by zhuozhu132 ▴ 30

Ram · Answer 1 · 2015-03-20

3

9.1 years ago

karl.stamm 4.1k

"My thought is, variant calling favors the loci with alternative allele, maybe this will cause reference bias reduced."

That's exactly it. You used the same sequencing experiment both to measure expression and to call variants. Consider the extreme situations and how they would be interpreted:

If a heterozygous site expresses only the REF allele, your GATK-BP would miss it. No variant is seen, so it is called hom-ref.

If a heterozygous site expresses only the ALT allele, your GATK-BP would call it homozygous ALT, which looks like normal allelic expression.

If a het site is expressed 90%/10%, then GATK-BP might assign a low quality score and filter it away as an uncertain variant.

What this means is you need an independent determination of sample genotypes, either by genome sequencing, exome sequencing, or microarray. Calling genotypes from the RNA-Seq itself will inherently miss the most severe allelic bias. You could probably re-tune GATK-BP to report more 90%/10% situations, but you'll get a lot of false positives that way too.
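The three cases above can be illustrated with a toy simulation. This is only a sketch under loudly stated assumptions: the "caller" is a hypothetical allele-fraction filter (keep a site as het only if the ALT fraction falls between 20% and 80%), not GATK's actual model, and the true REF fraction per site is drawn from an arbitrary distribution centred on 0.55 to mimic a mapping bias.

```python
import random


def simulate(n_sites=5000, depth=30, mean_ref=0.55, sd=0.2,
             min_alt_frac=0.2, max_alt_frac=0.8, seed=0):
    """Return (ratios at all true het sites, ratios at 'called' het sites).

    All parameters are illustrative assumptions, not GATK settings.
    """
    rng = random.Random(seed)
    all_ratios, called_ratios = [], []
    for _ in range(n_sites):
        # True REF expression fraction at this het site, clamped to [0, 1].
        p_ref = min(1.0, max(0.0, rng.gauss(mean_ref, sd)))
        # Binomial sampling of REF reads at fixed depth.
        ref_n = sum(rng.random() < p_ref for _ in range(depth))
        ratio = ref_n / depth
        all_ratios.append(ratio)
        # Toy caller: a het is reported only at a moderate ALT fraction.
        alt_frac = 1.0 - ratio
        if min_alt_frac <= alt_frac <= max_alt_frac:
            called_ratios.append(ratio)
    return all_ratios, called_ratios


def mean(values):
    return sum(values) / len(values)


if __name__ == "__main__":
    all_r, called_r = simulate()
    print(f"all true het sites: n={len(all_r)}, mean ref ratio={mean(all_r):.3f}")
    print(f"called het sites:   n={len(called_r)}, mean ref ratio={mean(called_r):.3f}")
```

Because the filter cuts off more of the REF-heavy tail than the ALT-heavy tail, the mean reference ratio over the called sites ends up closer to 50% than the mean over all true het sites, which is the masking effect described above.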

updated 23 months ago by Ram 43k • written 9.1 years ago by karl.stamm 4.1k

0

Thank you very much for the reply, Karl! Actually the heterozygous sites I used were identified by genome sequencing. It's really strange to me.

updated 23 months ago by Ram 43k • written 9.1 years ago by zhuozhu132 ▴ 30

0

The original question says "My RNA-Seq reads were aligned to reference by using STAR, and then variants were identified by following GATK best practices." If that isn't what happened, you'll have to be clearer. Now there's genome sequencing as well? Won't most het sites be non-coding or non-expressed? Maybe it's a simple filtering mistake that leads us back to the RNA expression bias.

updated 23 months ago by Ram 43k • written 9.1 years ago by karl.stamm 4.1k