Inferring genotype based on RNA sequences (RNA-seq variant calling)
5.4 years ago
omer.k ▴ 90

This is a bit of an I'm-not-sure question. I extracted human total RNA, then performed a basic DNase I treatment and an integrity check. Reverse transcription was primed with a gene-specific primer, followed by RNase H treatment and then PCR. The cDNA amplicon was sequenced on an ONT MinION.

I'm running the alignment data through the SAMtools/BCFtools pipeline for SNV calling.

samtools mpileup -g -f genomes/some_mRNA_seq.fna alignments/sim_reads_aligned.sorted.bam > variants/sim_variants.bcf
bcftools call -c -v variants/sim_variants.bcf > variants/sim_variants.vcf

The data support heterozygosity. My question is: assuming transcription occurred from both chromosomes, can I really infer the genotype, i.e. the homo-/heterozygosity of the SNV suggested by the VCF output, from RNA reads alone? My intuition tells me it's not so trivial and perhaps even wrong. I've searched the literature but haven't found anything well established.
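To make "data supporting heterozygosity" concrete, here is a rough Python sketch of how the read support behind a call could be inspected. It assumes the VCF INFO column carries a DP4 tag (ref-forward, ref-reverse, alt-forward, alt-reverse read counts); the function name and the example record are made up for illustration and are not from my data.

```python
# Minimal sketch: alternate-allele read fraction from the DP4 INFO tag.
# Assumption: DP4 = ref-forward, ref-reverse, alt-forward, alt-reverse counts.

def alt_fraction_from_info(info):
    """Return the alt-read fraction from a VCF INFO string, or None if
    DP4 is absent or reports zero reads."""
    for field in info.split(";"):
        if field.startswith("DP4="):
            ref_fwd, ref_rev, alt_fwd, alt_rev = (
                int(x) for x in field[len("DP4="):].split(",")
            )
            total = ref_fwd + ref_rev + alt_fwd + alt_rev
            return (alt_fwd + alt_rev) / total if total else None
    return None


# Hypothetical record, for illustration only.
example_line = (
    "some_mRNA\t1042\t.\tA\tG\t88\t.\t"
    "DP=60;DP4=14,16,15,15;MQ=60\tGT:PL\t0/1:118,0,121"
)
info_column = example_line.split("\t")[7]
print(alt_fraction_from_info(info_column))  # 30 alt reads out of 60 -> 0.5
```

An alt fraction near 0.5 in cDNA reads describes the transcript pool, which is exactly why I'm unsure it can be read as a genotype.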

I'd be happy to learn more, so it would be great if answers were backed by research already done in this context.

RNA-Seq SNP • 8.7k views

In general, transcription may not involve both copies/alleles of a gene. The two alleles may be expressed at different ratios, one of them may be imprinted, or expression may be low under the measured conditions. Whatever information comes from RNA-seq is limited to the copies that are transcriptionally active, and measurable, under the experimental conditions.
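As a back-of-the-envelope illustration of why this matters for genotype calls, the sketch below uses a plain binomial model to ask how likely it is that a truly heterozygous site shows few or no alt reads in RNA. The model and the function name are my own simplification: it assumes independent reads and ignores sequencing error, which is not negligible for nanopore data.

```python
from math import comb


def prob_at_most_k_alt_reads(depth, alt_expression_fraction, k):
    """Binomial probability of seeing <= k alt reads at a heterozygous site,
    given the fraction of transcripts that come from the alt allele."""
    p = alt_expression_fraction
    return sum(
        comb(depth, i) * p**i * (1 - p) ** (depth - i) for i in range(k + 1)
    )


# Balanced expression at depth 30: missing the alt allele entirely is
# vanishingly unlikely (0.5 ** 30).
print(prob_at_most_k_alt_reads(30, 0.5, 0))

# Strongly skewed expression (5% of transcripts from the alt allele) at the
# same depth: the chance of zero alt reads is 0.95 ** 30.
print(prob_at_most_k_alt_reads(30, 0.05, 0))
```

Under skewed or mono-allelic expression, a heterozygous site can therefore look homozygous in the RNA data even at reasonable depth.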

5.4 years ago

The argument for variant calling from RNA-seq data usually centres on the fact that it can be cost-effective and remove the need to do both DNA- and RNA-seq (and use up both DNA and RNA / mRNA / cDNA in the process). When I think about 'cost effectiveness' in broader terms, I realise that it invariably equates to lower 'quality' and lower sensitivity or specificity (or both) compared to gold standards.

Whilst one can very easily call variants from RNA-seq data, one misses the following types of variants:

  • variants in alleles that are not expressed (obvious)
  • certain types of regulatory variants (obvious)
  • variants that result in the gene's transcript undergoing nonsense-mediated decay (NMD) (these may still be detected depending on the wet-lab capture method employed)
  • variants that result in haploinsufficiency
  • splicing variants
  • variants in low-expressed genes, e.g., non-coding genes (could be detected at low read-depths, but then you would introduce false-positive variant calls elsewhere)

...and I'm sure that there are many more types that are missed.

The Broad Institute have a 'best practices' pipeline for RNA-seq variant calling on their website but, to my knowledge, it has never been published in a scientific journal. When posted, it had also been tested on just a single sample (their words). I take issue with this because many look up to the Broad as a reputable organisation. When they see the Broad outlining methods on their website, they logically assume that the method must be okay to use, and may not understand its limitations unless these are clearly stated up front. They do state some limitations lower down, which raises concern:

Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline.


The situation appears even more alarming when one reads anecdotal and published evidence from people who have compared RNA-seq variant calls to whole-exome sequencing (WES) variant calls. Scattered across the web, I've seen reports that RNA-seq variant calling detects only ~30% to ~70% of the variant calls that WES detects, and I assume that these comparisons were restricted to variants in exons that could be expected to be covered by RNA-seq reads.
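For anyone wanting to run this kind of comparison on their own samples, a minimal Python sketch of the overlap calculation is below. It assumes each call set has already been reduced to (chrom, pos, ref, alt) tuples restricted to comparable regions; the function name and the toy variants are hypothetical, and real comparisons would also need indel normalisation and handling of multi-allelic sites.

```python
def compare_call_sets(reference_calls, test_calls):
    """Summarise the overlap between two sets of (chrom, pos, ref, alt) tuples."""
    shared = reference_calls & test_calls
    return {
        "shared": len(shared),
        "reference_only": len(reference_calls - test_calls),
        "test_only": len(test_calls - reference_calls),
        "fraction_of_reference_recovered": (
            len(shared) / len(reference_calls) if reference_calls else None
        ),
    }


# Toy call sets, made up for illustration only.
wes_calls = {
    ("chr1", 1000, "A", "G"),
    ("chr1", 2000, "C", "T"),
    ("chr2", 500, "G", "A"),
    ("chr2", 900, "T", "C"),
}
rna_calls = {
    ("chr1", 1000, "A", "G"),
    ("chr2", 500, "G", "A"),
    ("chr3", 42, "C", "G"),
}

summary = compare_call_sets(wes_calls, rna_calls)
print(summary["fraction_of_reference_recovered"])  # 2 of 4 WES calls -> 0.5
```

Treating WES as the reference is itself a choice: the "test_only" count will include both RNA-seq false positives and real events that WES cannot see.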

So, if you do variant calling from RNA-seq, you have to be fully aware of the limitations. A recent trend in research appears to be toward cost saving in various ways, but this will gradually bring more 'noise' into our data and result in larger problems further down the line.

Others likely have other opinions. Some who defend variant calling from RNA-seq data may be those who have already performed it and published data from it. On that note, also keep in mind that most journals are profit-oriented and need to publish works in order to survive. The field is also now 'flooded' with bogus journals that will publish anything, fantasy or otherwise.



Some time ago I did a comparison of WES and RNA-Seq variant calling in a setting with more samples (92) but with different software for the calling (GATK vs. Samtools), which might introduce additional bias. I can confirm that the overlap between both methods is rather low. You may want to have a look at the poster we presented (see Figure 8).

However, I personally think that mono-allelic expression (or an exon not being expressed) and NMD are valuable information you gain from RNA-Seq. If a mutation is not expressed and never manifests in the proteome, it might actually be irrelevant for the phenotype (e.g., cancer). In our study, nearly 40% of WES-specific mutations fell into this "likely-irrelevant" category. So, in terms of quality, I believe a combination of both technologies is most valuable!


Thanks for adding your comments, Manuel Landesfeind. Agreed, it's great to have both DNA- and RNA-seq.

5.4 years ago
igor 13k

There has been a lot of debate about RNA- and DNA-seq variant calling, but there are surprisingly few proper benchmarks. If you are interested in the variant frequency, there is a study by Castle et al. where they specifically check that:

We found that 99% of the DNA mutations in expressed genes are expressed as RNA. Moreover, we found a high correlation between the DNA and RNA mutation allele frequency. Exceptions are mutations that cause premature termination codons and therefore activate nonsense-mediated decay. Beyond this, we did not find evidence of any wide-scale mechanism, such as allele-specific epigenetic silencing, preferentially promoting mutated or wild-type alleles. In conclusion, our data strongly suggest that genes are equally transcribed from all alleles, mutated and wild-type, and thus transcribed in proportion to their DNA allele frequency.

Of course, there are many caveats. One that is explicitly stated:

we found that different alignment algorithms introduced significant systematic biases in the determination of allele frequencies
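If you have matched DNA and RNA data, the kind of check described in that study can be approximated with a simple correlation of per-site variant allele frequencies. The sketch below is only that, a sketch: the function is a plain Pearson correlation written out by hand, the numbers are invented, and it is not the method used by Castle et al.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_x * var_y)


# Invented per-site allele frequencies, for illustration only.
dna_vaf = [0.48, 0.51, 0.25, 0.97, 0.50]
rna_vaf = [0.45, 0.55, 0.20, 0.99, 0.05]  # last site: alt allele barely expressed

print(pearson_r(dna_vaf, rna_vaf))
```

Sites that fall far off the diagonal, like the last one in the toy data, are the interesting ones: candidates for NMD, allele-specific expression, or an alignment artefact.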

2.9 years ago

Integrated analyses can give more insight, for example into "SNPs resulting from post-transcriptional modifications", as explained in this article:

SNPs resulting from post-transcriptional modifications, such as RNA editing, which may reveal potentially functional variation that would have otherwise been missed in genomic data

