Question: Inferring genotype based on RNA sequences
8 weeks ago by
omer.k20 wrote:

This is a bit of an I'm-not-sure question. I've extracted human whole RNA and performed a basic DNase I treatment and an integrity check. Reverse transcription was primed with a gene-specific primer, followed by RNase H treatment and then PCR. The cDNA amplicon was sequenced with an ONT MinION.

I'm running the alignment data through the SAMtools/BCFtools pipeline for SNV calling.

samtools mpileup -g -f genomes/some_mRNA_seq.fna alignments/sim_reads_aligned.sorted.bam > variants/sim_variants.bcf
bcftools call -c -v variants/sim_variants.bcf > variants/sim_variants.vcf

There are data supporting heterozygosity. The question is: assuming transcription occurred from both chromosomes, can I truly infer the genotype from RNA reads alone, meaning the homo-/heterozygosity of the SNV suggested by the output in the VCF file? My logic tells me it's not so trivial and perhaps even wrong. I've tried searching the literature but have not found any well-established data.

I'd be happy to learn more, so it would be great if answers were backed by research already done in this context.

snp rna-seq • 213 views
modified 4 weeks ago by igor6.5k • written 8 weeks ago by omer.k20

In general, transcription may not involve both copies (alleles) of a gene. The two alleles may be expressed at different ratios, one of them may be imprinted, or expression may be too low to detect under the measured conditions. Whatever information comes from RNA-seq is limited to the copies that are transcriptionally active and measurable in the experiment.
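To make the point about unequal allelic ratios concrete, here is a minimal sketch of a naive genotype call from read counts at a single site. It is not what BCFtools does internally; it assumes independent reads, a single symmetric per-base error rate (the default of 0.05 is a made-up placeholder, not a measured MinION figure), and equal expression of both alleles at a heterozygous site. The function names are hypothetical.

```python
import math


def genotype_log_likelihoods(ref_count, alt_count, error_rate=0.05):
    """Log-likelihood of the observed allele counts under three diploid genotypes.

    Assumptions (illustrative only): reads are independent, errors are
    symmetric with probability `error_rate`, and a heterozygous site
    expresses both alleles equally (ALT fraction 0.5).
    """
    # Probability that a single read shows the ALT allele under each genotype.
    p_alt = {"hom_ref": error_rate, "het": 0.5, "hom_alt": 1.0 - error_rate}
    # The binomial coefficient is identical for all genotypes, so it is omitted.
    return {
        gt: alt_count * math.log(p) + ref_count * math.log(1.0 - p)
        for gt, p in p_alt.items()
    }


def best_genotype(ref_count, alt_count, error_rate=0.05):
    """Return the genotype label with the highest log-likelihood."""
    lls = genotype_log_likelihoods(ref_count, alt_count, error_rate)
    return max(lls, key=lls.get)


# Balanced expression: 52 REF reads, 48 ALT reads.
print(best_genotype(52, 48))
# Strongly skewed expression: 95 REF reads, 5 ALT reads.
print(best_genotype(95, 5))
```

Under these assumptions the first site is called heterozygous and the second homozygous reference. If the second site were truly heterozygous in the DNA but one allele were expressed at a much lower level, the RNA-based call would simply be wrong, which is the limitation described above.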

modified 8 weeks ago • written 8 weeks ago by cpad01128.3k
8 weeks ago by
Kevin Blighe26k
USA / Europe / Brazil
Kevin Blighe26k wrote:

The argument for variant calling from RNA-seq data usually rests on the fact that it can be cost-effective and remove the need to do both DNA- and RNA-seq (and use up both genomic DNA and cDNA in the process). When I think about 'cost effectiveness' in broader terms, I realise that it invariably equates to lower 'quality' and lower sensitivity or specificity (or both).

Whilst one can very easily call variants from RNA-seq data, one misses the following types of variants:

  • variants in alleles that are not expressed (obvious)
  • certain types of regulatory variants (obvious)
  • variants that cause the transcript to undergo nonsense-mediated decay (NMD)
  • variants that result in haploinsufficiency
  • splicing variants

...and I'm sure that there are many more types that are missed.

Broad Institute have a 'best practices' pipeline for RNA-seq variant calling on their website, but it has never been published anywhere, to my knowledge. When posted, it had also been tested on just a single sample (their words). I take issue with this because many look up to the Broad as a reputable organisation. When they see the Broad publishing methods, they logically assume that the method must be okay to use and may not understand its limitations unless these are clearly stated up front. They do state some limitations lower down, which raises concern:

Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline.

The situation appears even more alarming when one reads anecdotal and published evidence from people who have compared RNA-seq variant calls to whole exome seq (WES) variant calls. Scattered across the WWW, I've seen that RNA-seq variant calling can only detect between ~30% and ~70% of the variant calls that WES detects, and I assume that these people filtered the WES data to include only exonic variants in their comparisons.
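For anyone wanting to run that kind of comparison on their own paired data, here is a minimal sketch of how the overlap could be tallied. It assumes both call sets have already been reduced to (chrom, pos, ref, alt) tuples on the same reference and restricted to the same exonic regions; the function names and the toy variants are made up for illustration.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalise one variant into a hashable key."""
    return (chrom, int(pos), ref.upper(), alt.upper())


def compare_call_sets(rna_calls, wes_calls):
    """Count shared and method-specific variants between two call sets."""
    rna = {variant_key(*v) for v in rna_calls}
    wes = {variant_key(*v) for v in wes_calls}
    shared = rna & wes
    return {
        "shared": len(shared),
        "wes_only": len(wes - rna),
        "rna_only": len(rna - wes),
        # Fraction of WES calls that the RNA-seq calls also contain.
        "fraction_of_wes_recovered": len(shared) / len(wes) if wes else float("nan"),
    }


# Toy, made-up calls (not real data).
wes_calls = [
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 75, "G", "A"),
    ("chr3", 10, "T", "C"),
]
rna_calls = [
    ("chr1", 100, "A", "G"),
    ("chr2", 75, "G", "A"),
    ("chr2", 900, "A", "C"),
]

summary = compare_call_sets(rna_calls, wes_calls)
print(summary)
```

Exact tuple matching is a deliberate simplification: indels and multi-allelic sites would need normalising first, and the WES-only calls would still have to be split into "not expressed" versus "missed by the caller".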

I'm sorry but I do not and will never recommend variant calling from RNA-seq data based on current procedures. If you do it, you have to be absolutely sure of the limitations. A recent trend in research appears to be toward cost saving in various ways, but this will gradually bring more 'noise' into our data and result in larger problems further down the line.

Others likely have other opinions. Some who defend variant calling from RNA-seq data may be those who have already performed it and published data from it.


modified 8 weeks ago • written 8 weeks ago by Kevin Blighe26k

Some time ago I did a comparison of WES and RNA-Seq variant calling in a setting with more samples (92) but different software (GATK vs. Samtools) for the calling (which might introduce additional bias), and I can confirm that the overlap between both methods is rather low. You may want to have a look at the poster we presented (see Figure 8).

However, I personally think that mono-allelic expression (or an exon not being expressed) and NMD are valuable information you receive from RNA-Seq. If a mutation is not expressed and is never manifested in the proteome, this mutation might actually be irrelevant for the phenotype (e.g., cancer). In our study, nearly 40% of WES-specific mutations fell into this "likely-irrelevant" category. So, in terms of quality, I believe a combination of both technologies is most valuable!

written 8 weeks ago by Manuel Landesfeind1.1k

Thanks for adding your comments, Manuel Landesfeind. Agreed, it is great to have both DNA- and RNA-seq.

written 8 weeks ago by Kevin Blighe26k
4 weeks ago by
United States
igor6.5k wrote:

There has been a lot of debate about RNA- and DNA-seq variant calling, but there are surprisingly few proper benchmarks. If you are interested in the variant frequency, there is a study by Castle et al. where they specifically check that:

We found that 99% of the DNA mutations in expressed genes are expressed as RNA. Moreover, we found a high correlation between the DNA and RNA mutation allele frequency. Exceptions are mutations that cause premature termination codons and therefore activate nonsense-mediated decay. Beyond this, we did not find evidence of any wide-scale mechanism, such as allele-specific epigenetic silencing, preferentially promoting mutated or wild-type alleles. In conclusion, our data strongly suggest that genes are equally transcribed from all alleles, mutated and wild-type, and thus transcribed in proportion to their DNA allele frequency.

Of course, there are many caveats. One that is explicitly stated:

we found that different alignment algorithms introduced significant systematic biases in the determination of allele frequencies
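If you have matched DNA and RNA read counts at the same sites, a rough way to look at this yourself is to compute the variant allele fraction in each and correlate them. The sketch below is only that: the helper names and all counts are made up, and it is not the method used by Castle et al.

```python
import math


def allele_fraction(ref_count, alt_count):
    """Fraction of reads supporting the ALT allele at one site."""
    depth = ref_count + alt_count
    if depth == 0:
        raise ValueError("site has no coverage")
    return alt_count / depth


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up (ref, alt) read counts at the same four sites.
dna_counts = [(50, 50), (80, 20), (10, 90), (60, 40)]
# The last site shows no ALT reads in RNA, as might happen with NMD.
rna_counts = [(45, 55), (170, 30), (5, 95), (100, 0)]

dna_vaf = [allele_fraction(r, a) for r, a in dna_counts]
rna_vaf = [allele_fraction(r, a) for r, a in rna_counts]
print(pearson(dna_vaf, rna_vaf))
```

Sites with low RNA depth give noisy fractions, so in practice you would want a minimum-coverage filter before correlating, and you would want to repeat the comparison with more than one aligner given the caveat quoted above.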

modified 4 weeks ago • written 4 weeks ago by igor6.5k