Question: Gene Expression In Rnaseq Data
7.6 years ago by
Yahan370 wrote:

This question is rather basic.

When is a gene considered to be expressed in a comparative RNA-seq experiment?

I have RPKM values for each gene.

When this value is close to zero, is the gene considered to be expressed, or is there another explanation for a trace of the transcript?

gene rna • 5.3k views
written 7.6 years ago by Yahan370

It is not clear whether you are interested in differential analysis or simply in evidence of transcripts. For DE analysis you are better off using the raw counts, at least with DESeq or edgeR; these packages compute the normalization internally.

written 7.6 years ago by Michael Dondrup44k

I am interested in finding genes that are expressed in only one tissue and not in the others. So I would say it is differential, but in an absolute way: no up- or down-regulation.

written 7.6 years ago by Yahan370

The problem is that you cannot conclude that something is not expressed just because you have no reads.

written 7.6 years ago by Michael Dondrup44k
7.6 years ago by
Bergen, Norway
Michael Dondrup44k wrote:

Edit after noticing that this is mainly about differential RNA-seq analysis:

First and foremost, to assess significance you need biological replicates; only replicates give you an estimate of variance. This has been treated, for example, in this question:

Second, I would like to mention that you cannot prove absolutely that a gene is not expressed just because you haven't found evidence (a proof of non-existence is not feasible here).
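To make the point about missing reads concrete, here is a minimal sketch. It assumes, purely for illustration, that reads are sampled independently and that the read count for a transcript follows a Poisson distribution; real data are noisier, and the function name and numbers are hypothetical.

```python
import math

def prob_zero_reads(transcript_fraction, total_reads):
    """Probability of seeing no reads from a transcript.

    Assumes a simple Poisson sampling model (an illustrative
    simplification): the expected read count is the fraction of all
    sequenced fragments that come from this transcript times the
    total number of reads.
    """
    expected_reads = transcript_fraction * total_reads
    return math.exp(-expected_reads)

# Hypothetical example: a transcript contributing 1 in 10 million
# fragments, sequenced to a depth of 10 million reads.
p_missed = prob_zero_reads(transcript_fraction=1e-7, total_reads=10_000_000)
print(f"Chance of zero reads although the gene is transcribed: {p_missed:.2f}")
```

Under these assumptions a transcribed but rare gene is missed more than a third of the time, so a zero count on its own says little.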

For computing p-values of differential expression I recommend the R packages DESeq or edgeR. I have already explained some of this in this answer, which has links to other materials and papers:

However, it is definitely a problem if a gene has very few or zero counts in one or more groups; current methods might not be able to assign p-values properly, or at all, in these cases.

If I understand you correctly, you want to know if a very small number of reads (say at least one) in an RNA-seq experiment is evidence for the region being transcribed (not necessarily expressed).

Yes, every single sequence and its alignment is evidence in itself, provided the sequencer or protocol doesn't make up sequences! We have to agree on this point: the sequence doesn't lie, but of course there can be errors.

Of course you would like to have more evidence, so exons with very low coverage will have to be studied more closely.

Where could the reads come from?

  • They could originate from a duplicated, highly similar or repetitive region
  • They could be poor alignments of reads with many sequencing errors
  • The sequences could be contamination from vectors or adaptors

To show that your gene is transcribed you have to look at the individual alignments:

  • Filter out alignments with duplicate hits to the genome: do you still get coverage?
  • Look at the single alignments: how good are they, are there large indels?
  • Apply quality filtering (after removing duplicates, not before)
  • Look for protocol-specific contamination
  • Look at where in the gene the alignments are: are they all in one locus, or do they span exons/introns?
  • Re-align the reads against the genome using a more sensitive aligner (e.g. FASTA or SSEARCH). Do they still align to only a single position?
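The first few checks can be sketched in a few lines of Python. This is only an illustration on made-up records, not a replacement for inspecting real alignments: the `Alignment` fields, the function name and the mismatch threshold are all assumptions, and in practice the values would come from your aligner's output.

```python
from dataclasses import dataclass

@dataclass
class Alignment:
    read_id: str
    gene: str
    num_hits: int    # number of genomic positions the read aligns to (hypothetical field)
    mismatches: int  # mismatched bases in this alignment (hypothetical field)

def supporting_alignments(alignments, max_mismatches=2):
    """Keep alignments that are unique in the genome and of good quality.

    Reads with multiple hits are dropped first; the quality filter is
    applied afterwards, following the order suggested above.
    """
    unique = [a for a in alignments if a.num_hits == 1]
    return [a for a in unique if a.mismatches <= max_mismatches]

# Made-up alignments for one lowly covered gene.
candidates = [
    Alignment("r1", "geneA", num_hits=1, mismatches=0),  # unique, clean
    Alignment("r2", "geneA", num_hits=3, mismatches=0),  # repetitive region
    Alignment("r3", "geneA", num_hits=1, mismatches=5),  # poor alignment
]
kept = supporting_alignments(candidates)
print([a.read_id for a in kept])
```

If coverage of the gene disappears after this kind of filtering, the original reads were probably not good evidence of transcription.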

Hope this helps.

modified 7.6 years ago • written 7.6 years ago by Michael Dondrup44k

Thanks a lot for the extensive reply Michael.

You are right that absence of reads doesn't prove absence of transcription, especially for genes with low expression. On the other hand, it is at least an indication that this could be the case.

I will start with assembly on a partial reference genome and use coverage to find some genes of interest, which hopefully will be present.

After that I can apply the cross checks you mention here.

written 7.6 years ago by Yahan370
7.6 years ago by
Mmorine280 wrote:

If you have an RPKM close to zero, the simplest explanations are that the gene is unexpressed or that you haven't achieved sufficient sequencing depth to detect it. In the generation of RNA-seq data, there's a certain margin of error in both the base calling and the read mapping. With any read mapping software the user chooses the number of mismatched bases allowed in a mapped read. If you set this number too high you're likely to end up with a number of improperly aligned reads, which could lead to falsely detected low-expression genes (however, the default settings are usually low enough to avoid this). If you think this might be the case, you could remap your reads with a stricter mismatch threshold and see if this changes the results.

If you're able, you should also have a look at the raw read counts for each gene; since RPKM is a measure that is normalized for a) number of mapped reads in a given sample and b) length of the transcript, what you'll often find is that genes with RPKM close to zero have raw read counts that are a bit higher. I've used ERANGE in the past, which returns both RPKM and raw counts for each gene.
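As a rough illustration of how the two normalizations shrink the number, here is the standard RPKM definition (reads per kilobase of transcript per million mapped reads) as a small Python function. The function name and the example values are made up for the sketch.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    Equivalent to read_count / (length in kb) / (mapped reads in millions).
    """
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)

# Hypothetical gene: 5 reads on a 2 kb transcript in a sample
# with 25 million mapped reads.
value = rpkm(read_count=5, transcript_length_bp=2000, total_mapped_reads=25_000_000)
print(f"RPKM = {value:.2f}")
```

With these numbers an RPKM of about 0.1 still corresponds to five mapped reads, which is why it is worth looking at the raw counts before dismissing a gene.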

When it comes to biological interpretation these low expression genes are problematic. Take the example of a gene with an average of 2 reads mapped in 'control' samples and 4 reads mapped in 'treatment' samples. It's possible that differential expression analysis will yield a statistically significant result here, but the biological meaning is ambiguous.

modified 7.6 years ago • written 7.6 years ago by Mmorine280