Question: identifying the expressed gene number in each timepoint
3 months ago by mxlsherry1992 wrote:

Dear all, we have RNA-seq data from 7 timepoints for a species that has a reference genome. I found that some papers report the number of expressed genes at each timepoint, but they use different methods. 1. Some apply a cutoff to the raw reads: "an ad hoc cutoff for detectable expression was set at >=2 reads per transcript. Using this cutoff, 11187 gene transcripts could be detected in the RNA-Seq data set." 2. Others work on normalized values: "we counted the number of clean reads aligned to litchi gene sequences and performed normalization using the RPKM method. After lowly expressed genes (< 5 RPKM) were filtered, we identified 17572 genes in all samples."

It seems that one method applies a cutoff to the read counts directly, while the other works on normalized data. So my first question is: which one should I use? (My species is channel catfish.) My second question: both methods remove the lowly expressed genes, right? But for counting the expressed genes at different timepoints, I think there is no need to filter out the lowly expressed genes, since these genes are also expressed.

I would appreciate it if you could help me solve this problem.

Thank you!!

rna-seq • 152 views
modified 9 weeks ago by dsull • written 3 months ago by mxlsherry1992

It depends on what you want to do. I do not think that RNA-seq alone allows for a confident statement about which genes are truly and reliably expressed, especially when it comes to drawing a border between lowly expressed genes and transcriptional "background noise". If your aim is a statistical (DEG) analysis, I would focus on those genes that have sufficient statistical power to come out as significantly different at the given replicate number and read depth. This can be done with the filterByExpr function from edgeR. Note that longer genes will be favored, though, since no length correction is performed by default.
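To make the idea of an expression filter concrete, here is a minimal Python sketch of a CPM-based rule of the kind such filters apply: keep a gene if its counts-per-million reach a cutoff in at least as many samples as the smallest group has replicates. It is only loosely modelled on that idea and is not a port of edgeR's filterByExpr. The function name `keep_genes`, the `min_count=10` value, the gene names, counts and library sizes are all assumptions made up for illustration.

```python
from statistics import median

def keep_genes(counts, lib_sizes, min_group_size, min_count=10):
    """Return genes whose CPM reaches the cutoff in >= min_group_size samples.

    The CPM cutoff is the CPM that `min_count` reads would correspond to
    in a library of median size (a simplifying assumption of this sketch).
    """
    cpm_cutoff = min_count / median(lib_sizes) * 1e6
    kept = []
    for gene, row in counts.items():
        cpms = [c / n * 1e6 for c, n in zip(row, lib_sizes)]
        n_passing = sum(v >= cpm_cutoff for v in cpms)
        if n_passing >= min_group_size:
            kept.append(gene)
    return kept

# Hypothetical toy data: 4 samples, 2 replicates per timepoint.
lib_sizes = [10_000_000, 12_000_000, 8_000_000, 10_000_000]
counts = {
    "geneA": [50, 60, 40, 55],  # clearly above the cutoff everywhere
    "geneB": [2, 3, 1, 0],      # below the cutoff everywhere
    "geneC": [15, 14, 1, 2],    # above the cutoff in one timepoint only
    "geneD": [12, 0, 0, 0],     # above the cutoff in a single sample
}

print(keep_genes(counts, lib_sizes, min_group_size=2))
```

With these toy numbers the cutoff works out to about 1 CPM, so geneA and geneC are kept. Because the rule is based on counts rather than length-normalized values, a long gene reaches the cutoff more easily than a short gene with the same per-base coverage.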

Personal opinion in this context: In the end, I think that most high-throughput experiments are only truly informative in a comparative analysis where one has sufficient replicates, the replicates have been treated and processed identically, and the analysis is statistically valid. Without that, any changes one might see can be due to different read lengths, library prep technologies, signal/noise ratio (= data quality), in silico processing, etc. It comes down to this: NGS experiments are better suited to relative (comparative) statements than to absolute ones such as, here, "exactly this number of genes is expressed in this sample".

modified 9 weeks ago • written 9 weeks ago by ATpoint
9 weeks ago by dsull (UCLA) wrote:
  1. No, do not use read counts directly (read counts depend on sequencing depth, among other things, so they tell you very little about "gene expression" on their own)

  2. The FPKM < 5 threshold is somewhat arbitrary but isn't disastrous in practice. See discussion about such cutoffs here: https://liorpachter.wordpress.com/2014/04/30/estimating-number-of-transcripts-from-rna-seq-measurements-and-why-i-believe-in-paywall/ As RNA-seq gives relative transcript abundances, I don't really know of any really good way to use the data to determine the "number of expressed genes".
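A small Python sketch may help illustrate both points. It computes RPKM as reads × 10^9 / (gene length in bp × total mapped reads) and counts how many genes pass a raw-read cutoff versus an RPKM cutoff when the same hypothetical library is sequenced at two depths. The gene names, lengths, counts and library sizes are invented for illustration, and the cutoffs (>= 2 reads, >= 5 RPKM) are simply the ones quoted in the question.

```python
def rpkm(counts, lengths_bp, library_size):
    """RPKM = reads * 1e9 / (gene length in bp * total mapped reads)."""
    return {g: c * 1e9 / (lengths_bp[g] * library_size) for g, c in counts.items()}

def n_expressed(values, cutoff):
    """Number of genes whose value is at or above the cutoff."""
    return sum(1 for v in values.values() if v >= cutoff)

# Hypothetical gene lengths in base pairs.
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 500, "geneD": 4000}

# The same toy sample at 20M reads and at 10M reads (all counts halved).
deep_counts = {"geneA": 400, "geneB": 4, "geneC": 2, "geneD": 800}
shallow_counts = {"geneA": 200, "geneB": 2, "geneC": 1, "geneD": 400}

deep_rpkm = rpkm(deep_counts, lengths_bp, library_size=20_000_000)
shallow_rpkm = rpkm(shallow_counts, lengths_bp, library_size=10_000_000)

# Raw-read cutoff (>= 2 reads): the answer changes with sequencing depth.
print(n_expressed(deep_counts, 2), n_expressed(shallow_counts, 2))  # 4 3

# RPKM cutoff (>= 5): the answer is the same at both depths.
print(n_expressed(deep_rpkm, 5), n_expressed(shallow_rpkm, 5))      # 2 2
```

The raw cutoff calls geneC "expressed" only in the deeper library, although the underlying sample is identical. The RPKM cutoff is stable across depths in this idealized example, but where to put it remains a matter of judgment.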
