Question: RNA-Seq mRNA/Gene Count Data Filtering
8.4 years ago by
Sudeep1.6k wrote:

Hi all,

I have a question about RNA-Seq data filtering. Once the sample mRNAs have been mapped to known mRNAs and filtered, is it a good idea to remove mRNA or gene hits that occur only rarely across the samples? I read about this in an edgeR tutorial, where genes that do not reach at least 1 read per million (RPM) in at least two samples are removed. If this is a good idea, what are the common methods for doing it?

thanks in advance

sequencing filter next-gen rna data • 3.9k views
modified 8.1 years ago by Duff660 • written 8.4 years ago by Sudeep1.6k
8.4 years ago by
Barcelona, Spain.
Philippe1.9k wrote:


I am generally not at ease with filtering, especially with RNA-Seq data. Filtering makes some sense for microarray data, since non-expressed genes also produce a low-intensity signal (even though there is no gold-standard filtering method).

For RNA-Seq there is no such drawback, and the presence of at least one unambiguously mapped read on a gene should normally be evidence of transcription. Filtering out very lowly transcribed genes means assuming that those genes are not functional but rather transcriptional noise. That might be true in some cases (maybe most of them) but there is, to my knowledge, no clear evidence for it.

My suggestion would rather be to check whether the results of your analyses are biased by such genes, by dividing your initial gene set into several bins of expression. If the bins containing lowly expressed genes show a pattern similar to the bins containing genes with intermediate or high expression, this shows that rare transcripts do not influence the results of your analysis. If, on the contrary, differences are observed, it is more difficult to draw conclusions, since the differences could be explained by several biological or technical factors that cannot necessarily be distinguished.
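The binning check described above could be sketched roughly as follows. This is a minimal illustration in plain Python, not part of edgeR or of the original answer: the function names, the choice of three equal-sized bins, the use of a mean as the per-bin summary, and the toy values are all assumptions made for the example.

```python
from statistics import mean


def bin_by_expression(mean_expr, n_bins=3):
    """Split gene IDs into n_bins groups of near-equal size,
    ordered from lowest to highest mean expression."""
    ranked = sorted(mean_expr, key=mean_expr.get)
    size, extra = divmod(len(ranked), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        # The first `extra` bins take one additional gene each.
        end = start + size + (1 if i < extra else 0)
        bins.append(ranked[start:end])
        start = end
    return bins


def summarize_by_bin(bins, stat):
    """Average a per-gene statistic (e.g. a log fold change) within each bin."""
    return [mean(stat[g] for g in b) if b else None for b in bins]


# Hypothetical toy data: mean normalized expression and one result per gene.
mean_expr = {"g1": 0.2, "g2": 0.5, "g3": 12.0, "g4": 15.0, "g5": 300.0, "g6": 450.0}
log_fc = {"g1": 2.0, "g2": -1.0, "g3": 0.5, "g4": 0.7, "g5": 0.4, "g6": 0.6}

bins = bin_by_expression(mean_expr, n_bins=3)
per_bin = summarize_by_bin(bins, log_fc)
for label, genes, value in zip(("low", "mid", "high"), bins, per_bin):
    print(label, genes, value)
```

If the low-expression bin behaves like the others for the statistic of interest, rare transcripts are probably not driving the result. In practice one would likely compare whole distributions per bin rather than a single mean.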

modified 8.4 years ago • written 8.4 years ago by Philippe1.9k

Thank you, this is a good suggestion.

written 8.4 years ago by Sudeep1.6k
8.1 years ago by
United Kingdom
Duff660 wrote:

Hi Sudeep

I think the edgeR authors recommend filtering the data so that genes with less than one count across half the samples are removed, because such genes cannot achieve statistical significance. In the first vignette example in the edgeR documentation they say:

"We will filter out very lowly expressed tags. Those which have fewer than 5 counts in total cannot possibly achieve statistical significance for DE, so we filter out these tags."
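As a rough illustration of the rule in that quote, a total-count filter might look like the sketch below. It is written in plain Python rather than R, so it is not the vignette's own code; the function name, the dictionary layout (gene ID mapped to a list of per-sample counts), and the toy counts are assumptions for the example.

```python
def filter_by_total_count(counts, min_total=5):
    """Keep genes whose reads, summed over all samples, reach min_total.

    `counts` maps a gene ID to a list of raw read counts, one per sample.
    """
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}


# Hypothetical toy count table with four samples.
counts = {
    "geneA": [0, 1, 0, 2],     # total 3: removed
    "geneB": [10, 12, 9, 14],  # total 45: kept
    "geneC": [0, 0, 5, 0],     # total 5: kept (not fewer than 5)
}

kept = filter_by_total_count(counts)
print(sorted(kept))
```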

So, if you are using edgeR, why keep genes that are expressed at low levels (and may be expressed in some samples but not in others under the same conditions) and that cannot give you any information about regulation? These just become statistical noise. The biology may still be relevant, as Philippe points out; it is just that you cannot say anything statistically about regulation (with edgeR, anyway).
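The per-sample rule mentioned in the question (at least 1 read per million in at least two samples) could be sketched like this. Again this is plain Python for illustration only, not edgeR's API: the function names, the `>=` comparison, the use of column sums as library sizes, and the toy counts are assumptions.

```python
def counts_per_million(counts):
    """Convert raw counts to counts per million, using each sample's
    column sum as its library size."""
    if not counts:
        return {}
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [c * 1e6 / lib_sizes[j] if lib_sizes[j] else 0.0
               for j, c in enumerate(row)]
        for gene, row in counts.items()
    }


def filter_by_cpm(counts, min_cpm=1.0, min_samples=2):
    """Keep genes that reach min_cpm in at least min_samples samples."""
    cpm = counts_per_million(counts)
    return {
        gene: counts[gene]
        for gene, row in cpm.items()
        if sum(value >= min_cpm for value in row) >= min_samples
    }


# Hypothetical toy table: three samples, each with a library size of 2,000,000.
counts = {
    "rare": [1, 0, 0],                    # 0.5 CPM in one sample only: removed
    "low": [3, 2, 0],                     # 1.5 and 1.0 CPM in two samples: kept
    "bulk": [1999996, 1999998, 2000000],  # stands in for all other genes
}

kept = filter_by_cpm(counts)
print(list(kept))
```

Working in counts per million rather than raw counts makes the threshold comparable between samples sequenced to different depths; `min_samples` would typically be chosen with the number of replicates per condition in mind.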



written 8.1 years ago by Duff660

Thank you, this is what I ended up doing.

written 8.1 years ago by Sudeep1.6k