Question: which kind of gene_biotype should we usually remove?
R10 wrote:

Hi

The RNA-seq data I work with come from ribosomal RNA-depleted libraries, meaning they contain ncRNAs, snRNAs, etc. in addition to mRNAs. To filter the Ensembl GTF file before counting, which kinds of gene_biotype should I remove?

High-abundance RNAs, including mt-RNA, rRNA, snRNA, snoRNA, tRNA, histone RNAs, ...?

pseudogene?

 

rna-seq • 2.9k views
ADD COMMENTlink modified 6.0 years ago by komal.rathi3.6k • written 6.0 years ago by R10
pld4.8k (United States) wrote:

I'd filter on length rather than biotype; you can't really detect things smaller than your read size. By definition this will (usually) exclude things like miRNAs, tRNAs, snoRNAs, etc. Other than that, you might not want to include pseudogenes.
 

ADD COMMENTlink written 6.0 years ago by pld4.8k
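
The length filter suggested above could be sketched roughly as follows. This is only an illustration under some assumptions: the GTF contains `gene` feature lines (some older Ensembl releases do not), the genomic span (end - start + 1) is used as a simple proxy for transcript length, and the function name `long_enough_gene_ids` is made up for this example. No specific cutoff is given in the answer, so the threshold is left as a parameter.

```python
import re

# Matches key "value" pairs in the GTF attribute column (column 9).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn a GTF attribute column into a dict of key -> value."""
    return dict(ATTR_RE.findall(field))


def long_enough_gene_ids(gtf_lines, min_length):
    """Return the gene_ids whose genomic span is at least min_length bp.

    Assumption: only lines with feature type 'gene' are considered, and
    the span end - start + 1 stands in for the transcript length.
    """
    keep = set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue
        start, end = int(cols[3]), int(cols[4])
        if end - start + 1 >= min_length:
            keep.add(parse_attributes(cols[8])["gene_id"])
    return keep
```

The function accepts any iterable of lines, so an open file handle for the annotation can be passed directly, and the returned set can then be used to subset a count table or the GTF itself.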

Thanks. Which length do you usually use as a cutoff, and based on which criterion? If I have 50 bp reads, should I use that as the cutoff?

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by R10

How long are your reads?

ADD REPLYlink written 6.0 years ago by pld4.8k

Single-end, all reads between 39 and 42 bp.

And how about histone RNAs?

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by R10

If you can capture them with your sequencing, why not detect them? I realize the analysis can become more complex if one expands beyond the typical pool of mRNA. However, the goal of RNA-Seq is to characterize the approximate fold change of the RNA species present in your biological source (cells, tissue, etc.).

I think narrowing the classes of RNA you are considering a priori is bad science. If you throw out a class of RNA you are effectively saying that the class has no biological role in what you are studying. There's no good reason that I can see for filtering biotypes for anything other than size.

ADD REPLYlink written 6.0 years ago by pld4.8k
komal.rathi3.6k (Children's Hospital of Philadelphia, Philadelphia, PA) wrote:

In the Ensembl gtf file, there are many types of genes:

3prime_overlapping_ncrna 
antisense 
IG_C_gene 
IG_D_gene 
IG_J_gene 
IG_LV_gene 
IG_V_gene 
IG_V_pseudogene 
lincRNA 
miRNA 
misc_RNA 
Mt_rRNA 
Mt_tRNA 
polymorphic_pseudogene 
processed_transcript 
protein_coding 
pseudogene 
rRNA 
sense_intronic 
sense_overlapping 
snoRNA 
snRNA 
TR_V_gene 
TR_V_pseudogene 

Out of these, we usually keep protein_coding & lincRNA because we are interested in identifying differentially expressed and novel lincRNAs. Once we also kept pseudogene, antisense & miRNA because our aim was to identify whether such genes are differentially expressed or not, and if that's the case then find whether they are near any of the differentially expressed protein-coding genes (to correlate whether a pseudogene, antisense or miRNA is regulating a protein-coding gene). So depending on what your aim is, you may filter out different gene types. We usually apply a secondary filter depending on the "expected" length of the gene (filtering out lincRNAs that are <200 bp long and so on).

ADD COMMENTlink written 6.0 years ago by komal.rathi3.6k
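
A biotype filter like the one described in this answer might look something like the sketch below. It assumes every record carries a `gene_biotype "..."` entry in the attribute column (in some Ensembl releases the biotype is stored elsewhere, so check your file first). The function name `filter_gtf_by_biotype` and the file names in the usage comment are hypothetical.

```python
import re

# Assumption: the biotype is stored as gene_biotype "..." in column 9.
BIOTYPE_RE = re.compile(r'gene_biotype "([^"]+)"')


def filter_gtf_by_biotype(gtf_lines, keep_biotypes):
    """Yield header lines plus records whose gene_biotype is in keep_biotypes."""
    for line in gtf_lines:
        if line.startswith("#"):
            yield line  # keep header comments unchanged
            continue
        match = BIOTYPE_RE.search(line)
        if match and match.group(1) in keep_biotypes:
            yield line


# Example usage (hypothetical file names):
# with open("annotation.gtf") as src, open("filtered.gtf", "w") as dst:
#     dst.writelines(filter_gtf_by_biotype(src, {"protein_coding", "lincRNA"}))
```

Passing the set of biotypes as an argument keeps the choice tied to the aim of the analysis, as the answer recommends, rather than hard-coding one list.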

Thank you very much, very helpful.

ADD REPLYlink written 6.0 years ago by R10

And how about histone RNAs?

ADD REPLYlink written 6.0 years ago by R10

Does your Ensembl GTF have a value like that in gene_biotype field? I have never come across it (or may have missed it).

ADD REPLYlink written 6.0 years ago by komal.rathi3.6k

No, not in the GTF file. I meant to remove those RNAs which come from histones. In my final differentially expressed genes, I have a lot of Hist3.., Hist4..., Hist1..., ....

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by R10
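
Since histone genes have no gene_biotype of their own, one way to look at them separately is to match on gene names. The sketch below is only an illustration: it assumes the names start with "Hist" followed by a digit, as in the examples quoted above, and other annotation versions may name these genes differently. The function name `split_histone_genes` is made up for this example. It separates the genes for inspection and does not decide whether they should be removed.

```python
import re

# Assumption: histone gene names look like Hist1..., Hist3..., HIST1...
HISTONE_RE = re.compile(r"^hist\d", re.IGNORECASE)


def split_histone_genes(gene_names):
    """Split gene names into (histone-like, everything else), keeping order."""
    histone, other = [], []
    for name in gene_names:
        if HISTONE_RE.match(name):
            histone.append(name)
        else:
            other.append(name)
    return histone, other
```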

It depends on what you are trying to achieve, what's your final goal?

ADD REPLYlink written 6.0 years ago by komal.rathi3.6k

I did not expect so many of them among the differentially expressed genes, so I thought maybe my normalization was not correct! That's the case. The RPKM calculation shows no change, but DESeq shows a 100-fold change!

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by R10

Depending on the cells you have and what you're studying, it might make sense that there is differential expression of histones. As always, qPCR is a great way to double-check.

ADD REPLYlink written 6.0 years ago by pld4.8k

Thanks, I will check them by qPCR.

ADD REPLYlink written 6.0 years ago by R10