How to annotate coding and long non-coding RNA together using transcriptome data? My idea is to find lncRNAs co-expressed with my target gene
8.6 years ago
mjoyraj ▴ 80

I am a biologist and a novice in the analysis of NGS data. I have a set of six transcriptomes. I want to measure the expression of the coding genes as well as the lncRNAs in each set and then compare them to find co-expressed clusters. For that I need the FPKM of both coding genes and lncRNAs. I have experience using TopHat + Cufflinks for de novo and RABT assembly and for computing the FPKM of coding genes. But how do I annotate the lncRNAs? In TopHat + Cufflinks mapping and assembly, genes are assembled based on the supplied GTF file, while novel genes or novel isoforms of existing genes are found from novel junctions. Will the lncRNA coordinates also be present in the GTF file?

RNA-Seq Assembly next-gen • 7.2k views
ADD COMMENT
8.6 years ago
DG 7.3k

As long as you provide a GTF file that contains both coding genes and lncRNAs you should be fine. The GENCODE annotations contain both. The latest release for GRCh37 is GENCODE 19: http://www.gencodegenes.org/releases/19.html

There are newer versions for GRCh38 if you are using that reference in your analysis.
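A quick way to confirm that a given GTF really carries both kinds of genes is to count the biotype attribute on its gene records. The following is only a minimal sketch: it assumes the file has `gene` feature rows and uses either the `gene_type` (GENCODE) or `gene_biotype` (Ensembl) attribute, and the toy lines are made up for illustration. Biotype labels differ between releases, so treat the label names as assumptions too.

```python
import re
from collections import Counter

# GENCODE GTFs label the biotype "gene_type"; Ensembl GTFs use "gene_biotype".
BIOTYPE_RE = re.compile(r'\b(?:gene_type|gene_biotype) "([^"]+)"')

def count_gene_biotypes(gtf_lines):
    """Count biotypes over the 'gene' features of a GTF (any iterable of lines)."""
    counts = Counter()
    for line in gtf_lines:
        if line.startswith("#"):          # header / comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue                      # skip malformed rows and non-gene features
        match = BIOTYPE_RE.search(fields[8])
        counts[match.group(1) if match else "unknown"] += 1
    return counts

# Toy input, invented for this example (not real annotation).
toy_gtf = [
    "##description: toy example",
    'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\ttoy\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_type "protein_coding";',
    'chr1\ttoy\tgene\t2000\t2600\t.\t-\t.\tgene_id "G2"; gene_type "lincRNA";',
]
biotype_counts = count_gene_biotypes(toy_gtf)
print(biotype_counts)
```

For a real file you would pass an open file handle instead of the toy list. If the counts show no non-coding biotypes at all, the GTF will not give you lncRNA FPKM values from a reference-guided run.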

ADD COMMENT

My transcriptome is not from human. It is from the avian species Taeniopygia guttata (zebra finch). I guess GENCODE only contains data for human and mouse. How can I find the coordinates of lncRNAs in GTF format for this species? If they are not available, I think I will have to adopt a de novo assembly approach. Do you have any suggestions for that?

ADD REPLY

There may be similar projects or data out there for your species of interest. You would need to check the various genomics resources, or ask people doing genomics on finch, to see if that is the case. If nothing is known about lncRNA in your species, then your transcripts would need to be annotated by homology searching against the closest relative with data on lncRNA.

For de novo assembly, Trinity is quite popular, and there is also a newer program called Sailfish that is supposed to be interesting for isoform abundance. How either deals with ncRNA, though, I am not sure. You would need to investigate to see what they are doing. lncRNAs should definitely fall out of a Trinity assembly since they are long enough.

ADD REPLY

Thanks for your suggestion. I searched the literature and found that nothing is known about the lncRNAs of my species of interest or of any close relative.

So my plan is to predict the putative lncRNAs. I will use the Cufflinks RABT assembly approach to assemble the known as well as novel transcripts. Then I will check whether the coordinates of each assembled transcript fall in an exonic, intronic or intergenic region of the reference genome. Transcripts falling in intronic and intergenic regions may be putative lncRNAs. Next, I will examine the coding potential of the predicted lncRNAs. Do you think this approach is okay for predicting lncRNAs?
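The exonic / intronic / intergenic check could look roughly like the sketch below. It assumes the assembled transcripts and the reference genes have already been parsed into plain dictionaries (the field names `chrom`, `start`, `end` and `exons` are invented here), uses 1-based inclusive coordinates as in GTF, and ignores strand, so a transcript antisense to a coding exon would be labelled exonic.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two closed intervals [a_start, a_end] and [b_start, b_end] overlap."""
    return a_start <= b_end and b_start <= a_end

def classify_transcript(tx, genes):
    """Label a transcript 'exonic', 'intronic' or 'intergenic'.

    tx:    {"chrom": str, "exons": [(start, end), ...]}
    genes: [{"chrom": str, "start": int, "end": int, "exons": [(start, end), ...]}, ...]
    Strand is ignored in this sketch.
    """
    tx_start = min(start for start, _ in tx["exons"])
    tx_end = max(end for _, end in tx["exons"])
    inside_gene_span = False
    for gene in genes:
        if gene["chrom"] != tx["chrom"]:
            continue
        if not overlaps(tx_start, tx_end, gene["start"], gene["end"]):
            continue
        inside_gene_span = True
        for ex_start, ex_end in tx["exons"]:
            for g_start, g_end in gene["exons"]:
                if overlaps(ex_start, ex_end, g_start, g_end):
                    return "exonic"
    # Touches a gene span but none of its exons: treat as intronic.
    return "intronic" if inside_gene_span else "intergenic"

# Toy reference gene and assembled transcripts (made-up coordinates).
genes = [{"chrom": "chr1", "start": 100, "end": 1000,
          "exons": [(100, 200), (800, 1000)]}]
tx_exonic = {"chrom": "chr1", "exons": [(150, 250)]}
tx_intronic = {"chrom": "chr1", "exons": [(300, 400), (500, 600)]}
tx_intergenic = {"chrom": "chr1", "exons": [(2000, 2300)]}

for tx in (tx_exonic, tx_intronic, tx_intergenic):
    print(classify_transcript(tx, genes))
```

The nested loops are fine for a toy case; for a whole genome an interval index or a tool such as Cuffcompare would be the more practical route.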

ADD REPLY

Seems reasonable to me. I'm sure there are papers out there from groups doing similar things (predicting lncRNA). I would read through that literature as well to see what approaches and software people are typically using.

ADD REPLY

Thanks, I have already read some of the literature and found that this is the usual approach, although we may miss some lncRNAs because some are antisense to coding sequences.

ADD REPLY
8.6 years ago

Annotating lncRNAs is still a hard task, mainly because lncRNAs do not show the same kind of homology as protein-coding genes, for which orthologous genes can be found across different species. In addition, the functions of many lncRNAs remain unknown. That is, lncRNAs sharing the same function do not necessarily share conserved sequences.

I would say that the most you can do is to try to identify putative lncRNAs using approaches already described here on Biostars, such as discarding coding genes and examining the coding potential with web services.

ADD COMMENT

Can you elaborate a bit more on how to discard the coding genes? Is using the 'mask' option in Cufflinks okay for discarding the coding genes?

What do you mean by "examinate the coding potential with WEB services"?

ADD REPLY

There is a published pipeline for discarding coding genes and finding putative lncRNAs:

1. First look for coding genes using BlastX, and discard them.

2. Discard any transcript shorter than 200 bp (by definition, a lncRNA is longer than 200 nt).

3. To access the Coding Potential Calculator and further information, look here in this web service.
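To make the length step concrete, here is a small sketch that reads transcript sequences from FASTA lines and keeps those of at least 200 nt. It also applies a crude longest-ORF cutoff as a stand-in for a real coding-potential tool. That ORF filter is not what the Coding Potential Calculator does, and the 300 nt cutoff, the function names and the toy sequences are all assumptions made for illustration. Only complete ATG-to-stop ORFs on the forward strand are considered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)

def longest_orf_nt(seq):
    """Length in nt of the longest complete ATG..stop ORF in the three forward frames."""
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, i + 3 - start)
                start = None
    return best

def filter_candidates(fasta_lines, min_len=200, max_orf_nt=300):
    """Keep names of transcripts that are long enough and lack a long ORF.

    min_len follows the 200 nt lncRNA definition; max_orf_nt is an assumed cutoff.
    """
    kept = []
    for name, seq in read_fasta(fasta_lines):
        if len(seq) < min_len:
            continue
        if longest_orf_nt(seq) > max_orf_nt:
            continue
        kept.append(name)
    return kept

# Toy sequences, invented for this example.
toy_fasta = [
    ">t_short too short", "ACGT" * 10,
    ">t_noncoding", "ACGT" * 60,
    ">t_coding long ORF", "ATG" + "GCC" * 120 + "TAA",
]
candidates = filter_candidates(toy_fasta)
print(candidates)
```

Transcripts that survive this kind of pre-filter would still need to go through BlastX and a proper coding-potential tool as described in the steps above.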

ADD REPLY

Dear Antonio R. Franco, I want to detect lncRNAs in some human (control and treatment) RNA-seq data (FASTA format). I detected genes and computed differential expression with the CLC Genomics software, but I don't know how to detect lncRNAs with CLC Genomics. Two articles use a de novo assembly pathway and a discovery detection pathway. I tried them separately, but I do not know how to use their results. I see that you suggested these pathways. Should I use BlastX on the genes that I detect from the RNA-seq analysis? Could you explain more? Can lncRNAs be found from the same RNA-seq data that I use to detect genes, or should I use only RNA-seq data of non-coding RNA?

Your attention would be really appreciated.

ADD REPLY
8.6 years ago
vibes1002003 ▴ 30

Please look into this R package: https://cran.r-project.org/web/packages/WGCNA/index.html

This package is useful for finding co-expressed genes in the form of modules or clusters. Here is a link to some papers that use this package for a similar purpose: http://labs.genetics.ucla.edu/horvath/CoexpressionNetwork/
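For the narrower goal stated in the question (lncRNAs co-expressed with one target gene), a plain correlation screen is sometimes enough as a first look. The sketch below is not WGCNA. It simply computes the Pearson correlation of log2(FPKM + 1) profiles against the target across samples. The gene names, FPKM values and the 0.9 cutoff are invented for illustration, and with only six samples any correlation should be read with caution.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def coexpressed_with(target, fpkm, min_abs_r=0.9):
    """Return (name, r) for features whose log2(FPKM + 1) profile tracks the target.

    fpkm: {feature_name: [FPKM per sample, ...]} with the same sample order everywhere.
    """
    target_profile = [math.log2(v + 1) for v in fpkm[target]]
    hits = []
    for name, values in fpkm.items():
        if name == target:
            continue
        r = pearson(target_profile, [math.log2(v + 1) for v in values])
        if abs(r) >= min_abs_r:
            hits.append((name, r))
    return sorted(hits, key=lambda hit: -abs(hit[1]))

# Toy FPKM table for six samples (made-up numbers).
toy_fpkm = {
    "target_gene": [1, 2, 4, 8, 16, 32],
    "lnc_up":      [2, 4, 8, 16, 32, 64],
    "lnc_flat":    [5, 5, 5, 5, 5, 5],
    "lnc_down":    [32, 16, 8, 4, 2, 1],
}
hits = coexpressed_with("target_gene", toy_fpkm)
print([name for name, _ in hits])
```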

All the Best!

ADD COMMENT

For co-expression analysis I have my own R script, which is quite similar to WGCNA. The priority is to know the FPKM of coding and non-coding RNAs, which can be obtained only through mapping and assembly. So my question is about a suitable strategy for mapping and assembly.

ADD REPLY
7.6 years ago
Calamy ▴ 10

As Antonio R. Franco said, if you want to annotate coding and long non-coding RNA together using transcriptome data, in my opinion you can proceed as follows:

1. Use Cuffcompare to find the transcripts that overlap annotated exon regions in the database, and discard them.

2. Discard any transcript shorter than 200 bp (by definition, a lncRNA is longer than 200 nt).

3. Transcriptome assembly produces a large number of low-expression, low-confidence single-exon transcripts, so select transcripts with >= 2 exons. Of course, you can set an appropriate threshold based on your study.

4. Calculate the expression of each transcript with Cuffquant and select transcripts with FPKM >= 0.5 (or another threshold).

5. Coding potential is the critical criterion for deciding whether a transcript is a lncRNA, so run several popular coding-potential tools, such as CPC, CNCI, a search against the Pfam database and PhyloCSF (or others), and take the intersection of their results as the predicted lncRNAs.
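Put together, the filters listed in this answer amount to a chain of simple tests followed by a set intersection. The sketch below assumes the per-transcript facts (length, exon count, FPKM, overlap with known exons) and the non-coding calls from each tool have already been collected into Python dictionaries. The field names and toy values are hypothetical, and none of the tools is actually run here.

```python
def select_lncrna_candidates(transcripts, noncoding_calls,
                             min_len=200, min_exons=2, min_fpkm=0.5):
    """Apply the filtering steps and return the sorted candidate transcript ids.

    transcripts:     {tx_id: {"length": int, "n_exons": int, "fpkm": float,
                              "overlaps_known_exon": bool}}
    noncoding_calls: {tool_name: iterable of tx_ids that the tool called non-coding}
    """
    if not noncoding_calls:
        raise ValueError("need at least one coding-potential result set")
    # A transcript must be called non-coding by every tool (intersection).
    agreed = set.intersection(*(set(ids) for ids in noncoding_calls.values()))
    selected = []
    for tx_id, tx in transcripts.items():
        if tx["overlaps_known_exon"]:      # overlaps annotated exons
            continue
        if tx["length"] < min_len:         # shorter than the length cutoff
            continue
        if tx["n_exons"] < min_exons:      # single-exon, low-confidence models
            continue
        if tx["fpkm"] < min_fpkm:          # too weakly expressed
            continue
        if tx_id not in agreed:            # at least one tool says coding
            continue
        selected.append(tx_id)
    return sorted(selected)

# Toy inputs, invented for this example.
toy_transcripts = {
    "T1": {"length": 850, "n_exons": 3, "fpkm": 4.2, "overlaps_known_exon": False},
    "T2": {"length": 900, "n_exons": 2, "fpkm": 3.0, "overlaps_known_exon": True},
    "T3": {"length": 150, "n_exons": 2, "fpkm": 2.0, "overlaps_known_exon": False},
    "T4": {"length": 600, "n_exons": 1, "fpkm": 5.0, "overlaps_known_exon": False},
    "T5": {"length": 700, "n_exons": 2, "fpkm": 0.1, "overlaps_known_exon": False},
    "T6": {"length": 700, "n_exons": 2, "fpkm": 2.5, "overlaps_known_exon": False},
}
toy_calls = {
    "CPC": {"T1", "T3", "T4", "T5", "T6"},
    "CNCI": {"T1", "T2", "T4", "T5"},
}
lncrna_candidates = select_lncrna_candidates(toy_transcripts, toy_calls)
print(lncrna_candidates)
```

Taking the intersection is the conservative choice: it lowers false positives at the cost of dropping transcripts on which the tools disagree.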

ADD COMMENT

Hello! When I use CNCI, I get a "CNCI.index" file, but it contains only transcript_id, without gene_name. So my question is: how can I get the corresponding gene_name? Your attention would be really appreciated!

ADD REPLY
