I am a biologist and a novice in the analysis of NGS data. I have a set of six transcriptomes. I want to find the expression of the coding genes as well as lncRNAs in each set and then compare them to find co-expressed clusters. For that I need the FPKM of coding genes as well as lncRNAs. I have experience using TopHat + Cufflinks for de novo and RABT assembly and finding the FPKM of coding genes. But how do I annotate the lncRNAs? In TopHat + Cufflinks mapping and assembly, the genes are assembled based on the supplied GTF file, while novel cases such as novel genes or isoforms of existing genes are found based on novel junctions. Will the lncRNA coordinates also be present in the GTF file?
As long as you provide a GTF file that contains both coding genes and lncRNAs, you should be fine. The GENCODE annotation contains both. The latest release for GRCh37 is GENCODE 19: http://www.gencodegenes.org/releases/19.html
There are newer versions for GRCh38 if you are using that reference in your analysis.
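To check that a given GTF really carries both classes before running Cufflinks, a rough sketch like the following may help. It assumes the file uses a GENCODE-style `gene_type "..."` attribute in the ninth column. The set of labels treated as lncRNA (`LNCRNA_TYPES`) is an assumption, since biotype names differ between releases, so adjust it to whatever your file contains. The function names are made up for illustration.

```python
import re

# Assumption: labels that count as lncRNA. Biotype names differ between
# annotation releases, so edit this set to match your own GTF.
LNCRNA_TYPES = {"lincRNA", "antisense", "lncRNA"}

# Matches gene_type "xyz" but not transcript_type "xyz".
GENE_TYPE_RE = re.compile(r'gene_type "([^"]+)"')


def gene_type(gtf_line):
    """Return the gene_type attribute of one GTF record, or None if absent."""
    if gtf_line.startswith("#"):
        return None
    fields = gtf_line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None
    match = GENE_TYPE_RE.search(fields[8])
    return match.group(1) if match else None


def classify(gtf_line, lnc_types=LNCRNA_TYPES):
    """Label a GTF record as 'coding', 'lncRNA' or 'other'."""
    biotype = gene_type(gtf_line)
    if biotype == "protein_coding":
        return "coding"
    if biotype in lnc_types:
        return "lncRNA"
    return "other"


def count_classes(lines):
    """Count records per class, skipping header/comment lines."""
    counts = {"coding": 0, "lncRNA": 0, "other": 0}
    for line in lines:
        if line.startswith("#"):
            continue
        counts[classify(line)] += 1
    return counts
```

To use it on a real file, pass an open file handle, e.g. `with open("annotation.gtf") as fh: print(count_classes(fh))`, where `annotation.gtf` is a placeholder path. If the lncRNA count comes back as zero, either the annotation lacks lncRNAs or the label set needs adjusting.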
Annotating lncRNAs is still a hard task, mainly because lncRNAs are not conserved in sequence the way protein-coding genes are, where orthologous genes can be found across different species. In addition, the functions of many lncRNAs remain unknown. That is, lncRNAs that share the same function do not necessarily share conserved sequences.
I would say that the most you can do is to try to identify putative lncRNAs using approaches already described here on Biostars, such as discarding coding genes and examining the coding potential of the remaining transcripts with web services.
Please look into this R Package:
This package is useful for finding co-expressed genes in the form of modules or clusters.
Here are some links to papers that use this package for a similar purpose.
All the Best!!
As Antonio R. Franco said, if you want to annotate coding and long non-coding RNAs together using transcriptome data, in my opinion you can proceed as follows:
1. Use Cuffcompare to identify the transcripts that overlap exon regions of the database annotation, and discard them.
2. Discard any transcript shorter than 200 bp (by definition, a lncRNA is longer than 200 bp).
3. Transcriptome assembly produces a large number of low-expression, low-confidence single-exon transcripts, so keep only transcripts with two or more exons. Of course, you can set an appropriate threshold based on your study.
4. Calculate the expression of each transcript with Cuffquant and keep transcripts with FPKM >= 0.5 (or another threshold).
5. Coding potential is the critical criterion for deciding whether a transcript is a lncRNA. Run several popular coding-potential tools, such as CPC, CNCI, a search against the Pfam database, and PhyloCSF (or others), and take the intersection of their results as the predicted lncRNAs.
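The filtering logic described above can be sketched roughly as follows. This is only an illustration of how the thresholds combine, not the output format of any of the tools named. The `Transcript` record, its field names, and `is_candidate_lncrna` are hypothetical, and it assumes you have already parsed the Cuffcompare overlap result, the FPKM values, and each coding-potential tool's call into one record per transcript.

```python
from dataclasses import dataclass, field


@dataclass
class Transcript:
    """Hypothetical per-transcript summary collected from upstream tools."""
    tid: str
    length: int                    # transcript length in bp
    n_exons: int
    fpkm: float
    overlaps_annotated_exon: bool  # e.g. derived from a Cuffcompare run
    # Tool name -> True if that tool calls the transcript non-coding.
    noncoding_calls: dict = field(default_factory=dict)


def is_candidate_lncrna(t, min_length=200, min_exons=2, min_fpkm=0.5):
    """Apply the five filters in order; thresholds are adjustable."""
    if t.overlaps_annotated_exon:   # step 1: drop known annotated exons
        return False
    if t.length < min_length:       # step 2: drop short transcripts
        return False
    if t.n_exons < min_exons:       # step 3: drop single-exon transcripts
        return False
    if t.fpkm < min_fpkm:           # step 4: drop low-expression transcripts
        return False
    # Step 5: intersection, so every tool consulted must say "non-coding".
    # A transcript with no calls at all is not accepted.
    return bool(t.noncoding_calls) and all(t.noncoding_calls.values())


def select_lncrnas(transcripts, **thresholds):
    """Return the IDs of transcripts that pass all filters."""
    return [t.tid for t in transcripts if is_candidate_lncrna(t, **thresholds)]
```

Requiring every tool to agree is the strict reading of "intersection". If that leaves too few candidates in your data, a majority vote would be a looser alternative, at the cost of more false positives.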