I have RNA-seq data and need to identify novel lncRNAs and perform DGE analysis. I used the following approach to find novel lncRNAs:
Alignment through hisat2, assembly through stringtie, merging all assemblies through stringtie merge, classification of merged transcripts through gffcompare (assigning class codes), identifying novel transcripts (class code u).
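For readers following along, the last step of that approach (keeping only transcripts with class code u) might look like the minimal sketch below. It assumes the gffcompare annotated GTF carries a `class_code` attribute on its transcript lines. The function name and the toy lines are made up for illustration and are not taken from the poster's data.

```python
import io
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def novel_transcript_ids(gtf_lines, wanted_code="u"):
    """Collect transcript_id values whose class_code equals wanted_code."""
    ids = []
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "transcript":
            continue  # only transcript records are inspected here
        attrs = dict(ATTR_RE.findall(fields[8]))
        if attrs.get("class_code") == wanted_code and "transcript_id" in attrs:
            ids.append(attrs["transcript_id"])
    return ids

# Toy stand-in for an annotated GTF; IDs and coordinates are invented.
toy_gtf = io.StringIO(
    '1\tStringTie\ttranscript\t100\t900\t.\t+\t.\t'
    'transcript_id "MSTRG.1.1"; gene_id "MSTRG.1"; class_code "=";\n'
    '1\tStringTie\ttranscript\t5000\t5800\t.\t-\t.\t'
    'transcript_id "MSTRG.2.1"; gene_id "MSTRG.2"; class_code "u";\n'
    '1\tStringTie\texon\t5000\t5800\t.\t-\t.\t'
    'transcript_id "MSTRG.2.1"; gene_id "MSTRG.2";\n'
)

novel = novel_transcript_ids(toy_gtf)
print(novel)
```

Note that class code u only marks a transcript as intergenic relative to the reference; it does not by itself establish that the transcript is a lncRNA.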
But I don't understand how to get lncRNA expression for each sample, because I identified the novel transcripts from the merged file. I have also performed DGE analysis for mRNA genes with DESeq2, but I am confused about which file I need to prepare first to get lncRNA counts for each sample before moving to DESeq2. Please guide.
Thanks in advance!
How is mRNA DE analysis any different from lncRNA DE analysis if you're comparing isoform-level metrics?
My aim is to identify lncRNA differential expression from RNA-seq data. Up to the stringtie step, I have individual assembly files. If I use these assembly files for DESeq2, it is the same as analysing mRNA expression. After stringtie, do I need to identify ncRNA for each sample individually and then move to DESeq2? I am stuck at this step.
I'm not familiar with these tools and I don't understand the premise. Are you not identifying features per sample already? How does it matter if a feature is mRNA or lncRNA if you can compare their levels among samples? Or is the feature annotation process dependent on the merge process (which wouldn't make sense)?
What organism is this for? If you are working with a model/well studied organism then you could simply use the locations of known lncRNA.
Kindly give your valuable suggestions on the following strategy:
Is this an appropriate approach to perform lncRNA expression analysis?
It is not at all clear to me that filtering that early is advisable. Removing most of your RNA counts is going to alter normalization and dispersion estimates, and probably not in a good way. I'd keep all of the gene counts for all the genes, process data for all the genes, and then at the end, if you only care about lncRNA, filter away the results you don't care about.
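As a rough illustration of the "filter at the end" idea: run the DE analysis on all genes, export the full results table, and only then subset to the lncRNA IDs. The sketch below is a minimal version of that last step. It assumes the results were exported to CSV with a `gene_id` column (a hypothetical column name) alongside DESeq2's `log2FoldChange` and `padj` columns. The toy rows and IDs are invented.

```python
import csv
import io

def filter_results(results_csv, keep_ids):
    """Return rows of a DE results table whose gene_id is in keep_ids."""
    reader = csv.DictReader(results_csv)
    return [row for row in reader if row["gene_id"] in keep_ids]

# Toy stand-in for a results table exported after running DE on ALL genes.
# IDs and numbers are invented for illustration only.
toy_results = io.StringIO(
    "gene_id,log2FoldChange,padj\n"
    "geneA,1.8,0.001\n"
    "lncB,-2.1,0.004\n"
    "geneC,0.2,0.70\n"
    "lncD,0.1,0.90\n"
)
lnc_ids = {"lncB", "lncD"}  # hypothetical lncRNA gene IDs

# Subset AFTER the statistics were computed on the full gene set.
lnc_rows = filter_results(toy_results, lnc_ids)
significant = [r["gene_id"] for r in lnc_rows if float(r["padj"]) < 0.05]
print(significant)
```

Because normalization and dispersion were estimated on the full count matrix, the subset only changes which rows you look at, not how they were computed.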
Thanks swbarnes2, I will follow your valuable suggestion
Listen to swbarnes2 and rethink your entire approach.
Thanks GenoMax!
It's Arabidopsis thaliana.
By locations, do you mean chromosomal coordinates?
lncRNAs for Arabidopsis are annotated: https://rnacentral.org/search?q=Arabidopsis%20thaliana%20AND%20so_rna_type_name:%22LncRNA%22
They should be included in the GTF file you probably have. They are in Ensembl GTF.
Yes GenoMax, I used the annotated GTF while constructing the assembly with stringtie.
Arabidopsis is a thoroughly studied organism; I don't think you can make a better assembly with stringtie. Just use the genomic coords that are already documented in the GTF.
I have modified the approach as follows:
Now the question is how I can find the expression of lncRNAs in my RNA-seq samples.
What is your featureCounts command? You must be using a GTF file somewhere, dig into it.
Thanks Ram, I am using the following command:
Look into AT10.58.gtf for references to lncRNA.
Do you mean that I should grep lncRNA from the reference GTF and use that?
There should be gene_ID/ID counts that are labeled as lncRNA. An example from Ensembl's GTF (lncRNA is under biotype).
Thanks GenoMax! Above command filters exactly what you presented.
Thanks a lot!
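For anyone landing on this thread later, pulling the lncRNA gene IDs out of a reference GTF by biotype could be sketched as follows. This assumes an Ensembl-style GTF where gene records carry a `gene_biotype "lncRNA"` attribute. The attribute key is a parameter because other annotation sources may name it differently. The function name and the toy records are invented for illustration.

```python
import io
import re

# Matches GTF attributes of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')

def lncrna_gene_ids(gtf_lines, biotype_key="gene_biotype", biotype="lncRNA"):
    """Collect gene_id values of gene records with the requested biotype."""
    ids = set()
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # look at gene-level records only
        attrs = dict(ATTR_RE.findall(fields[8]))
        if attrs.get(biotype_key) == biotype and "gene_id" in attrs:
            ids.add(attrs["gene_id"])
    return ids

# Toy stand-in for a reference GTF; IDs and coordinates are invented.
toy_gtf = io.StringIO(
    '#!toy header\n'
    '1\ttoy\tgene\t100\t900\t.\t+\t.\t'
    'gene_id "GENE1"; gene_biotype "protein_coding";\n'
    '1\ttoy\tgene\t2000\t2600\t.\t-\t.\t'
    'gene_id "GENE2"; gene_biotype "lncRNA";\n'
    '1\ttoy\texon\t2000\t2600\t.\t-\t.\t'
    'gene_id "GENE2"; transcript_id "GENE2.1"; gene_biotype "lncRNA";\n'
)

lnc_ids = lncrna_gene_ids(toy_gtf)
print(sorted(lnc_ids))
```

The resulting ID set can then be used to subset the full DE results at the end, rather than to trim the count matrix before running DESeq2.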