Question: How to separate lncRNA and mRNA from RNA-Seq data for differential expression?
anilbioinfo1995 wrote, 12 weeks ago:

Hi

I have RNA-Seq data from TCGA, and I want to separate lncRNA and mRNA for differential expression analysis. Could you please guide me on how to do this?

Thank you


Please explain how you performed the analysis: alignment -> quantification (which annotation?) -> differential expression (which tool?). And what kind of file do you have now (paste sample output)?

The more information you provide, the more specific the response you will get.

written 12 weeks ago by EagleEye

I downloaded raw RNA-Seq data from cancer samples in FASTQ format and aligned it with TopHat. Now I am planning to do quantification and differential expression using Salmon and DESeq2, but I have to identify lncRNA and mRNA before differential expression.

written 12 weeks ago by anilbioinfo1995

A few remarks:

TopHat is no longer recommended for read alignment (and with Salmon you don't even need to do actual alignments anymore).

I'm puzzled why you are eager to filter gene types before doing the DEG analysis. Why not use all of them and filter after the DEG analysis?

written 12 weeks ago by lieven.sterck

Can you add a bit more detail on this? For example, what kind of data do you have? Which organism are you working with?

From 'raw' RNA-Seq data it will be impossible to filter out the reads that are derived from either type. Once you align them to a genome or transcriptome, you might be able to distinguish them.

written 12 weeks ago by lieven.sterck

ATpoint (Germany) wrote, 12 weeks ago:

Check the GTF (genome annotation file) that matches your analysis and pick out the genes that have a CDS (coding sequence) and/or are annotated as protein-coding. Once you have the gene names, you can scan your count matrix for the genes of interest, either before or after the differential expression analysis.

Edit: See Kevin's comment below for details.
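
Scanning the count matrix can be sketched roughly as follows. This assumes a tab-separated matrix with a header row and gene IDs in the first column, plus a plain text file with one gene ID per line; the file names, sample names, counts, and the EXAMPLE_PC_GENE.1 entry are all made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical inputs, written to a scratch directory for the demo.
workdir=$(mktemp -d)

# Count matrix: header row, gene IDs in column 1, one column per sample.
printf 'gene_id\tsample1\tsample2\n'      >  "$workdir/counts.tsv"
printf 'ENSG00000243485.5\t12\t30\n'      >> "$workdir/counts.tsv"
printf 'EXAMPLE_PC_GENE.1\t250\t190\n'    >> "$workdir/counts.tsv"

# Gene IDs of interest (here: the lincRNA gene), one per line.
printf 'ENSG00000243485.5\n' > "$workdir/lnc_ids.txt"

# Pass 1 (ID file): remember every ID.
# Pass 2 (matrix): keep the header and any row whose first column is a remembered ID.
awk -F'\t' 'NR == FNR { keep[$1]; next } FNR == 1 || ($1 in keep)' \
    "$workdir/lnc_ids.txt" "$workdir/counts.tsv" > "$workdir/lnc_counts.tsv"

cat "$workdir/lnc_counts.tsv"
```

The IDs have to match exactly, so if your matrix uses IDs without the version suffix (or gene names instead of IDs), the list needs to be in the same form.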

modified 12 weeks ago • written 12 weeks ago by ATpoint

Is there any tool to extract lncRNA and mRNA entries from a GTF? And how can I scan the count matrix using the gene names?

written 12 weeks ago by anilbioinfo1995

You can grep lines annotated as lncRNA/mRNA from your GTF file.
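
Before grepping, it can help to see which biotype labels your GTF actually contains. A minimal sketch, assuming a GENCODE-style GTF with a gene_type attribute; the small file built here and the EXAMPLE_PC_GENE.1 entry are made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical GENCODE-style excerpt (tab-separated, attributes in column 9).
gtf=$(mktemp)
printf 'chr1\tHAVANA\tgene\t29554\t31109\t.\t+\t.\tgene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "RP11-34P13.3";\n' >  "$gtf"
printf 'chr1\tHAVANA\texon\t29554\t30039\t.\t+\t.\tgene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "RP11-34P13.3";\n' >> "$gtf"
printf 'chr1\tHAVANA\tgene\t65419\t71585\t.\t+\t.\tgene_id "EXAMPLE_PC_GENE.1"; gene_type "protein_coding"; gene_name "ExamplePC";\n' >> "$gtf"

# Keep only "gene" features (so each gene is counted once),
# pull out the gene_type attribute, and tally each label.
biotype_counts=$(awk -F'\t' '$3 == "gene"' "$gtf" \
    | grep -o 'gene_type "[^"]*"' \
    | sort | uniq -c)

echo "$biotype_counts"
```

Restricting to gene lines avoids counting the same gene once per transcript and exon.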

written 12 weeks ago by genomax

Indeed, I was just doing that:

Obtain the relevant GTF from GENCODE (you will want the Comprehensive gene annotation) and then identify lincRNA and protein_coding mRNA via the gene_type / transcript_type tag in the GTF.

grep -e "lincRNA" /Kev/CollegeWork/ReferenceMaterial/GENCODE/GRCh38.p12/gencode.v28.annotation.gtf | head -5
chr1    HAVANA  gene    29554   31109   .   +   .   gene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; level 2; tag "ncRNA_host"; havana_gene "OTTHUMG00000000959.2";
chr1    HAVANA  transcript  29554   31097   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1    HAVANA  exon    29554   30039   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; exon_number 1; exon_id "ENSE00001947070.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1    HAVANA  exon    30564   30667   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; exon_number 2; exon_id "ENSE00001922571.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";
chr1    HAVANA  exon    30976   31097   .   +   .   gene_id "ENSG00000243485.5"; transcript_id "ENST00000473358.1"; gene_type "lincRNA"; gene_name "RP11-34P13.3"; transcript_type "lincRNA"; transcript_name "RP11-34P13.3-001"; exon_number 3; exon_id "ENSE00001827679.1"; level 2; transcript_support_level "5"; tag "not_best_in_genome_evidence"; tag "dotter_confirmed"; tag "basic"; havana_gene "OTTHUMG00000000959.2"; havana_transcript "OTTHUMT00000002840.1";

Take a look at all biotypes, here: https://www.gencodegenes.org/pages/biotypes.html
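
The grep above returns every gene, transcript, and exon line that mentions the biotype. To get one gene ID per gene instead, something along these lines may work. It is only a sketch: the ids_for_biotype helper, the small GTF, and the EXAMPLE_PC_GENE.1 entry are made up for illustration, and it assumes GENCODE-style gene_id and gene_type attributes.

```shell
#!/usr/bin/env bash
# Hypothetical GENCODE-style excerpt (tab-separated, attributes in column 9).
gtf=$(mktemp)
printf 'chr1\tHAVANA\tgene\t29554\t31109\t.\t+\t.\tgene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "RP11-34P13.3";\n' >  "$gtf"
printf 'chr1\tHAVANA\texon\t29554\t30039\t.\t+\t.\tgene_id "ENSG00000243485.5"; gene_type "lincRNA"; gene_name "RP11-34P13.3";\n' >> "$gtf"
printf 'chr1\tHAVANA\tgene\t65419\t71585\t.\t+\t.\tgene_id "EXAMPLE_PC_GENE.1"; gene_type "protein_coding"; gene_name "ExamplePC";\n' >> "$gtf"

# Hypothetical helper: print the gene_id of every "gene" feature in GTF $1
# whose gene_type is exactly the biotype given in $2.
ids_for_biotype() {
  awk -F'\t' -v want="$2" '
    $3 == "gene" && index($9, "gene_type \"" want "\";") > 0 {
      # The matched text is: gene_id "<ID>"  (9 leading characters, 1 closing quote).
      if (match($9, /gene_id "[^"]+"/)) {
        print substr($9, RSTART + 9, RLENGTH - 10)
      }
    }' "$1"
}

lnc_ids=$(ids_for_biotype "$gtf" lincRNA)
pc_ids=$(ids_for_biotype "$gtf" protein_coding)

echo "lincRNA genes:        $lnc_ids"
echo "protein_coding genes: $pc_ids"
```

Check which labels your own GTF uses first. As far as I know, more recent GENCODE releases group the long non-coding classes under a single lncRNA biotype, so the string to match may differ from the v28 file shown here.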

modified 12 weeks ago • written 12 weeks ago by Kevin Blighe

I recommend investing time in learning the Unix tools. They are powerful and versatile, and you will need them every day in your bioinformatics career.

written 12 weeks ago by ATpoint