Question

There is no gene information in RSEM output

6.2 years ago

John ▴ 270

Hello scientists,

I ran RSEM to calculate gene and isoform expression levels.

Code to prepare the reference:

rsem-prepare-reference --gtf mm9.gtf  --transcript-to-gene-map knownIsoforms.txt  --bowtie2 mm9.fa musmus

I downloaded the FASTA file from: http://hgdownload.soe.ucsc.edu/goldenPath/mm9/chromosomes/
knownIsoforms.txt from: http://hgdownload.soe.ucsc.edu/goldenPath/mm9/database/
GTF file from the UCSC Table Browser.

Code to calculate expression:

rsem-calculate-expression  --paired-end --bowtie2  forward_1.fastq reverse_2.fastq ref/musmus  cellnumber1

Results:

I got the following output in the cellnumber1.genes.results file:

gene_id transcript_id(s)    length  effective_length    expected_count  TPM FPKM
1   uc007aet.1,uc007aeu.1   3621.00 3338.70 0.00    0.00    0.00
10  uc011whv.1  26.00   0.00    0.00    0.00    0.00
100 uc007amd.1,uc007ame.1   4355.00 4072.70 1.80    0.32    0.17
1000    uc007dac.1  1403.00 1120.70 0.00    0.00    0.00
10000   uc008ajp.1,uc012ajs.1   1415.50 1133.20 0.00    0.00    0.00
10001   uc008ajq.1  2046.00 1763.70 0.00    0.00    0.00
10002   uc008ajr.1,uc008ajs.1,uc008ajt.1,uc008aju.1,uc012ajt.1  6290.60 6008.64 0.00    0.00    0.00

I don't see any gene names in the gene_id column; it shows only numbers, and I don't know why. Is this output correct? How do I get gene information? (In some tutorials the output looks different from this.)

Thanks in advance! Please help!

rsem RNA-Seq rna-seq • 2.8k views

ADD COMMENT • link updated 6.2 years ago by Sean Davis 26k • written 6.2 years ago by John ▴ 270

But you have transcript IDs, right? For example, uc007aet.1 and uc008ajq.1. Those are knownGene identifiers, corresponding to the knownGene transcriptome you downloaded.

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

Yes, WouterDeCoster, but I want to do differential expression analysis, so I want gene names. I could map the IDs to gene names (using some tool or the UCSC Table Browser), but a single line contains multiple transcript IDs separated by commas, and I don't know how to handle that.

P.S. I have 70 sequences. If this does not work, I will have to redo it with an Ensembl reference. Please help me.

Thanks for your response.

ADD REPLY • link 6.2 years ago by John ▴ 270

score 0 · Answer 1 · 2018-02-20

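I'm not sure what you are aiming for here, but you have a file with gene_id in the first column. That gene_id looks like an Entrez Gene ID, though since the gene map you supplied was knownIsoforms.txt, it may instead be the UCSC cluster ID from that file's first column, so check before mapping. You can perform your differential expression analysis and, at whatever point is convenient, map that ID to the gene symbol (for mouse, the MGI symbol; HGNC symbols are for human genes). There are many resources for doing so.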
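
As a rough illustration of the mapping step, here is a minimal Python sketch that attaches symbols through the transcript IDs rather than through gene_id, so it does not depend on what kind of identifier gene_id turns out to be. It assumes you have already prepared a two-column, tab-separated lookup file (transcript ID, gene symbol), for instance from a UCSC cross-reference table; that file, its layout, and the function names below are assumptions for illustration, not part of RSEM.

```python
import csv


def load_symbol_map(path):
    """Read a two-column TSV (transcript ID, gene symbol) into a dict.

    The lookup file is hypothetical: you would build it yourself.
    """
    mapping = {}
    with open(path, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2 and row[0] and row[1]:
                mapping[row[0]] = row[1]
    return mapping


def symbols_for(transcript_field, symbol_map):
    """Return sorted unique symbols for a comma-separated transcript list."""
    ids = [t.strip() for t in transcript_field.split(",") if t.strip()]
    return sorted({symbol_map[t] for t in ids if t in symbol_map})


def annotate(results_path, symbol_map, out_path):
    """Copy a genes.results table, appending a gene_symbol column."""
    with open(results_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        fields = list(reader.fieldnames) + ["gene_symbol"]
        writer = csv.DictWriter(dst, fieldnames=fields, delimiter="\t",
                                lineterminator="\n")
        writer.writeheader()
        for row in reader:
            hits = symbols_for(row["transcript_id(s)"], symbol_map)
            # Several symbols are joined with ";"; unmapped rows get "NA".
            row["gene_symbol"] = ";".join(hits) if hits else "NA"
            writer.writerow(row)
```

Usage would look like `annotate("cellnumber1.genes.results", load_symbol_map("tx2symbol.tsv"), "cellnumber1.annotated.tsv")`, where `tx2symbol.tsv` is the hypothetical lookup file. Transcripts of one gene normally share a symbol, so the comma-separated list usually collapses to a single name.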

Your column with comma-separated transcripts comes about because genes often have multiple transcripts. For the purposes of differential gene expression analysis, you can probably just ignore that detail and focus on the gene_id.
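
To make the "focus on the gene_id" suggestion concrete, here is a small Python sketch that collects the expected_count column from several genes.results files into one gene-by-sample table keyed on gene_id, ignoring the transcript column entirely. The function names and the sample-name-to-path dictionary are assumptions for illustration; only the column names come from the output shown in the question.

```python
import csv


def read_expected_counts(path):
    """Return {gene_id: expected_count} for one genes.results file."""
    counts = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            counts[row["gene_id"]] = float(row["expected_count"])
    return counts


def build_matrix(sample_paths):
    """Combine per-sample counts into a gene-by-sample matrix.

    sample_paths maps a sample name to its genes.results path.
    Returns (gene_ids, samples, rows), where rows[i][j] is the count
    of gene_ids[i] in samples[j]; genes absent from a file get 0.0.
    """
    samples = sorted(sample_paths)
    per_sample = {s: read_expected_counts(sample_paths[s]) for s in samples}
    gene_ids = sorted(set().union(*per_sample.values()))
    rows = [[per_sample[s].get(g, 0.0) for s in samples] for g in gene_ids]
    return gene_ids, samples, rows


def write_matrix(gene_ids, samples, rows, out_path):
    """Write the matrix as a TSV with gene_id in the first column."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["gene_id"] + samples)
        for gene, values in zip(gene_ids, rows):
            writer.writerow([gene] + values)
```

For example, `build_matrix({"cellnumber1": "cellnumber1.genes.results"})` would return the table for a single sample, and adding the other result files to the dictionary extends it column by column. Expected counts are fractional, so whether they need rounding depends on the differential expression tool you feed the matrix to.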