Question: gene_Id in ENCODE gene expression table
fusion.slope (220) wrote, 23 months ago:

Hello,

I would like to use the expression values of some genes from the ENCODE project.

I have a table, but I notice that the gene names are just numbers. Does anyone know which format this is?

Here is an example: https://ibb.co/gcefj9

Does anyone know the name of this format, so that I can convert it to a gene ID?

Thanks in advance!

Tags: conversion, encode, gene
modified 3 months ago by haskankaya (50) • written 23 months ago by fusion.slope (220)

HGNC ID, perhaps? Do you know what kinds of genes you are looking at? For example: https://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=21175

written 22 months ago by Alex Reynolds (30k)

I have a list of genes. I will take some genes whose names I know, look up their gene IDs on the website you suggested, and check whether they match. Then I will use http://biodb.jp/ to convert. Thanks for the info.

written 22 months ago by fusion.slope (220)

The length and effective length numbers are too small for these to be full genes. It is hard to say what those numeric gene IDs are. Where did you get the file from? Do you have a link?

modified 22 months ago • written 22 months ago by genomax (87k)

For example, the tsv file here:

https://www.encodeproject.org/experiments/ENCSR000CPH/

Click on "File details".

modified 22 months ago • written 22 months ago by fusion.slope (220)

This is what the explanation legend says:

Estimated expression levels from RSEM as a tsv file. The columns are as follows:

column 1: gene_id - gene name of the gene the transcript belongs to (parent gene). If no gene information is provided, gene_id and transcript_id is the same.
column 2: transcript_id(s) - transcript name of this transcript
column 3: length - the transcript's sequence length (poly(A) tail is not counted)
column 4: effective_length - the length containing only the positions that can generate a valid fragment
column 5: expected_count - the sum of the posterior probability of each read comes from this transcript over all reads

truncated for brevity.
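
The columns listed in this legend could be read along the following lines. This is only a minimal sketch: it assumes the file starts with a header row that uses exactly the column names from the legend, the helper name `read_rsem_table` is hypothetical, and the sample rows are made up for illustration rather than taken from an ENCODE file.

```python
import csv
import io

# Column names taken from the legend; that the file has a header row with
# exactly these names is an assumption.
NUMERIC_COLUMNS = ("length", "effective_length", "expected_count")


def read_rsem_table(handle):
    """Yield one dict per row, with the numeric columns converted to float."""
    for row in csv.DictReader(handle, delimiter="\t"):
        for name in NUMERIC_COLUMNS:
            row[name] = float(row[name])
        yield row


# Made-up rows for illustration only (placeholder IDs, not real values).
sample = io.StringIO(
    "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\n"
    "10904\t10904\t93.00\t0.00\t0.00\n"
    "ENSG_EXAMPLE\tENST_EXAMPLE\t2206.00\t2000.00\t150.00\n"
)
rows = list(read_rsem_table(sample))
for row in rows:
    # Per the legend, gene_id equals transcript_id when no gene information
    # was provided for the feature.
    same = row["gene_id"] == row["transcript_id(s)"]
    print(row["gene_id"], row["length"], "gene_id == transcript_id:", same)
```

Comparing the first two columns like this is a quick way to spot the rows for which no gene information was available.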

modified 22 months ago • written 22 months ago by genomax (87k)

Thanks a lot, I think Alex already answered my question :) But to confirm, I should check the RSEM output to see which gene_id reference they use.

written 22 months ago by fusion.slope (220)

I don't think those are HGNC IDs. They are features that did not have a gene name. From the legend:

"If no gene information is provided, gene_id and transcript_id is the same."

Further down in the file you have normal gene identifiers.

modified 22 months ago • written 22 months ago by genomax (87k)

Oh, thanks a lot. I scrolled through the file a bit but did not go far enough down to see the Ensembl gene annotation! Much appreciated, genomax!

written 22 months ago by fusion.slope (220)

See: How to add images to a Biostars post

written 22 months ago by RamRS (28k)

haskankaya (50), London, UK, wrote 3 months ago:

I have just spent quite a while trying to figure this out and finally solved the mystery: these odd lines refer to tRNAs and pseudo_tRNAs.

In the descriptions of analysis on the ENCODE website, there is no mention of any such features. I decided to look at the files that ENCODE's pipelines use as input to RSEM to figure out what they were. In the metadata table associated with the files, mine say they used annotation 'M4'. I went to ENCODE's 'Reference Sequences' page and took a look at this M4 annotation, but found that every feature in the file was of the format ENSMUSG....

It was only when I started digging through random annotation files on the ENCODE portal, such as this example, that I found the association between these values and the tRNAs.

For example, this is a snippet from the above-linked file:

10000   Pseudo_tRNA
10001   Pseudo_tRNA
10002   Pseudo_tRNA
10003   Pseudo_tRNA
10004   Pseudo_tRNA
10005   Pseudo_tRNA
10006   Ala_tRNA
10007   Pseudo_tRNA
10008   Lys_tRNA
10009   Pseudo_tRNA
10027   Ser_tRNA
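
A lookup built from lines like the ones in the snippet above might look like the following sketch. It assumes each line holds a numeric ID and a feature type separated by whitespace; the names `parse_id_map` and `describe` are hypothetical helpers, not part of any ENCODE tool.

```python
# A few lines copied from the snippet above: numeric ID, whitespace, type.
snippet = """\
10000   Pseudo_tRNA
10006   Ala_tRNA
10008   Lys_tRNA
10027   Ser_tRNA
"""


def parse_id_map(text):
    """Return {numeric_id: feature_type} from whitespace-separated lines."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:  # skip blank or malformed lines
            mapping[fields[0]] = fields[1]
    return mapping


trna_types = parse_id_map(snippet)


def describe(gene_id):
    # IDs missing from the map are reported as unknown rather than guessed.
    return trna_types.get(gene_id, "unknown")


print(describe("10006"))
print(describe("ENSMUSG_EXAMPLE"))
```

The IDs are kept as strings so that they can be compared directly with the gene_id column of the expression table.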

I'm not entirely sure why these features are included in the output files. I suspect that it may be a mistake (if it's not, the analysis descriptions should be made clearer).

So for most analyses where you don't care about tRNAs, I reckon you can just delete the lines. Hope this answer saves some time for future explorers.
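
Deleting those lines could be sketched roughly as follows. This assumes a tab-separated table with a header row containing a gene_id column, and that the unwanted rows are exactly those whose gene_id consists only of digits; `drop_numeric_ids` is a hypothetical helper and the sample rows are made up.

```python
import csv
import io


def drop_numeric_ids(in_handle, out_handle):
    """Copy a tab-separated table, skipping rows whose gene_id is all digits.

    Returns the number of rows that were dropped.
    """
    reader = csv.DictReader(in_handle, delimiter="\t")
    writer = csv.DictWriter(out_handle, fieldnames=reader.fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    dropped = 0
    for row in reader:
        if row["gene_id"].isdigit():  # assumption: numeric ID means tRNA row
            dropped += 1
            continue
        writer.writerow(row)
    return dropped


# Made-up rows for illustration only (placeholder IDs, not real values).
table = io.StringIO(
    "gene_id\ttranscript_id(s)\texpected_count\n"
    "10000\t10000\t0.00\n"
    "ENSMUSG_EXAMPLE\tENSMUST_EXAMPLE\t12.00\n"
)
filtered = io.StringIO()
dropped = drop_numeric_ids(table, filtered)
print(dropped, "row(s) dropped")
print(filtered.getvalue(), end="")
```

With real files, the two `io.StringIO` objects would be replaced by file handles opened with `newline=""`, as the `csv` module documentation recommends.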

written 3 months ago by haskankaya (50)