Question

gene_Id in ENCODE gene expression table

0

5.6 years ago

fusion.slope ▴ 250

Hello,

I would like to use the expression value of some genes in the ENCODE project.

I have a table, but I notice that the gene names are just numbers. Does anyone know which format this is?

Here is an example: https://ibb.co/gcefj9

Does anyone know the name of this format so that I can convert to geneId?

Thanks in advance!

ENCODE Gene Conversion • 4.4k views

ADD COMMENT • link updated 4.0 years ago by haskankaya ▴ 80 • written 5.6 years ago by fusion.slope ▴ 250

1

HGNC ID, perhaps? Do you know what kinds of genes you are looking at? For example: https://www.genenames.org/cgi-bin/gene_symbol_report?hgnc_id=21175

ADD REPLY • link 5.6 years ago by Alex Reynolds 35k

0

I have a list of genes. I will take some genes whose names I know and check on the website you suggested whether their gene IDs match. Then I will use http://biodb.jp/ to convert. Thanks for the info.

ADD REPLY • link 5.6 years ago by fusion.slope ▴ 250

0

The length and effective length numbers are too small for these to be full genes. It is hard to say what those numeric gene IDs are. Where did you get the file from? Do you have a link?

ADD REPLY • link 5.6 years ago by GenoMax 142k

0

For example, the tsv file here:

https://www.encodeproject.org/experiments/ENCSR000CPH/

Click on the file details.

ADD REPLY • link 5.6 years ago by fusion.slope ▴ 250

1

This is what the explanation legend says:

Estimated expression levels from RSEM as a tsv file. The columns are as follows:

column 1: gene_id - gene name of the gene the transcript belongs to (parent gene). If no gene information is provided, gene_id and transcript_id is the same.
column 2: transcript_id(s) - transcript name of this transcript
column 3: length - the transcript's sequence length (poly(A) tail is not counted)
column 4: effective_length - the length containing only the positions that can generate a valid fragment
column 5: expected_count - the sum of the posterior probability of each read comes from this transcript over all reads

truncated for brevity.
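
As a rough illustration of the layout described in the legend, here is a minimal Python sketch that reads such a tsv and flags rows where gene_id and transcript_id(s) are identical. The sample rows and the helper name `read_rsem_rows` are made up for this example, and the header names are assumed to match the legend; check them against your own file.

```python
import csv
import io

# Hypothetical two-row sample in the layout described by the legend.
# The identifiers and values are placeholders, not taken from an ENCODE file.
SAMPLE_TSV = (
    "gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\n"
    "10000\t10000\t73.00\t0.00\t0.00\n"
    "ENSMUSG00000000001.4\tENSMUST00000000001.4\t3262.00\t3063.00\t1500.00\n"
)


def read_rsem_rows(handle):
    """Yield one dict per row, converting the numeric columns to float."""
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        for column in ("length", "effective_length", "expected_count"):
            row[column] = float(row[column])
        yield row


rows = list(read_rsem_rows(io.StringIO(SAMPLE_TSV)))
for row in rows:
    # Per the legend, gene_id equals transcript_id(s) when no gene
    # information was provided for the feature.
    same = row["gene_id"] == row["transcript_id(s)"]
    print(row["gene_id"], row["length"], "no gene info" if same else "annotated")
```

For a real file, replace `io.StringIO(SAMPLE_TSV)` with `open(path, newline="")`.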

ADD REPLY • link 5.6 years ago by GenoMax 142k

0

Thanks a lot, I think Alex already answered my question :) But to confirm, I should check the RSEM output to see which gene_id reference they use.

ADD REPLY • link 5.6 years ago by fusion.slope ▴ 250

1

I don't think those are HGNC IDs. They are entries that did not have a gene name. As the legend says:

"If no gene information is provided, gene_id and transcript_id is the same."

Further down in the file you have normal gene identifiers.

ADD REPLY • link 5.6 years ago by GenoMax 142k

0

Oh, thanks a lot! I scrolled through the file a bit but did not go down far enough to see the Ensembl gene annotation. Much appreciated, GenoMax!

ADD REPLY • link 5.6 years ago by fusion.slope ▴ 250

0

See: How to add images to a Biostars post

ADD REPLY • link 5.6 years ago by Ram 43k

score 6 · Accepted Answer · 2020-05-14

I have just spent quite a while trying to figure this out and finally solved the mystery: these odd lines refer to tRNAs and pseudo_tRNAs.

In the descriptions of analysis on the ENCODE website, there is no mention of any such features. I decided to look at the files that ENCODE's pipelines use as input to RSEM to figure out what they were. In the metadata table associated with the files, mine say they used annotation 'M4'. I went to ENCODE's 'Reference Sequences' page and took a look at this M4 annotation, but found that every feature in the file was of the format ENSMUSG....

It was only when I started digging through random annotation files on the ENCODE portal, such as this example, that I found the association between these values and the tRNAs.

For example, this is a snippet from the above-linked file:

10000   Pseudo_tRNA
10001   Pseudo_tRNA
10002   Pseudo_tRNA
10003   Pseudo_tRNA
10004   Pseudo_tRNA
10005   Pseudo_tRNA
10006   Ala_tRNA
10007   Pseudo_tRNA
10008   Lys_tRNA
10009   Pseudo_tRNA
10027   Ser_tRNA
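
If you want to label the numeric IDs rather than look them up by hand, a two-column listing like the one above can be loaded into a dictionary. This is only a sketch: the function name `parse_lookup` is hypothetical, the text below is a subset of the snippet above, and whitespace-separated columns are an assumption about the file.

```python
# A few lines copied from the snippet above; a real lookup would be read
# from the annotation file itself.
LOOKUP_TEXT = """\
10000   Pseudo_tRNA
10006   Ala_tRNA
10008   Lys_tRNA
10027   Ser_tRNA
"""


def parse_lookup(text):
    """Map each numeric ID (kept as a string) to its feature label."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            # Skip blank or malformed lines instead of guessing at them.
            continue
        numeric_id, label = fields
        mapping[numeric_id] = label
    return mapping


trna_labels = parse_lookup(LOOKUP_TEXT)
print(trna_labels.get("10006", "unknown"))
```

IDs are kept as strings so they compare directly with the gene_id column of the tsv.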

I'm not entirely sure why these features are included in the output files. I suspect that it may be a mistake (if it's not, the analysis descriptions should be made clearer).

So for most analyses where you don't care about tRNAs, I reckon you can just delete these lines. I hope this answer saves future explorers some time.
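
Deleting the numeric rows could be done along these lines in Python. Treat it as a sketch under a few assumptions: the file is tab-separated with a header row, gene_id is the first column, and every purely numeric gene_id is one of the tRNA entries. The function name `drop_numeric_gene_ids` and the sample rows are made up for illustration.

```python
import csv
import io


def drop_numeric_gene_ids(in_handle, out_handle):
    """Copy a tsv, skipping rows whose first column is purely numeric.

    Returns a (kept, dropped) tuple of data-row counts.
    """
    reader = csv.reader(in_handle, delimiter="\t")
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(next(reader))  # pass the header through unchanged
    kept = dropped = 0
    for row in reader:
        if not row:
            continue  # ignore blank lines
        if row[0].isdigit():
            dropped += 1  # numeric gene_id: assumed to be a tRNA entry
            continue
        writer.writerow(row)
        kept += 1
    return kept, dropped


# Placeholder input; the values are not taken from an ENCODE file.
sample = (
    "gene_id\ttranscript_id(s)\tlength\n"
    "10000\t10000\t73.00\n"
    "10006\t10006\t72.00\n"
    "ENSMUSG00000000001.4\tENSMUST00000000001.4\t3262.00\n"
)
filtered = io.StringIO()
counts = drop_numeric_gene_ids(io.StringIO(sample), filtered)
print(counts)
print(filtered.getvalue())
```

For real files, pass handles opened with `open(path, newline="")` for reading and `open(out_path, "w", newline="")` for writing.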