How do I update the gene names of TCGA data correctly?
12 weeks ago
Ivana • 0

Dear all,

I want to work with TCGA RNA-seq data. My list of genes of interest is annotated according to the latest HGNC update (04.06.2024), so I wanted to bring the TCGA gene names to the same version. However, when I do this using the Ensembl gene ID, roughly 3,800 genes cannot be matched. I also tried matching by gene name, but even more genes fail to match that way.

I am still a beginner in bioinformatics and I would be grateful for any tips or suggestions on how to annotate/update the TCGA gene names!

Thank you!

Best,
Ivana

HGNC TCGA • 623 views

What do you mean by "cannot match"? Can you give us an example?


This is a classic bioinformatics question, and there is no standard way to do it. You are balancing your mappings between false positives (FPs) and false negatives (FNs).

I normally ensemble all the following mappings

And then you can set up a rule. My rule is that if a mapping is not unique, I manually inspect it.
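The rule above could be sketched roughly as follows. This is only an illustration: the source names, gene IDs, and symbols are placeholders, and the helper functions `combine_mappings` and `split_by_uniqueness` are hypothetical names, not part of any library.

```python
from collections import defaultdict

# Placeholder mapping sources (Ensembl gene ID -> symbol). In practice these
# might be exported from HGNC, Ensembl, or the annotation shipped with TCGA.
sources = {
    "source_a": {"ENSG_A": "GENE1", "ENSG_B": "GENE2"},
    "source_b": {"ENSG_A": "GENE1", "ENSG_B": "GENE2_OLD", "ENSG_C": "GENE3"},
}


def combine_mappings(sources):
    """Pool every candidate symbol seen for each gene ID across all sources."""
    candidates = defaultdict(set)
    for mapping in sources.values():
        for gene_id, symbol in mapping.items():
            candidates[gene_id].add(symbol)
    return candidates


def split_by_uniqueness(candidates):
    """Accept IDs with exactly one candidate symbol; set the rest aside."""
    accepted, needs_review = {}, {}
    for gene_id, symbols in candidates.items():
        if len(symbols) == 1:
            accepted[gene_id] = next(iter(symbols))
        else:
            needs_review[gene_id] = sorted(symbols)
    return accepted, needs_review


accepted, needs_review = split_by_uniqueness(combine_mappings(sources))
print("accepted:", accepted)
print("inspect manually:", needs_review)
```

Everything in `needs_review` is what would be looked at by hand; how strict the rule should be depends on whether false positives or false negatives hurt more in your analysis.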

12 weeks ago
ATpoint 84k

Updates in gene annotations might result in deprecation of certain gene names, while others might be added. If you use existing TCGA quantifications (without starting from fastq files) then I would really just focus on gene IDs that currently have a match in the recent annotation you want to use, and give the others some dummy name, like missing_[0-9]+. There is no point trying to force-match them somehow. If they're gone in recent HGNC then they're gone. The only truly "good" workaround would be to process TCGA from the fastq files onward, but that is access-restricted and tedious. You cannot expect old annotations to perfectly match recent ones. If that were the case then the new annotations would be pointless, no?

Or just use the existing annotations in the TCGA databases, without updating. Is it really critical to "update" here?
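A minimal sketch of the dummy-name idea, assuming you already have a dictionary from unversioned Ensembl gene IDs to current HGNC symbols. The function names are hypothetical, the version suffixes are illustrative, and the third ID is made up so that it fails to match. The sketch also strips the version suffix (the part after the dot) before matching, since TCGA files from the GDC often carry versioned IDs while HGNC lists unversioned ones; whether that applies to your files is something to check.

```python
def strip_version(ensembl_id):
    """Drop the version suffix: 'ENSG00000141510.18' -> 'ENSG00000141510'."""
    return ensembl_id.split(".", 1)[0]


def rename_genes(tcga_ids, id_to_symbol):
    """Map each TCGA gene ID to a current symbol, or to missing_<n> if absent."""
    names = {}
    n_missing = 0
    for gene_id in tcga_ids:
        symbol = id_to_symbol.get(strip_version(gene_id))
        if symbol is None:
            n_missing += 1
            symbol = f"missing_{n_missing}"
        names[gene_id] = symbol
    return names


# Toy lookup table (unversioned Ensembl ID -> current HGNC symbol).
hgnc = {"ENSG00000141510": "TP53", "ENSG00000012048": "BRCA1"}

# Toy TCGA IDs; the last one is invented and has no match.
tcga_ids = ["ENSG00000141510.18", "ENSG00000012048.23", "ENSG00000999999.1"]

names = rename_genes(tcga_ids, hgnc)
print(names)
```

Keeping the original ID as the dictionary key means nothing is lost: the unmatched genes stay in the expression matrix under a dummy name and can still be traced back.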


Hi ATpoint, thank you for your suggestion and feedback! What I want to achieve is basically just to check expression levels for my genes of interest. As they have been annotated with the latest HGNC update, I thought I needed to do the same with the TCGA gene names. My concern was that I might not be able to find my genes of interest simply because TCGA still uses some "previous" names. Going with dummy names is an interesting idea. I will give it a try! Thank you!
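For the concern about "previous" names, one possible check is to look up each gene of interest under its current symbol and, failing that, under its previous symbols. The sketch below assumes an HGNC-style table with a `symbol` column and a pipe-separated `prev_symbol` column; that layout is an assumption modeled on the HGNC download files and should be verified against the file you actually use. The rows are toy data, `FOO1` is a made-up symbol, and `previous_symbols` and `locate` are hypothetical helper names.

```python
# Toy rows in an assumed HGNC-like layout (not complete HGNC records).
hgnc_rows = [
    {"symbol": "KMT2A", "prev_symbol": "MLL"},
    {"symbol": "SEPTIN9", "prev_symbol": "SEPT9"},
    {"symbol": "TP53", "prev_symbol": ""},
]


def previous_symbols(rows):
    """Build {current symbol: [previous symbols]} from pipe-separated fields."""
    return {
        row["symbol"]: [p for p in row["prev_symbol"].split("|") if p]
        for row in rows
    }


def locate(genes_of_interest, dataset_symbols, prev_by_current):
    """For each gene, list the names under which it appears in the dataset."""
    found = {}
    for gene in genes_of_interest:
        candidates = [gene] + prev_by_current.get(gene, [])
        found[gene] = [name for name in candidates if name in dataset_symbols]
    return found


# Toy set of symbols as they might appear in an older annotation.
tcga_symbols = {"MLL", "TP53", "SEPT9"}

hits = locate(
    ["KMT2A", "TP53", "SEPTIN9", "FOO1"],
    tcga_symbols,
    previous_symbols(hgnc_rows),
)
print(hits)
```

An empty list means the gene was not found under any known name. Hits via a previous symbol are worth a manual look, because an old symbol can be ambiguous.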


Can't you check by Ensembl ID? These should be constant.


Yes, they might be constant, but the problem is rather that a gene can be linked to multiple Ensembl IDs.


"the problem is rather that a gene can be linked to multiple Ensembl IDs"

That's by design. Restrict yourself to canonical chromosomes and you should see 1<->1 mapping for the most part (except pseudogenes, miRNA, PAR genes etc.)
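To see how much of the multi-mapping disappears, something like the following could be used. It is a sketch on placeholder records of the form (gene ID, symbol, chromosome); the `chr`-prefixed chromosome names are an assumption (Ensembl itself uses `1`, `X`, and so on), and the helper names are hypothetical.

```python
from collections import defaultdict

# Assumed naming of the canonical chromosomes; adjust to your annotation.
CANONICAL = {f"chr{i}" for i in range(1, 23)} | {"chrX", "chrY", "chrM"}

# Placeholder annotation records: (gene ID, symbol, chromosome).
records = [
    ("ENSG_1", "GENEA", "chr1"),
    ("ENSG_2", "GENEA", "chr1_KI270706v1_random"),  # copy on a non-canonical scaffold
    ("ENSG_3", "GENEB", "chrX"),
    ("ENSG_4", "GENEB", "chrY"),  # PAR-like case: still two IDs after filtering
]


def ids_per_symbol(records, keep_chroms=None):
    """Group gene IDs by symbol, optionally keeping only some chromosomes."""
    grouped = defaultdict(list)
    for gene_id, symbol, chrom in records:
        if keep_chroms is None or chrom in keep_chroms:
            grouped[symbol].append(gene_id)
    return dict(grouped)


def multi_mapped(grouped):
    """Return only the symbols linked to more than one gene ID."""
    return {s: ids for s, ids in grouped.items() if len(ids) > 1}


before = multi_mapped(ids_per_symbol(records))
after = multi_mapped(ids_per_symbol(records, CANONICAL))
print("before filtering:", before)
print("after filtering:", after)
```

Whatever remains in `after` corresponds to the exceptions mentioned above and is a short enough list to handle case by case.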
