Easiest way to homogenize gene aliases across datasets?
3
0
Entering edit mode
3.4 years ago
dlmatera ▴ 10

I've got 3 rna seq datasets Im trying to integrate and there are ~1000genes with different aliases across sets (when I integrate them these genes do not merge appropriately as the Seurat package assumes these different aliases are in fact different genes)

Is there any easy way to go about homogenizing all of these? I've tried the online converters but that has not worked very well? Logically it seems easy to convert all of these to ensembl or entrez IDs and then convert back to gene IDs, but since these are sc-rna-seq datasets there is a variety of "gene names" which are for alignment/normalization that get lost during the conversion process since they aren't actually genes (these are of course needed for my downstream analysis). If I have a list of the ~1000 aliases, is there an easy way to search for these aliases in my datasets and replace with a single gene name?

RNA-Seq R gene alignment • 4.0k views
ADD COMMENT
0
Entering edit mode

Can you provide a few examples of these aliases? Are they all human genes or some other organism?

ADD REPLY
0
Entering edit mode

all human gene symbols, for example one is SELENOW (dataset#1), SELW (dataset#2), and SEPW1 (dataset #3)

ADD REPLY
0
Entering edit mode

HGNC:10752 is the unification you need here: https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/10752

Symbol: SELENOW
Previous Symbol(s): SEPW1
Alias Symbol(s): SELW

The best resource that I use for this is GeneCards. While online searching (manually) is free, batch/command line access is not. I use it to narrow down on genes if ambiguity remains after filtering through all known sources (NCBI, biomart, HGNC, etc)

ADD REPLY
1
Entering edit mode

I will caution against relying too heavily on GeneCards, as it's essentially a commercial entity and it's not clear how much of their data is pulled. It's super convenient for a spot check, but as Ram mentioned, there are more open sources that are much easier to automate (several of which have been mentioned here) and are either updated in real-time or are the actual data providers.

ADD REPLY
0
Entering edit mode

You can download current HGNC names from their BioMart. Select previous symbol and alias symbol for output in addition. That should cover situation Illustrated above.

ADD REPLY
0
Entering edit mode

im curious if there's an easy way to automate this?

ADD REPLY
1
Entering edit mode

You can download the entire list by using "Download" button once you get the initial screen of results. There are ~42K symbols at the moment.

ADD REPLY
0
Entering edit mode

Just to clarify: By "gene ID", are you referring to HGNC symbols? What are "gene names"?

ADD REPLY
0
Entering edit mode

yep gene symbols, not sure if they are all the official HGNC one

ADD REPLY
0
Entering edit mode

You're going to need to be more specific and treat each symbol differently then - those that are official HGNC symbols, those that are synonyms or previous symbols of official HGNC symbols, those not found in HGNC but found in, say, the NCBI Gene database, etc. It's not an exact process, but it gets you close to addressing all entries. It will involve manual work too.

ADD REPLY
0
Entering edit mode

I would convert everything to Ensembl Gene ID which is the only reliable way to ensure that you have consistency. Is there anything wrong with this approach?

ADD REPLY
0
Entering edit mode

ENSG IDs don't have a 1:1 mapping to HGNC gene symbols. One way to narrow it down is to eliminate those ENSGs that map to HGNC symbols for genes that fall outside the 1-22XY chromosomes (i.e. in the patch/alt contigs).

ADD REPLY
0
Entering edit mode

Ok so I downloaded the HGNC dataset (thanks to all who helped me find this) and reorganized my data, now I have

List #1- All features (gene names..aliases and official) from my dataset in one column. List #2 -official gene names from HGNC (column 1) an aliase (column 2)

Im currently trying to find functions in R, but it seems some type of loop could automate the scheme below? (Im terrible at coding as im an experimentalist so if anyone has ideas im happy to receive)

-for each row in list #1, search the value (which should be a string?) for a match in list #2 column #2 (the aliases), if a match is found, replace the original string in List#1 with the value in list #2 column#1

This would work right?

ADD REPLY
1
Entering edit mode

Try some things out and if you run into problems ask a new focused question providing any code you tried to use.

ADD REPLY
0
Entering edit mode

Don't replace - you may end up with duplicates. This problem needs a much more nuanced and save-state-at-each-step approach than a simple Find-and-Replace. Tread carefully while mapping - replacing without a record of what you replaced and what you replaced it with, along with where you got these from as well as when the replacement was done will get you lost where you won't be able to map your final RNAseq dataset to your initial RNAseq dataset.

ADD REPLY
0
Entering edit mode

Hi all! Thanks for the help I think I have got this working. However, some have recommended I find a way to check if the results or accurate (or do something more advanced than a find& replace because this could introduce errors?) Does anyone have some recommendations? I manually checked the results & it seemed accurate (although I obviously cant check 26,000 genes manually). I also started out with a trial dataset of ~100 genes and it was 100% accurate for that. Do people still think something more complicated/additional validation is necessary?

See code below: (dataset is a dataframe of 26k gene names & database is a dataframe w/ 2 columns one of which contains gene aliases and the other which contains the approved symbol

indx<-match(dataset$gene, databse$Alias, nomatch = 0)  #create and index of all locations in the dataset where the gene name matches that of a known gene alias 

dataset$gene[indx !=0] <- database$approvedsymbol[indx]
 #for locations where indx=/0 (when a gene alias is found), replace the gene name in the dataset with its approved symbol from the databse
ADD REPLY
1
Entering edit mode

I know I'm being a pain in the behind with this - any good method you use will have 90-95% accuracy on the entire 26,000 gene dataset. You'll only run into problems with pseudogenes, AS/DT entries etc. There will come a point when you will have to ignore a few entries, so you may as well go with what works best for you and fix errors as you encounter them later. Gene identifiers are not a perfect thing and you'll never have a solution that works 100%.

ADD REPLY
3
Entering edit mode
3.4 years ago

The best tool I've found for dealing with gene symbol aliases is the MyGene API. It has associated R and python packages as well.

It's relatively straightforward to use, but the general process for this would go something like:

1.) Query via gene symbol and grab the top hit:

http://mygene.info/v3/query?q=symbol:cdk2&species=human

{"took": 4, 
"total": 1, 
"max_score": 88.06128, 
"hits": [{"_id": "1017", "_score": 88.06128, "entrezgene": "1017", "name": "cyclin dependent kinase 2", "symbol": "CDK2", "taxid": 9606}]}

2.) Pull the aliases/official symbol for the Entrez ID of the top hit:

https://mygene.info/v3/gene/1017?fields=alias,symbol

{"_id": "1017", "_version": 9, "alias": ["CDKN2", "p33(CDK2)"], "symbol": "CDK2"}

3.) Look for alias symbols in your other samples and change them to the official symbol. If necessary, change your query symbol to the official one too (as it could be an alias). I leave the implementation of this to you. I highly recommend you retain the original symbols somewhere rather than replacing them outright. In reality, I'd use stable gene identifiers as much as possible (ENSG preferably). As others have mentioned, this isn't a simple problem and has many edge cases.

I expect this will deal with most genes relatively well. The lncRNAs/pseudogenes/etc are a crap shoot.

ADD COMMENT
0
Entering edit mode

lncRNAs, pseudogenes and AS/DT entries are where most measures fail - this is when NCBI Gene (Eiither through GenoMax's command line method or the org.Hs.eg.db method I mention in a comment on his answer) is really useful. I use a custom download from HGNC for the rest of the sanely mappable genes.

The problem is when 2 aliases are both present in an RNAseq dataset. The best way to do this is to go back to the GTF files for each dataset and map from there, given that they have more information on each gene. The RNAseq dataset itself only has entries from the gene_id field of the GTF file.

ADD REPLY
0
Entering edit mode

The problem is when 2 aliases are both present in an RNAseq dataset. The best way to do this is to go back to the GTF files for each dataset and map from there, given that they have more information on each gene. The RNAseq dataset itself only has entries from the gene_id field of the GTF file.

This is a great point and really highlights the benefits of using stable gene IDs over (or at least with) gene symbols whenever possible.

ADD REPLY
2
Entering edit mode
3.4 years ago
vkkodali_ncbi ★ 3.7k

You can use NCBI Datasets for this. Specifically, you can use the command line tool as follows:

for s in SELENOW SELW SEPW1 ; do 
echo -ne "$s\t" ; 
datasets summary gene symbol $s | jq -r '.genes[] | [.gene.gene_id,.gene.symbol] | @tsv ' ; 
done 

SELENOW 6415    SELENOW
SELW    6415    SELENOW
SEPW1   6415    SELENOW

The output of datasets command here is in json format that I am parsing using the tool jq. The final tab-delimited output has the input symbol, NCBI GeneID and the HGNC symbol.

ADD COMMENT
2
Entering edit mode
3.4 years ago
GenoMax 141k

With EntrezDirect (additional parsing needed) :

$ esearch -db gene -query "human [orgn]" | efetch | grep -A 1 "Official" | head -11

Official Symbol: TP53 and Name: tumor protein p53 [Homo sapiens (human)]
Other Aliases: BCC7, BMFS5, LFS1, P53, TRP53
--
Official Symbol: EGFR and Name: epidermal growth factor receptor [Homo sapiens (human)]
Other Aliases: ERBB, ERBB1, HER1, NISBD2, PIG61, mENA
--
Official Symbol: TNF and Name: tumor necrosis factor [Homo sapiens (human)]
Other Aliases: DIF-alpha, TNFA, TNFSF2, TNLG1F, TNF
--
Official Symbol: APOE and Name: apolipoprotein E [Homo sapiens (human)]
Other Aliases: AD2, APO-E, ApoE4, LDLCQ5, LPG
ADD COMMENT
0
Entering edit mode

As an add-on to this: I use the R package org.Hs.eg.db to get one of SYMBOL, GENENAME and ALIAS from another of the same set. This way, I can stick to R and use biomaRt/reutils/org.Hs.eg.db.

ADD REPLY

Login before adding your answer.

Traffic: 2976 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6