Question

Tool to search for missing genes among genome annotation versions?

0

8 months ago

txema.heredia ▴ 110

Hi,

When exploring datasets from published papers, I find more often than I'd like that the genes present in the dataset and the annotation file version mentioned in the materials & methods do not match.

Usually I end up with a list of genes that are present in the data but do not exist in the annotation version that was supposedly used. Then it is my turn to enter a spiral of testing different Ensembl/GENCODE versions until I find the version that misses the fewest genes (spoiler: there is never a perfect match).

Is there an online tool that lets you enter a list of gene names or ids, and returns a list of annotation versions where they are present?

Edit: I've found this tool to search for synonymous gene names, https://www.genenames.org/tools/multi-symbol-checker/ , which addresses another common part of this problem. However, it doesn't tell you when the name change happened, so you cannot tell which older annotation versions to start digging through.

annotation • 1.2k views

ADD COMMENT • link updated 8 months ago by LauferVA 4.2k • written 8 months ago by txema.heredia ▴ 110

0

being able to effectively map between annotations of different kinds is among the most necessary skill sets in bioinformatics. i'd start with a comprehensive resource, e.g. https://biostar.myshopify.com/

ADD REPLY • link 8 months ago by LauferVA 4.2k

0

Thanks for your comment.

However, I fail to see which of these courses can help me identify which version of the annotation files contains a deprecated gene symbol or alias, or which source was really used and then misreported in the materials and methods of the article. Could you point me to the one among the linked courses that can help with this kind of issue?

What I am facing right now is a dataset with ~3800 genes supposedly annotated against Gencode v33. The dataset contains 20 genes using a deprecated alias or symbol not present in the annotation file (I could match them to the symbols in annotation file using https://www.genenames.org/tools/multi-symbol-checker/ , ensembl, and google), and 10 more genes that cannot be resolved.

For example:

The dataset contains an entry for the DUXAP10 gene. That gene is not present in Gencode v33.
The current version of Ensembl lists LNMAT1 as an alias for DUXAP10.
LNMAT1 is also not present in Gencode v33.
https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:32188 lists LNMAT1 as an alias of DUXAP9. Ensembl lists LNMAT1 as an alias of both DUXAP9 and DUXAP10.
DUXAP9 is indeed present in Gencode v33.
I downloaded newer versions of Gencode, and DUXAP10 only appears from v35 on.
However, using Gencode v35, there are now 46 genes from the dataset that do not match the annotation. Some of them are the ones with old aliases, but 15 of them are genes that have been removed from Gencode since v33 (their gene_id has been removed from Ensembl; its last appearance was in v100).
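The manual check described above (in which versions does a given symbol appear?) could be scripted roughly as follows. This is only a sketch: it assumes GENCODE-style GTF lines where column 3 is the feature type and column 9 carries a `gene_name "..."` attribute, and the toy lines, coordinates, and gene IDs (`G1`, `G2`) are made up for illustration. The helper names are hypothetical, not part of any existing tool.

```python
import re

# Matches the gene_name attribute in column 9 of a GTF line.
GENE_NAME_RE = re.compile(r'gene_name "([^"]+)"')


def gene_names_from_gtf(lines):
    """Collect gene_name values from the 'gene' features of GTF lines."""
    names = set()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_NAME_RE.search(fields[8])
        if match:
            names.add(match.group(1))
    return names


def versions_containing(symbol, names_by_version):
    """List the versions whose gene-name set includes the symbol."""
    return [v for v, names in names_by_version.items() if symbol in names]


def toy_gene_line(gene_id, gene_name):
    """Build a fake GTF 'gene' line (hypothetical coordinates and IDs)."""
    attrs = f'gene_id "{gene_id}"; gene_type "lncRNA"; gene_name "{gene_name}";'
    return "\t".join(["chr1", "HAVANA", "gene", "1", "100", ".", "+", ".", attrs])


# Toy stand-ins for two annotation releases, mirroring the example above.
toy_gtfs = {
    "v33": [toy_gene_line("G1", "DUXAP9")],
    "v35": [toy_gene_line("G1", "DUXAP9"), toy_gene_line("G2", "DUXAP10")],
}
names_by_version = {v: gene_names_from_gtf(ls) for v, ls in toy_gtfs.items()}

print(versions_containing("DUXAP10", names_by_version))
print(versions_containing("LNMAT1", names_by_version))
```

With real files, the lists in `toy_gtfs` would be replaced by open file handles (for example `gzip.open(path, "rt")` for a compressed GTF), one per downloaded release.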

How can I "discover" the actual source (or combination of sources) of the annotation file used to produce this dataset?

ADD REPLY • link 8 months ago by txema.heredia ▴ 110

0

hi again, you are very right that my answer doesn't get you very far in solving what is admittedly a thorny problem. not so much thorny i guess, as just, "rote" and to some degree time intensive.

you are on the right track. what you need to do is curate a superset of the possible annotations and then map what you can to what you can, converging ultimately on a single up to date annotation set.

based on your answer, it seems like you are doing this yourself. this is how most people start, but generally speaking it is a better use of time to draw on other resources that have already done this.

consider, for instance, packages like AnnotationHub, GO.db, etc. that already have pre-populated tables with these gene annotations by version for Gencode, GO, official gene symbol, HUGO, on and on.

another good resource is UCSC table browser, if you spend enough time doing various things on there, eventually you'll see it is a very powerful resource for problems like this.

anyway, once you have what you consider to be a plausible superset of possible annotation Dbs, simply run all against all. the smoking gun is if one annotation Db actually gets all of them; but if this doesn't happen you have recourse...
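as a rough illustration of the "all against all" step: a minimal sketch, assuming each candidate annotation has already been reduced to a set of gene symbols. the function names and the toy gene/annotation names are placeholders, not real data.

```python
def rank_annotations(dataset_genes, annotations):
    """Return (annotation_name, missing_genes) pairs, fewest missing first."""
    dataset = set(dataset_genes)
    results = [(name, dataset - set(genes)) for name, genes in annotations.items()]
    # Sort by number of missing genes, then by name for a stable order.
    return sorted(results, key=lambda item: (len(item[1]), item[0]))


def unresolved_everywhere(dataset_genes, annotations):
    """Genes that no candidate annotation contains at all."""
    all_known = set().union(*annotations.values()) if annotations else set()
    return set(dataset_genes) - all_known


# Toy inputs (placeholder names).
dataset = {"GENE_A", "GENE_B", "GENE_C", "GENE_D"}
annotations = {
    "gencode_v33": {"GENE_A", "GENE_B"},
    "gencode_v35": {"GENE_A", "GENE_B", "GENE_C"},
}

ranking = rank_annotations(dataset, annotations)
for name, missing in ranking:
    print(name, len(missing), sorted(missing))
print("missing everywhere:", sorted(unresolved_everywhere(dataset, annotations)))
```

an empty missing set for some annotation would be the smoking gun; otherwise the genes missing everywhere are the ones to chase through alias and previous-symbol tables.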

that help?

ADD REPLY • link 8 months ago by LauferVA 4.2k

score 2 · Answer 1 · 2023-08-11

2

8 months ago

barslmn ★ 2.1k

Hi, I had the exact same problem and made a tool. It doesn't report when the change happened, as you asked, but it does show which gene symbol (approved, alias, or previous) is used in which reference annotation. The output is text, and the important part is the warnings section in the header. I have been the only user so far, so I am curious whether it is of any use to anyone else. :)

You can find the tool here: https://github.com/barslmn/cross-symbol-checker/

There is an online version up on my website: https://omics.sbs/bioscripts/crosssymbolchecker/

Description:

This tool maps aliases and previous or withdrawn symbols to current approved HGNC symbols, checks for entries that are not gene symbols, and fixes capitalization. Ensembl and RefSeq annotation files sometimes use previous or alias symbols. The tool also cross-checks against Ensembl and NCBI annotation files for different genome versions and shows which gene symbols are used. This way an accurate gene set can be utilized to avoid false negatives in the variant discovery process. The source code is available on GitHub.
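To give an idea of the kind of check described here, below is a minimal Python sketch. It is not the tool's actual implementation: the function names and lookup tables are assumptions for illustration, `GENEX` and `OLDNAME1` are hypothetical placeholders, and the LNMAT1 entry simply reflects the Ensembl aliases mentioned earlier in this thread. Matching is case-insensitive so that capitalization differences are resolved to the stored spelling.

```python
def build_lookup(approved, aliases, previous):
    """Map case-folded symbols to (status, stored_symbol, approved_targets)."""
    lookup = {}
    tables = (
        ("approved", {symbol: {symbol} for symbol in approved}),
        ("alias", aliases),
        ("previous", previous),
    )
    # setdefault keeps the first entry, so an approved symbol always wins
    # over an alias or previous symbol with the same spelling.
    for status, table in tables:
        for symbol, targets in table.items():
            lookup.setdefault(symbol.casefold(), (status, symbol, sorted(targets)))
    return lookup


def check_symbol(symbol, lookup):
    """Classify one input symbol; unknown entries get an empty target list."""
    return lookup.get(symbol.strip().casefold(), ("unknown", symbol, []))


# Toy tables; GENEX and OLDNAME1 are hypothetical placeholders.
approved = {"DUXAP9", "DUXAP10", "GENEX"}
aliases = {"LNMAT1": {"DUXAP9", "DUXAP10"}}
previous = {"OLDNAME1": {"GENEX"}}
lookup = build_lookup(approved, aliases, previous)

for query in ["duxap10", "lnmat1", "OLDNAME1", "NOTAGENE"]:
    print(query, check_symbol(query, lookup))
```

An ambiguous alias such as LNMAT1 returns more than one approved target, which is exactly the case that needs a warning and a manual decision.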

ADD COMMENT • link 8 months ago by barslmn ★ 2.1k

0

cool! i have a sort of ad hoc version, but not so well organized and certainly not in a docker container.

i will try the tool out - one thought (pre testing) is that this would possibly be better pitched as a bioconductor package or even an adjunct to an existing effort (rather than in the shell). thoughts?

ADD REPLY • link 8 months ago by LauferVA 4.2k

0

I had the idea of rewriting it in a different language. I was thinking about C, but having it in R (and in Bioconductor) would be much more practical.

ADD REPLY • link 8 months ago by barslmn ★ 2.1k

0

entirely your decision. just that doing it that way enables you to take advantage of the large scale efforts that have already been undertaken in this area (e.g., AnnotationHub, GO.db, many others).

ADD REPLY • link 8 months ago by LauferVA 4.2k