Question

BiomaRt gene symbol to gene ID: incomplete conversion

1

2.3 years ago

Filipe ▴ 10

I am doing enrichment analysis of differentially expressed genes (DEGs) in Atlantic salmon. To do so, I need to convert the gene names that came out of STAR/HTSeq to a single, uniform format.

Results sample:

# A tibble: 1,053 × 1
   id          
   <chr>       
 1 LOC106612275
 2 LOC106611534
 3 LOC100380739
 4 LOC106610718
 5 LOC106573648
 6 btk         
 7 LOC106613810
 8 LOC106571198
 9 LOC106582786
10 gria4       
# … with 1,043 more rows

I used biomaRt,

library(biomaRt)

ensembl <- useEnsembl(biomart = 'genes', dataset = 'ssalar_gene_ensembl')
# Look up Entrez gene IDs for the gene symbols in sig_res$id
gene_id <- getBM(attributes = c('entrezgene_accession', 'entrezgene_id'),
                 filters = 'entrezgene_accession',
                 values = sig_res$id,
                 uniqueRows = TRUE,
                 mart = ensembl)

And got the conversions,

> head(gene_id)
  entrezgene_accession entrezgene_id
1                abce1     100194739
2                 abi1     100195035
3               acadvl     106602820
4               afg3l2     106597060
5                aggf1     106568194
6                  agt     100195417

However, when I joined the converted IDs with the initial gene list supplied to biomaRt, I noticed that 191 genes had not been converted.

> nrow(sig_res) - nrow(gene_id)
[1] 191
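
The row-count difference can misstate the number of missing symbols if getBM() returns more than one row for some symbols (the join further down has 1,060 rows for 1,053 inputs, which hints at this). A minimal sketch of listing the unconverted symbols directly with dplyr's anti_join(), using toy stand-ins for sig_res and gene_id. The two btk IDs are invented for illustration.

```r
library(dplyr)

# Toy stand-ins for sig_res and gene_id (illustrative values only)
sig_res_toy <- tibble(id = c("abce1", "wdfy4", "LOC106612275", "btk"))
gene_id_toy <- tibble(
  entrezgene_accession = c("abce1", "btk", "btk"),        # btk returned twice
  entrezgene_id        = c("100194739", "111111111", "222222222")  # btk IDs are made up
)

# Symbols with no match at all in the biomaRt result
unconverted <- anti_join(sig_res_toy, gene_id_toy,
                         by = c("id" = "entrezgene_accession"))

# Symbols that came back with more than one Entrez ID
multi_mapped <- gene_id_toy %>%
  count(entrezgene_accession, name = "n_ids") %>%
  filter(n_ids > 1)

print(unconverted)
print(multi_mapped)
```

With the real objects, nrow(unconverted) would give the number of symbols that truly have no match, independent of any duplicated rows.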

Looking at the initial list joined with the converted IDs, there are indeed some genes that were not assigned a gene ID.

> sig_res %>% left_join(gene_id, by = c('id' = 'entrezgene_accession')) %>% 
+   mutate(complete_id = if_else(is.na(entrezgene_id), id, entrezgene_id)) %>%
+   mutate(complete_id = gsub('^LOC', '', complete_id)) -> full_sig_res

> arrange(full_sig_res, desc(complete_id))
# A tibble: 1,060 × 3
   id      entrezgene_id complete_id
   <chr>   <chr>         <chr>      
 1 wdfy4   NA            wdfy4      
 2 vcan    NA            vcan       
 3 timm9   NA            timm9      
 4 rps6kc1 NA            rps6kc1    
 5 ptpn14  NA            ptpn14     
 6 ppp1r17 NA            ppp1r17    
 7 nckp1   NA            nckp1      
 8 mk67i   NA            mk67i      
 9 klf7    NA            klf7       
10 ki67    NA            ki67       
# … with 1,050 more rows

For example, one of the genes that did not get converted, wdfy4, is listed on NCBI for Atlantic salmon with a gene ID, but it does not show up for salmon on Ensembl, where a search only returns results for human.

I reckon this is because I used RefSeq's Atlantic salmon annotation for the initial alignment and gene counts, while biomaRt uses Ensembl's annotation for the ID conversion. Does anyone have an idea how I can solve this issue, i.e. convert all the unconverted gene names to gene IDs?

biomart R entrez ensembl gene ID • 1.7k views

ADD COMMENT • link updated 2.3 years ago by Ben_Ensembl ★ 2.4k • written 2.3 years ago by Filipe ▴ 10

score 3 · Answer 1 · 2022-01-13

3

2.3 years ago

Ben_Ensembl ★ 2.4k

Hi Filipe,

I'm not sure if this will solve your query, but it's also worth noting that the Atlantic Salmon assembly and annotation available in Ensembl (and therefore BioMart) is the ICSASG_v2 assembly: https://www.ensembl.org/Salmo_salar/Info/Index

However, the SSAL_V3 assembly and annotation is available in the new Ensembl Rapid Release genome browser: https://rapid.ensembl.org/Salmo_salar_GCA_905237065.2/Info/Index

There is a gene annotated in the genomic region corresponding to the NCBI wdfy4 annotation: https://rapid.ensembl.org/Salmo_salar_GCA_905237065.2/Location/View?db=core;g=ENSSSAG00000000139;r=1:49354636-49565797;t=ENSSSAT00000189114

However, Ensembl Rapid Release is a lightweight genome browser that does not contain gene symbol mapping and data mining tools such as BioMart or REST API endpoints. There are genome-wide files available through the FTP site, which may be of use for your ID conversion: ftp://ftp.ensembl.org/pub/rapid-release

ADD COMMENT • link 2.3 years ago by Ben_Ensembl ★ 2.4k

0

Thank you for your reply, Ben.

So, until the Rapid Release annotation is 'added' to biomaRt, it will not be possible to convert these IDs (through biomaRt at least).

Is there any other viable alternative to convert gene symbols to IDs using the latest annotation? Manual curation is not an option for me, since we'd be talking about thousands of genes in total.

ADD REPLY • link 2.3 years ago by Filipe ▴ 10

0

No problem, Filipe. I'm not sure about alternative methods for the gene symbol conversion you need to perform. Maybe someone else here has a good idea?

ADD REPLY • link 2.3 years ago by Ben_Ensembl ★ 2.4k
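
One possible route, sketched here under assumptions rather than tested on the real data: since the counts were made against the RefSeq annotation, the gene IDs may be recoverable from that same annotation file instead of from Ensembl. NCBI RefSeq GTF files typically carry a db_xref "GeneID:..." attribute on gene records; the sketch below assumes that is the case for the file used here. The GTF lines, the sequence name chrA, the file path in the comment, and the GeneID values for wdfy4 and vcan are all made up for illustration.

```r
# Made-up GTF records in the style of an NCBI RefSeq GTF.
# GeneID values for wdfy4 and vcan are placeholders, not real identifiers.
gtf_lines <- c(
  'chrA\tGnomon\tgene\t100\t900\t.\t+\t.\tgene_id "wdfy4"; db_xref "GeneID:111111111"; gbkey "Gene";',
  'chrA\tGnomon\texon\t100\t300\t.\t+\t.\tgene_id "wdfy4"; transcript_id "XM_000000001.1";',
  'chrA\tGnomon\tgene\t2000\t5000\t.\t-\t.\tgene_id "vcan"; db_xref "GeneID:222222222"; gbkey "Gene";',
  'chrA\tGnomon\tgene\t7000\t8000\t.\t+\t.\tgene_id "LOC106612275"; db_xref "GeneID:106612275"; gbkey "Gene";'
)
# With a real file (hypothetical path):
# gtf_lines <- readLines("path/to/refseq_annotation.gtf")

# Parse the tab-separated columns; quote = "" keeps the double quotes
# inside the attribute column intact.
gtf <- read.delim(text = gtf_lines, header = FALSE, comment.char = "#",
                  quote = "", stringsAsFactors = FALSE)

# Keep gene-level records only (column 3 is the feature type)
genes <- gtf[gtf$V3 == "gene", ]

# Return the first capture group of `pattern` for each string, or NA
extract_attr <- function(x, pattern) {
  hits <- regmatches(x, regexec(pattern, x))
  vapply(hits, function(h) if (length(h) >= 2) h[2] else NA_character_,
         character(1))
}

# Column 9 holds the attributes: pull out the symbol and the GeneID
symbol_to_id <- data.frame(
  id            = extract_attr(genes$V9, 'gene_id "([^"]+)"'),
  entrezgene_id = extract_attr(genes$V9, 'GeneID:([0-9]+)'),
  stringsAsFactors = FALSE
)

print(symbol_to_id)
```

The resulting table could then be joined onto sig_res by id with left_join(), the same way gene_id was joined in the question, without any manual curation.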