I am doing enrichment analysis on differentially expressed genes (DEGs) in Atlantic salmon. To do so, I need to convert the gene names that came out of STAR/HTSeq to a single format, so they are uniform.
Results sample:
# A tibble: 1,053 × 1
id
<chr>
1 LOC106612275
2 LOC106611534
3 LOC100380739
4 LOC106610718
5 LOC106573648
6 btk
7 LOC106613810
8 LOC106571198
9 LOC106582786
10 gria4
# … with 1,043 more rows
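I used biomaRt:
library(biomaRt)
# Connect to the Ensembl Atlantic salmon gene dataset
ensembl <- useEnsembl(biomart = 'genes', dataset = 'ssalar_gene_ensembl')
# Look up Entrez gene IDs for the gene symbols in sig_res
gene_id <- getBM(attributes = c('entrezgene_accession', 'entrezgene_id'),
                 filters = 'entrezgene_accession',
                 values = sig_res$id,  # pass the symbol column as a character vector
                 uniqueRows = TRUE,
                 mart = ensembl)
And got the conversions: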
> head(gene_id)
entrezgene_accession entrezgene_id
1 abce1 100194739
2 abi1 100195035
3 acadvl 106602820
4 afg3l2 106597060
5 aggf1 106568194
6 agt 100195417
However, when I compared the converted IDs with the initial gene list supplied to biomaRt, the result had 191 fewer rows than the input, so a number of genes were not converted.
> nrow(sig_res) - nrow(gene_id)
[1] 191
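A row-count difference is only a rough estimate of how many genes failed to convert: the joined table further down has 1,060 rows against 1,053 input rows, which suggests some symbols matched more than one row. Here is a minimal sketch of counting the unconverted symbols directly. The two data frames are toy stand-ins for sig_res and gene_id, and their values are made up for illustration.

```r
# Toy stand-ins for sig_res and gene_id (values are illustrative only)
sig_res <- data.frame(id = c("btk", "gria4", "wdfy4", "LOC106612275"))
gene_id <- data.frame(
  entrezgene_accession = c("btk", "btk", "gria4"),
  entrezgene_id = c("1", "2", "3")
)

# Symbols with no hit at all in the biomaRt result
unconverted <- setdiff(sig_res$id, gene_id$entrezgene_accession)

# Symbols returned more than once (these add rows after a join)
dup <- duplicated(gene_id$entrezgene_accession)
multi <- unique(gene_id$entrezgene_accession[dup])

length(unconverted)
multi
```

In this toy case the row-count difference is 1, while two symbols actually have no match, because "btk" comes back twice.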
Looking at the joined table of initial and converted IDs, there are indeed a number of genes that did not get a gene ID assigned.
> sig_res %>% left_join(gene_id, by = c('id' = 'entrezgene_accession')) %>%
+ mutate(complete_id = if_else(is.na(entrezgene_id), id, entrezgene_id)) %>%
+ mutate(complete_id = gsub('LOC', '', complete_id)) -> full_sig_res
> arrange(full_sig_res, desc(complete_id))
# A tibble: 1,060 × 3
id entrezgene_id complete_id
<chr> <chr> <chr>
1 wdfy4 NA wdfy4
2 vcan NA vcan
3 timm9 NA timm9
4 rps6kc1 NA rps6kc1
5 ptpn14 NA ptpn14
6 ppp1r17 NA ppp1r17
7 nckp1 NA nckp1
8 mk67i NA mk67i
9 klf7 NA klf7
10 ki67 NA ki67
# … with 1,050 more rows
For example, one of the genes that did not get converted, wdfy4, shows up on NCBI for Atlantic salmon with a gene ID, but not on Ensembl, where a search only returns results for human.
I reckon this is because I used RefSeq's Atlantic salmon annotation for the initial alignment and gene counts, while biomaRt uses Ensembl's annotation for the ID conversion. Does anyone have an idea how I can solve this issue, i.e. convert all the unconverted gene names to gene IDs?
Thank you for your reply, Ben.
So, until the Rapid Release annotation is 'added' to biomaRt, it will not be possible to convert these IDs (through biomaRt at least).
Is there any other viable alternative to convert gene symbols to IDs using the latest annotation? Manual curation is not an option for me, since we'd be talking about thousands of genes in total.
No problem, Filipe. I'm not sure about alternative methods for the gene symbol conversion you need to perform. Maybe someone else here has a good idea?
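Since the counts came from the RefSeq annotation, one possible route is to read the symbol-to-GeneID mapping from that annotation file itself instead of going through Ensembl. The sketch below assumes the GTF carries a db_xref "GeneID:..." attribute on its gene lines, as NCBI RefSeq GTF files typically do, so check this against your own file. The two GTF lines, the gene ID given for wdfy4, and the file path in the comment are made up for illustration.

```r
# Two made-up lines in the style of an NCBI RefSeq GTF (the wdfy4 ID is invented)
gtf_lines <- c(
  'NC_000001.1\tGnomon\tgene\t100\t900\t.\t+\t.\tgene_id "wdfy4"; db_xref "GeneID:111111111"; gbkey "Gene";',
  'NC_000001.1\tGnomon\tgene\t1500\t2500\t.\t-\t.\tgene_id "LOC106612275"; db_xref "GeneID:106612275"; gbkey "Gene";'
)
# In practice: gtf_lines <- readLines("path/to/annotation.gtf")  # hypothetical path

# Split into columns; column 3 is the feature type, column 9 the attributes
fields  <- strsplit(gtf_lines, "\t", fixed = TRUE)
feature <- vapply(fields, `[`, character(1), 3)
attrs   <- vapply(fields, `[`, character(1), 9)
attrs   <- attrs[feature %in% "gene"]  # %in% also drops comment lines (NA feature)

# Return the first capture group of `pattern` for each string, or NA if absent
get_attr <- function(x, pattern) {
  hits <- regmatches(x, regexec(pattern, x))
  vapply(hits, function(h) if (length(h) == 2) h[2] else NA_character_,
         character(1))
}

symbol_to_id <- data.frame(
  id            = get_attr(attrs, 'gene_id "([^"]+)"'),
  entrezgene_id = get_attr(attrs, 'GeneID:([0-9]+)'),
  stringsAsFactors = FALSE
)
symbol_to_id
```

The resulting table could then be joined to sig_res on id in the same way as the biomaRt output, which should cover every gene that HTSeq counted against that annotation.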