Question: Biomart query returns NA when searching for entrez_id, while manual search works
eggrandio wrote (5 weeks ago):

Hi,

I am using biomaRt to convert gene IDs into Entrez IDs. I am working with Arabidopsis genes, and for some of them the query does not return an Entrez ID. However, when I look up the same gene on the Ensembl website, I can find the corresponding ID. Maybe I am using the wrong Mart/Dataset?

Here is the code I am using for reference and the results I obtain:

library(biomaRt)
ensembl = useMart("plants_mart",host="plants.ensembl.org")
ensembl = useDataset("athaliana_eg_gene",mart=ensembl)
genes = c("AT2G14610","AT4G23700","AT3G26830","AT3G15950","AT3G54830","AT5G24105")
query = getBM(attributes=c("ensembl_gene_id",
                       "entrezgene_id",
                       "refseq_dna",
                       "entrezgene_accession"),
          filters=("ensembl_gene_id"),
          values=genes,mart=ensembl)

> query
   ensembl_gene_id entrezgene_id     refseq_dna entrezgene_accession
1        AT2G14610        815949    NM_127025.3               815949
2        AT3G15950        820839    NM_112465.4               820839
3        AT3G15950        820839 NM_001035631.2               820839
4        AT3G15950        820839 NM_001338192.1               820839
5        AT3G15950        820839 NM_001338191.1               820839
6        AT3G15950        820839 NM_001338193.1               820839
7        AT3G26830        822298    NM_113595.4               822298
8        AT3G54830            NA                                  NA
9        AT4G23700        828470 NM_001341626.1               828470
10       AT4G23700        828470    NM_118501.5               828470
11       AT5G24105       2745995    NM_203099.2              2745995

In this case, AT3G54830 does not show any entrezgene_id or refseq_dna. However, when I manually search for it on the plants.ensembl.org or NCBI websites, I can find it:

https://plants.ensembl.org/Arabidopsis_thaliana/Transcript/Summary?db=core;g=AT3G54830;r=3:20311901-20315887;t=AT3G54830.1

Or when I search for it in the NCBI webpage:

https://www.ncbi.nlm.nih.gov/gene/824648

Any help would be appreciated!

Thanks!

refseq arabidopsis biomart R • 125 views

biomaRt simply queries the servers at Ensembl, so it only acts as an interface to whatever Ensembl holds. In Ensembl, this gene does not yet appear to have an NM ID, but it is listed under the refseq_peptide attribute.

written 5 weeks ago by Kevin Blighe

But when I look it up on the NCBI website, it returns Entrez ID 824648, and once there, I can find the NM IDs NM_001339700.1 and NM_115340.3, which correspond to those two refseq_peptide entries.

I guess it could be because the database at Ensembl has not been updated? I find that odd, as it looks like the original sequence of the transcript was uploaded in 2016.

Is there any way of querying the NCBI database with biomaRt?

written 5 weeks ago by eggrandio

I see what you mean. It may be that Ensembl's databases have not yet been updated; I am not sure how it works internally. For automated annotation, though, there are basically 2 main ways:

  • biomaRt
  • org.db packages

Each has pros and cons.

In your case, it seems better to use org.db (see answer below).

written 4 weeks ago by Kevin Blighe
Kevin Blighe wrote (4 weeks ago):

Answer:

library(org.At.tair.db)

genes <- c("AT2G14610","AT4G23700","AT3G26830",
  "AT3G15950","AT3G54830","AT5G24105")

keytypes(org.At.tair.db)

 [1] "ARACYC"       "ARACYCENZYME" "ENTREZID"     "ENZYME"       "EVIDENCE"    
 [6] "EVIDENCEALL"  "GENENAME"     "GO"           "GOALL"        "ONTOLOGY"    
[11] "ONTOLOGYALL"  "PATH"         "PMID"         "REFSEQ"       "SYMBOL"      
[16] "TAIR"        


mapIds(org.At.tair.db, keys = genes,
  column = c('ENTREZID'), keytype = 'TAIR')
'select()' returned 1:1 mapping between keys and columns
AT2G14610 AT4G23700 AT3G26830 AT3G15950 AT3G54830 AT5G24105 
 "815949"  "828470"  "822298"  "820839"  "824648" "2745995" 



select(org.At.tair.db, keys = genes,
  columns = c('ENTREZID', 'SYMBOL', 'REFSEQ'), keytype = 'TAIR')

'select()' returned 1:many mapping between keys and columns
        TAIR ENTREZID   SYMBOL       REFSEQ
1  AT2G14610   815949    ATPR1    NM_127025
2  AT2G14610   815949    ATPR1    NP_179068
3  AT2G14610   815949       PR    NM_127025
4  AT2G14610   815949       PR    NP_179068
5  AT2G14610   815949      PR1    NM_127025
6  AT2G14610   815949      PR1    NP_179068
7  AT4G23700   828470  ATCHX17 NM_001341626
8  AT4G23700   828470  ATCHX17    NM_118501
9  AT4G23700   828470  ATCHX17 NP_001328705
10 AT4G23700   828470  ATCHX17    NP_194101
11 AT4G23700   828470    CHX17 NM_001341626
12 AT4G23700   828470    CHX17    NM_118501
13 AT4G23700   828470    CHX17 NP_001328705
14 AT4G23700   828470    CHX17    NP_194101
15 AT3G26830   822298 CYP71B15    NM_113595
16 AT3G26830   822298 CYP71B15    NP_189318
17 AT3G26830   822298     PAD3    NM_113595
18 AT3G26830   822298     PAD3    NP_189318
19 AT3G15950   820839     NAI2 NM_001035631
20 AT3G15950   820839     NAI2 NM_001338191
21 AT3G15950   820839     NAI2 NM_001338192
22 AT3G15950   820839     NAI2 NM_001338193
23 AT3G15950   820839     NAI2    NM_112465
24 AT3G15950   820839     NAI2 NP_001030708
25 AT3G15950   820839     NAI2 NP_001326807
26 AT3G15950   820839     NAI2 NP_001326808
27 AT3G15950   820839     NAI2 NP_001326809
28 AT3G15950   820839     NAI2    NP_188216
29 AT3G54830   824648     <NA> NM_001339700
30 AT3G54830   824648     <NA>    NM_115340
31 AT3G54830   824648     <NA> NP_001326240
32 AT3G54830   824648     <NA>    NP_191043
33 AT5G24105  2745995    AGP41    NM_203099
34 AT5G24105  2745995    AGP41    NP_974828
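
Because select() returns a 1:many mapping, the table above has one row per SYMBOL/REFSEQ combination. If one row per gene is more convenient, the output can be collapsed in base R. The following is only a minimal sketch: the data frame is a hand-copied subset of the output above, and the helper name collapse_refseq is hypothetical, not part of any package.

```r
# Subset of the select() output shown above, typed in by hand for illustration.
res <- data.frame(
  TAIR     = c("AT3G54830", "AT3G54830", "AT3G54830", "AT3G54830",
               "AT5G24105", "AT5G24105"),
  ENTREZID = c("824648", "824648", "824648", "824648",
               "2745995", "2745995"),
  REFSEQ   = c("NM_001339700", "NM_115340", "NP_001326240", "NP_191043",
               "NM_203099", "NP_974828"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep transcript accessions (NM_*), drop duplicates
# caused by other columns such as SYMBOL, and paste the accessions into a
# single semicolon-separated string per gene.
collapse_refseq <- function(df) {
  nm <- unique(df[grepl("^NM_", df$REFSEQ), c("TAIR", "ENTREZID", "REFSEQ")])
  aggregate(REFSEQ ~ TAIR + ENTREZID, data = nm,
            FUN = function(x) paste(x, collapse = ";"))
}

collapsed <- collapse_refseq(res)
print(collapsed)
```

If only a single value per gene is needed, mapIds() also accepts a multiVals argument (for example multiVals = "list") that controls how multiple matches are handled.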

Thanks for your reply!

A while ago I tested both ways (for different queries, related to GO terms) and Ensembl through biomaRt seemed to be more complete, but in this case it looks like the opposite.

I guess it depends on the query... I wish there was a unified database.

I might have to write a script that searches both ways and unifies the results.

written 4 weeks ago by eggrandio
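
One way the "search both ways and unify" idea could look is sketched below. This is an assumption-laden illustration rather than a tested pipeline: the helper name fill_missing_entrez is hypothetical, the biomaRt result is taken as the primary source, and the org.db mapping is used only to fill entries that came back as NA. The example data are two rows copied from the getBM() output in the question and the matching values from the mapIds() output in the answer, so the block runs in base R without either package.

```r
# Hypothetical helper: fill NA Entrez IDs in a biomaRt result from a named
# character vector keyed by gene ID (the shape that mapIds() returns).
fill_missing_entrez <- function(bm, fallback) {
  ids <- as.character(bm$entrezgene_id)
  missing <- is.na(ids)
  # Look up only the missing genes; genes absent from 'fallback' stay NA.
  ids[missing] <- unname(fallback[bm$ensembl_gene_id[missing]])
  bm$entrezgene_id <- ids
  bm
}

# Two rows copied from the getBM() output in the question.
bm <- data.frame(
  ensembl_gene_id = c("AT3G26830", "AT3G54830"),
  entrezgene_id   = c(822298L, NA),
  stringsAsFactors = FALSE
)

# Values copied from the mapIds() output in the answer.
fallback <- c(AT3G26830 = "822298", AT3G54830 = "824648")

filled <- fill_missing_entrez(bm, fallback)
print(filled)
```

In a real session, bm would be the data frame returned by getBM() and fallback the result of mapIds(org.At.tair.db, keys = unique(bm$ensembl_gene_id), column = 'ENTREZID', keytype = 'TAIR'). Which source to trust when both return a value but disagree is a separate decision that this sketch does not address.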