There are a few things that might be going on, and it's hard to tell exactly without some examples of the missing or duplicated gene IDs. Here are some ideas, though.
BioMart will silently drop any element of values that isn't found in Ensembl. There's no error or warning, you just don't get a hit. That's easy to see with a single value, harder to spot in 23,000:
## query not found in Ensembl
getBM(values = c("ENSG_NOT_REAL"),
      filters = "ensembl_gene_id",
      attributes = c("ensembl_gene_id", "hgnc_symbol"),
      mart = mart)
#> [1] ensembl_gene_id hgnc_symbol
#> <0 rows> (or 0-length row.names)
You can try to identify which input values aren't returned in the results with something like:
genes[ !genes %in% G_list$ensembl_gene_id ]
If that finds something, I'd search the Ensembl website manually with a few of the IDs and try to understand why they might be missing from BioMart, e.g. they might be from an old Ensembl version and have since been retired. There are probably many possible reasons.
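If it helps, here is a minimal sketch of that check on toy data, so it runs without querying BioMart. The objects genes and G_list are small stand-ins for yours, and every ID apart from ENSG00000010404 is made up for illustration.

```r
## toy stand-ins for the real input vector and getBM() result (assumed names)
genes  <- c("ENSG00000010404", "ENSG_NOT_REAL", "ENSG_ALSO_FAKE")
G_list <- data.frame(ensembl_gene_id = "ENSG00000010404",
                     hgnc_symbol     = "IDS",
                     stringsAsFactors = FALSE)

## input IDs that never came back from the query
missing_ids <- setdiff(genes, G_list$ensembl_gene_id)
missing_ids

## how many were lost in total
length(missing_ids)
```

setdiff() also removes repeats, so each missing ID is listed once even if it appears several times in the input.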
For completeness, I'll also point out that Ensembl BioMart will ignore duplicate entries in the values argument, e.g.
## duplicated input values
getBM(values = c("ENSG00000010404", "ENSG00000010404"),
      filters = "ensembl_gene_id",
      attributes = c("ensembl_gene_id", "hgnc_symbol"),
      mart = mart)
#> ensembl_gene_id hgnc_symbol
#> 1 ENSG00000010404 IDS
However, it looks like you've already checked that this isn't the case in your data.
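In case anyone else wants to run the same check, a rough sketch in base R might look like this. The genes vector here is a toy example, not your data.

```r
## toy input with one repeated ID (illustrative only)
genes <- c("ENSG00000010404", "ENSG00000010404", "ENSG00000277796")

## TRUE if any ID appears more than once
has_dups <- anyDuplicated(genes) > 0

## which IDs are repeated, each listed once
dup_ids <- unique(genes[duplicated(genes)])

## number of distinct IDs, i.e. the most rows you could expect
## back if every ID mapped exactly once
n_unique <- length(unique(genes))
```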
Regarding the duplicated entries in the results, this can occur if there is a one-to-many mapping between the two ID types you're trying to map, e.g.
## one-to-many mapping
getBM(values = "ENSG00000277796",
      filters = "ensembl_gene_id",
      attributes = c("ensembl_gene_id", "hgnc_symbol"),
      mart = mart)
#> ensembl_gene_id hgnc_symbol
#> 1 ENSG00000277796 CCL3L3
#> 2 ENSG00000277796 CCL3L1
Mapping between IDs from different organisations is never perfect, and it's pretty common to see instances like this, where a single Ensembl ID maps to two HGNC symbols (or vice versa). You could try to identify the duplicated entries with:
G_list[ duplicated(G_list$ensembl_gene_id) | duplicated(G_list$ensembl_gene_id, fromLast = TRUE), ]
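Once you've found them, what to do about it depends on your analysis. As a sketch only, here are two common ways to get back to one row per Ensembl ID, shown on a toy G_list built from the examples above. Neither is the "right" answer, and the ";" separator is just an arbitrary choice.

```r
## toy result containing the one-to-many case shown above
G_list <- data.frame(
  ensembl_gene_id = c("ENSG00000010404", "ENSG00000277796", "ENSG00000277796"),
  hgnc_symbol     = c("IDS", "CCL3L3", "CCL3L1"),
  stringsAsFactors = FALSE
)

## option 1: keep every symbol, pasted into a single string per gene
collapsed <- aggregate(hgnc_symbol ~ ensembl_gene_id, data = G_list,
                       FUN = function(x) paste(x, collapse = ";"))

## option 2: keep only the first row seen for each gene
first_only <- G_list[!duplicated(G_list$ensembl_gene_id), ]
```

Option 1 loses nothing but gives you compound symbols to deal with downstream. Option 2 is simpler, but which symbol survives depends on the row order BioMart happened to return.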