Question: Entrez IDs disappear when using Biomart with GRCh37 Genome version
luisa wrote (5 months ago):

Hi! I'm trying to get the location (chromosome and band) of a list of Entrez Gene IDs I got using the Homo.sapiens Bioconductor package:

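# subsetByOverlaps() returns the overlapping genes themselves (with their gene_id column),
# whereas findOverlaps() returns only index pairs
indx <- subsetByOverlaps(genes(TxDb.Hsapiens.UCSC.hg19.knownGene), mycoords.gr)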

Since my original data (mycoords.gr) are mapped to the GRCh37/hg19 genome version, I tried using BioMart to get the locations for that version of the genome:

ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")

my.symbols <- indx$gene_id

my.regions <- getBM(c("entrezgene","hgnc_symbol", "chromosome_name", "band"),
                    filters = "entrezgene",
                    values = my.symbols,
                    mart = ensembl)

I noticed, however, that some of the Entrez IDs on my list were not in "my.regions". When I tried the current version of the genome, those IDs were present but others were missing...

Is there a difference in Entrez IDs between assemblies? I also tried retrieving all of the Entrez IDs in ensembl and some of them were also missing...

mapping <- getBM(attributes = c("entrezgene", "hgnc_symbol"), mart = ensembl)

I don't understand this... Is there an alternative to this method?

Thanks in advance!

biomart assembly genome • 403 views
modified 5 months ago by Emily_Ensembl • written 5 months ago by luisa

Can you give some examples of IDs that were in the wrong locations or missing, please?

written 5 months ago by Emily_Ensembl

Yes, some of the missing ones were 100033416 (hg18), 10002 (hg19), and 100033416 in both.

written 5 months ago by luisa

I am able to detect the one that you have tagged as hg19:

ensembl <- useMart(biomart="ENSEMBL_MART_ENSEMBL", host="grch37.ensembl.org", path="/biomart/martservice", dataset="hsapiens_gene_ensembl")
# pass the mart object created above (it is named `ensembl`, not `mart`)
getBM(mart=ensembl, attributes=c("hgnc_symbol", "entrezgene"), filters="entrezgene", values=c("100033416","10002"), uniqueRows=TRUE)
  hgnc_symbol entrezgene
1       NR2E3      10002

written 5 months ago by Kevin Blighe

  hgnc_symbol entrezgene
1      SNHG14  100033416

This is what I get when I run the same code you wrote ...

written 5 months ago by luisa

We have a problem...

written 5 months ago by Kevin Blighe

Did you mean hg18/NCBI36 or did you mean GRCh38?

written 5 months ago by Emily_Ensembl

I think it is GRCh38

written 5 months ago by luisa

Someone will be along with a BioMart answer, but if you can post a few Entrez IDs we can see if an EntrezDirect solution is feasible.

written 5 months ago by genomax

Emily_Ensembl (EMBL-EBI) wrote (5 months ago):

BioMart provides mappings from Ensembl genes to external references; it does not provide direct mappings between non-Ensembl identifiers. This means that when you look up NCBI -> HGNC mappings, you're actually looking up NCBI -> Ensembl -> HGNC mappings.
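As a rough illustration of that indirect route, here is a minimal offline sketch in base R. The two tables, their IDs and symbols, and the helper `lookup_symbol()` are all made up for the example; they are not real BioMart data or part of the biomaRt API. The point is only that an NCBI ID with no Ensembl link drops out of the result, even when the symbol exists on the Ensembl side.

```r
# Hypothetical toy tables (placeholder IDs, not real annotation).
# Every external reference hangs off an Ensembl gene ID.
entrez_xref <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B"),
  entrezgene      = c("111", "222"),
  stringsAsFactors = FALSE
)
hgnc_xref <- data.frame(
  ensembl_gene_id = c("ENSG_A", "ENSG_B", "ENSG_C"),
  hgnc_symbol     = c("GENEA", "GENEB", "GENEC"),
  stringsAsFactors = FALSE
)

# NCBI -> HGNC is really NCBI -> Ensembl -> HGNC:
# filter on the NCBI ID, then inner-join on the Ensembl gene ID.
lookup_symbol <- function(ids, entrez_xref, hgnc_xref) {
  hits <- entrez_xref[entrez_xref$entrezgene %in% ids, , drop = FALSE]
  merged <- merge(hits, hgnc_xref, by = "ensembl_gene_id")
  merged[, c("entrezgene", "hgnc_symbol")]
}

# "333" has no Ensembl link in entrez_xref, so it returns no row,
# even though ENSG_C / GENEC exists in hgnc_xref.
res <- lookup_symbol(c("111", "333"), entrez_xref, hgnc_xref)
print(res)
```

In a real query, adding "ensembl_gene_id" to the requested attributes should make the intermediate step visible in the output.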

NCBI 10002 does not map to the Ensembl gene ENSG00000031544 in GRCh37 because they have different biotypes: in Ensembl the gene is non-coding. It's non-coding in Ensembl on GRCh37 because Ensembl annotation is based on the genome, so the gene sequences have to match the reference genome. NCBI does not have this constraint in its annotation. Because the GRCh37 assembly is flawed here, there is no ORF and the gene could only be annotated in Ensembl as non-coding. If you compare the genomic region in GRCh37 to that in GRCh38, you'll see that a number of small contigs (the track with alternating shades of blue) have been introduced in GRCh38. These have fixed the underlying genome, so the gene is now listed as coding and NCBI 10002 is listed as an external reference. This is why we recommend always using the most up-to-date genome assembly.

100033416 does appear in GRCh37 but not in GRCh38. It seems to be one of a load of snoRNAs mapped to ENSG00000224078, which all seem to be small RNAs overlapping a much larger one. None of these are present in GRCh38, which is an improvement, I think. Again, this is why we recommend using the most up-to-date data. It looks like the correct match should be to ENSG00000275529, so I'll feed that back to our developers.

Mapping between databases, especially for short sequences in repetitive regions, is quite a difficult problem. We are working with NCBI at the moment to improve our mapping with them, and hopefully this will improve in future.
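Since some mappings can be absent on a given assembly, it may help to list exactly which query IDs returned no row before comparing assemblies. The sketch below is one way to do that in base R. `find_unmapped()` is a hypothetical helper, and `result` is a toy stand-in for a data frame returned by getBM(), with placeholder IDs. It assumes the query IDs are character or integer.

```r
# Hypothetical helper: which query IDs are absent from a getBM()-style result?
find_unmapped <- function(query_ids, result, id_col = "entrezgene") {
  # Compare as character so integer and string IDs match each other.
  setdiff(unique(as.character(query_ids)), as.character(result[[id_col]]))
}

# Toy stand-ins (placeholder IDs, not real annotation).
query_ids <- c("111", "222", "333")
result <- data.frame(
  entrezgene  = c(111L, 333L),
  hgnc_symbol = c("GENEA", "GENEC"),
  stringsAsFactors = FALSE
)

unmapped <- find_unmapped(query_ids, result)
print(unmapped)
```

Running the same check against the GRCh37 and GRCh38 results would show which IDs are lost on each assembly.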

written 5 months ago by Emily_Ensembl

I will use GRCh38 since it is the latest version. Thank you very much for your explanation!

written 5 months ago by luisa