Question

Why does a biomaRt query result in low annotation coverage?

0

3.7 years ago

xiaoyonf ▴ 60

Hi, I tried to use biomaRt in R to convert affy_hugene_1_0_st_v1 probe set IDs to gene symbols, using the following lines:

library(biomaRt)
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
symbol <- getBM(attributes = c("affy_hugene_1_0_st_v1", "hgnc_symbol", "ensembl_gene_id"),
  filters = "affy_hugene_1_0_st_v1", values = rownames(exprs(gset)), mart = mart)

The number of probes is 257430, but the number of annotated genes is only 3881 (< 2% coverage).

I would appreciate it if anyone could help me out!

Many thanks,
Xiaoyong

microarray biomart • 1.3k views

ADD COMMENT • link updated 3.7 years ago by Ram 43k • written 3.7 years ago by xiaoyonf ▴ 60

score 1 · Answer 1 · 2020-08-20

1

3.7 years ago

Kevin Blighe 87k

You have the HuGene 1.0 ST array, but how has the data been processed? 257430 is a very large number, which tells me that the data is still at the probeset level.

How do the following annotation methods fare in comparison to biomaRt?

require(hugene10sttranscriptcluster.db)
annotLookup <- select(hugene10sttranscriptcluster.db, keys = rownames(exprs(gset)),
  columns = c('PROBEID', 'ENSEMBL', 'SYMBOL'))

require(hugene10stprobeset.db)
annotLookup <- select(hugene10stprobeset.db, keys = rownames(exprs(gset)),
  columns = c('PROBEID', 'ENSEMBL', 'SYMBOL'))

ADD COMMENT • link 3.7 years ago by Kevin Blighe 87k
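
To compare annotation sources as suggested in the answer above, one rough option is to compute the fraction of probe IDs that receive at least one non-empty gene symbol. The sketch below uses base R on a toy lookup table shaped like the output of select(); the probe IDs are borrowed from the range mentioned in this thread, while the gene names and the table contents are made up for illustration. It assumes missing symbols appear as NA or as an empty string.

```r
# Toy probe IDs (hypothetical subset) and a lookup table shaped like select() output.
probe_ids <- c("7892501", "7892502", "7892503", "7892504")
annotLookup <- data.frame(
  PROBEID = c("7892501", "7892502", "7892502", "7892503", "7892504"),
  SYMBOL  = c("GENE1", "GENE2", "GENE3", NA, ""),  # made-up symbols
  stringsAsFactors = FALSE
)

# A probe counts as annotated if at least one of its rows has a usable symbol.
has_symbol <- !is.na(annotLookup$SYMBOL) & annotLookup$SYMBOL != ""
annotated_ids <- unique(annotLookup$PROBEID[has_symbol])

# Coverage: fraction of input probe IDs with at least one symbol.
coverage <- mean(probe_ids %in% annotated_ids)
print(coverage)
```

With real data, probe_ids would be rownames(exprs(gset)), and the same calculation could be repeated for each lookup table to compare them on equal terms.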

0

Hi Kevin,

Thank you for your prompt reply. I downloaded this dataset from GEO by (as suggested in one of your prior posts):

gset <- getGEO("GSE49124", GSEMatrix = TRUE, getGPL = FALSE)
if (length(gset) > 1) idx <- grep("GPL10739", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

Yes, the data is still at the probe set level, with probe IDs ranging from 7892501 to 8180418. Could you please explain why the biomaRt query did not work? I will try the methods you suggested, but I am having trouble installing hugene10sttranscriptcluster.db in R 4.0.0. I will post an update if this works. Thanks!

ADD REPLY • link 3.7 years ago by xiaoyonf ▴ 60

0

I have had similar issues with this array in the past, in terms of annotation. If I recall correctly, there are both probeset and transcript cluster IDs, but the way in which they are assigned makes it difficult. This said, I have never had issues when annotating via hugene10sttranscriptcluster.db or hugene10stprobeset.db - these are manually-curated database packages by James (Bioconductor).

ADD REPLY • link 3.7 years ago by Kevin Blighe 87k

0

Hi Kevin,

An update on the annotation methods you suggested: I used hugene10stprobeset.db to annotate this array's probes and got almost 100% coverage. I noticed that many probes map to the same gene, and that some probes map to several genes, which I think is because this array is an exon array. For the subsequent DGE analysis, I used the mean value of all probes assigned to each gene. Do you think this is OK? Oddly, the hugene10sttranscriptcluster.db annotation mapped only very few genes (~200), and I don't know why. Thanks again for your answer!

ADD REPLY • link 3.7 years ago by xiaoyonf ▴ 60
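
The per-gene averaging described in the last reply can be sketched in base R as follows. This is a minimal illustration on a made-up matrix, not the poster's actual code: the probe names, gene names, and values are hypothetical, and it assumes each probe is assigned to a single gene (match() keeps only the first symbol when a probe maps to several). Probes without a symbol are dropped before averaging.

```r
# Toy expression matrix: rows are probes, columns are samples (values are made up).
expr <- matrix(c(1, 3, 5, 7, 2, 4, 6, 8), nrow = 4,
               dimnames = list(c("p1", "p2", "p3", "p4"), c("s1", "s2")))

# Toy annotation table: p1 and p2 map to gene A, p3 to gene B, p4 is unannotated.
annot <- data.frame(PROBEID = c("p1", "p2", "p3", "p4"),
                    SYMBOL  = c("A", "A", "B", NA),
                    stringsAsFactors = FALSE)

# Look up one symbol per row of the matrix and drop unannotated probes.
symbols <- annot$SYMBOL[match(rownames(expr), annot$PROBEID)]
keep <- !is.na(symbols)

# Sum probes within each gene, then divide by the probe count to get the mean.
sums <- rowsum(expr[keep, , drop = FALSE], group = symbols[keep])
counts <- as.vector(table(symbols[keep])[rownames(sums)])
gene_expr <- sums / counts
print(gene_expr)
```

The limma package offers avereps() for the same kind of averaging over rows that share an ID, which may be more convenient when limma is already used for the differential expression step.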