Why does a biomaRt query result in low annotation coverage?
7 months ago
xiaoyonf ▴ 40

Hi, I tried to use biomaRt in R to convert affy_hugene_1_0_st_v1 probe set IDs to gene symbols, using the following lines:

library(biomaRt)
mart <- useMart("ENSEMBL_MART_ENSEMBL")
mart <- useDataset("hsapiens_gene_ensembl", mart)
symbol <- getBM(
  attributes = c("affy_hugene_1_0_st_v1", "hgnc_symbol", "ensembl_gene_id"),
  filters = "affy_hugene_1_0_st_v1", values = rownames(exprs(gset)), mart = mart)

There are 257430 probe sets, but only 3881 genes were annotated (< 2% coverage).

I would appreciate it if anyone could help me out!

Many thanks,
Xiaoyong
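
A minimal sketch of how the coverage figure above could be measured, in case it helps to compare annotation sources on equal terms. It assumes the annotation result is a data frame with one ID column and one symbol column, as `getBM()` returns; the helper name `annotation_coverage` and the toy IDs and symbols are made up for illustration. It counts unique queried IDs that received a non-empty symbol, because a single probe set can appear in several result rows and unmatched IDs are simply absent from the result.

```r
# Hypothetical helper: fraction of queried IDs that received a non-empty symbol.
annotation_coverage <- function(query_ids, annot, id_col, symbol_col) {
  symbols <- annot[[symbol_col]]
  has_symbol <- !is.na(symbols) & symbols != ""
  # A probe set may occur in several rows, so count each ID only once.
  annotated_ids <- unique(as.character(annot[[id_col]][has_symbol]))
  mean(as.character(query_ids) %in% annotated_ids)
}

# Toy data standing in for rownames(exprs(gset)) and the getBM() result.
query_ids <- c("7892501", "7892502", "7892503", "7892504")
toy_annot <- data.frame(
  affy_hugene_1_0_st_v1 = c("7892501", "7892501", "7892503"),
  hgnc_symbol = c("GENE_A", "GENE_B", ""),
  stringsAsFactors = FALSE
)

toy_coverage <- annotation_coverage(
  query_ids, toy_annot, "affy_hugene_1_0_st_v1", "hgnc_symbol"
)
print(toy_coverage)
```

With a real result, the same call would take `rownames(exprs(gset))` and the `symbol` data frame from the query above.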

microarray biomart • 287 views
7 months ago

You have the HuGene 1.0 ST, but how has the data been processed? 257430 is a very large number, which tells me that the data is still at the probeset level.

How do the following annotation approaches fare in comparison to biomaRt?

require(hugene10sttranscriptcluster.db)
annotLookup <- select(hugene10sttranscriptcluster.db, keys = rownames(exprs(gset)),
  columns = c('PROBEID', 'ENSEMBL', 'SYMBOL'))

require(hugene10stprobeset.db)
annotLookup <- select(hugene10stprobeset.db, keys = rownames(exprs(gset)),
  columns = c('PROBEID', 'ENSEMBL', 'SYMBOL'))

Hi Kevin,

Thank you for your prompt reply. I downloaded this dataset from GEO (as suggested in one of your prior posts) with:

library(GEOquery)
gset <- getGEO("GSE49124", GSEMatrix = TRUE, getGPL = FALSE)
if (length(gset) > 1) idx <- grep("GPL10739", attr(gset, "names")) else idx <- 1

gset <- gset[[idx]]

Yes, the data is still at the probe set level, with probe IDs ranging from 7892501 to 8180418. Could you please explain why the biomaRt query did not work? I will try the methods you suggested, but I am having trouble installing hugene10sttranscriptcluster.db in R 4.0.0. I will post an update if this works. Thanks!
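
For the installation problem mentioned above, a hedged sketch: Bioconductor annotation packages are normally installed through `BiocManager::install()` rather than `install.packages()`. The helper name `missing_packages` is made up for illustration, and the `interactive()` guard is an assumption added here so that nothing is installed from a non-interactive script.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(
  c("hugene10sttranscriptcluster.db", "hugene10stprobeset.db")
)

# Only install when run interactively (an assumption of this sketch).
if (length(to_install) > 0 && interactive()) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(to_install)
}
```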


I have had similar annotation issues with this array in the past. If I recall correctly, there are both probeset and transcript cluster IDs, but the way in which they are assigned makes annotation difficult. That said, I have never had issues when annotating via hugene10sttranscriptcluster.db or hugene10stprobeset.db; these are manually curated database packages by James (Bioconductor).
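
One way to see which ID type a matrix carries is to check how many of its row names occur among the keys of each annotation source. The sketch below uses made-up ID vectors and a hypothetical helper, `id_overlap`; with the real packages loaded, the reference IDs could instead come from `keys(hugene10stprobeset.db, keytype = "PROBEID")` and the transcript cluster equivalent.

```r
# Hypothetical helper: fraction of `ids` found in `reference_ids`.
id_overlap <- function(ids, reference_ids) {
  mean(as.character(ids) %in% as.character(reference_ids))
}

# Toy vectors standing in for rownames(exprs(gset)) and the package keys.
array_ids       <- c("7892501", "7892502", "7892503", "7892504")
transcript_keys <- c("7892501", "7896736")
probeset_keys   <- c("7892501", "7892502", "7892503", "7892504", "7892505")

overlap <- c(
  transcriptcluster = id_overlap(array_ids, transcript_keys),
  probeset          = id_overlap(array_ids, probeset_keys)
)
print(overlap)
```

The source with the higher overlap is the likelier match for the IDs in the matrix.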


Hi Kevin,

An update on the annotation approaches you suggested: I used hugene10stprobeset.db to annotate this array's probes and got almost 100% coverage. I noticed that many of the probes map to the same gene and vice versa, which I think is due to this array being an exon array. For the subsequent DGE analysis, I used the mean value of all probes assigned to each gene. Do you think that is OK? Oddly, the hugene10sttranscriptcluster.db annotation mapped only very few genes (~200), and I don't know why. Thanks again for your answer!
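
The per-gene averaging described above can be sketched in base R as follows. The expression matrix and the probe-to-gene table are toy data, and two choices are assumptions of this sketch rather than anything stated in the thread: probes without a symbol are dropped, and so are probes that map to more than one gene.

```r
# Toy expression matrix: 5 probes x 2 samples (made-up values).
expr <- matrix(
  c(1, 3, 5, 7, 9,
    2, 4, 6, 8, 10),
  nrow = 5,
  dimnames = list(c("p1", "p2", "p3", "p4", "p5"), c("s1", "s2"))
)

# Toy lookup in the shape select() returns: one row per probe-gene pair.
probe2gene <- data.frame(
  PROBEID = c("p1", "p2", "p3", "p4", "p4", "p5"),
  SYMBOL  = c("GENE_A", "GENE_A", "GENE_B", "GENE_B", "GENE_C", NA),
  stringsAsFactors = FALSE
)

# Drop probes without a symbol, then probes mapping to several genes.
probe2gene <- probe2gene[!is.na(probe2gene$SYMBOL), ]
multi <- unique(probe2gene$PROBEID[duplicated(probe2gene$PROBEID)])
probe2gene <- probe2gene[!probe2gene$PROBEID %in% multi, ]

# Assign each matrix row its gene (NA if the probe was dropped).
genes <- probe2gene$SYMBOL[match(rownames(expr), probe2gene$PROBEID)]
keep <- !is.na(genes)

# Per-gene mean = per-gene sum divided by the number of contributing probes.
sums <- rowsum(expr[keep, , drop = FALSE], group = genes[keep])
counts <- as.vector(table(genes[keep])[rownames(sums)])
gene_expr <- sums / counts
print(gene_expr)
```

If limma is available, `avereps()` performs a similar averaging over rows that share an ID.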
