Question: Annotation Of Probeset-Level Data Of Affymetrix Hugene-1_0-St-V1 Chip
6.0 years ago by
munch300 wrote:

I am having trouble annotating the probeset-level data from this particular chip: HuGene-1_0-st-v1. The AffyIDs range as follows: 7892501, 7892502, 7892503 ... 8180413, 8180415, 8180417, 8180418

Here are my unsuccessful attempts:

(1) Using biomaRt with the getBM function. With this approach, 38% of the ~33200 probesets can be annotated.

library(biomaRt)

# Replace the AffyID with the gene symbol
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL",host = "", path = "/biomart/martservice", dataset = "hsapiens_gene_ensembl")

hgnc <- getBM(attributes = c("affy_hugene_1_0_st_v1", "hgnc_symbol","ensembl_gene_id","entrezgene","chromosome_name","start_position","end_position","band"), filters = "affy_hugene_1_0_st_v1", values=tab$ID, mart = mart)

# Now match the array data probesets with the genes data frame
m <- match(as.numeric(tab$ID), hgnc$affy_hugene_1_0_st_v1)
# And append e.g. the HGNC symbol to the array data frame
tab$hgnc <- hgnc[m, "hgnc_symbol"]
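
To see where the coverage figure comes from, the matching step can be reproduced on a small made-up table. This is only a sketch: `tab` and `hgnc` below are toy stand-ins for the real array table and the getBM() result, and the symbols are invented. It also assumes that unannotated rows may appear either as NA or as an empty string.

```r
# Toy stand-ins for the array table and a getBM() result (values are made up)
tab <- data.frame(ID = c("7892501", "7892502", "7896739", "7896741"),
                  stringsAsFactors = FALSE)
hgnc <- data.frame(
  affy_hugene_1_0_st_v1 = c(7896739, 7896741, 7896741),
  hgnc_symbol           = c("GENE_A", "GENE_B", "GENE_C"),
  stringsAsFactors      = FALSE
)

# match() returns only the first hit, so the second symbol for 7896741 is dropped
m <- match(as.numeric(tab$ID), hgnc$affy_hugene_1_0_st_v1)
tab$hgnc <- hgnc[m, "hgnc_symbol"]

# Treat both NA (no row returned) and "" (row without a symbol) as unannotated
annotated <- !is.na(tab$hgnc) & tab$hgnc != ""
coverage  <- 100 * mean(annotated)

# Probeset IDs that map to more than one symbol in the query result
multi <- names(which(table(hgnc$affy_hugene_1_0_st_v1) > 1))
```

Checking `multi` is worthwhile because match() silently keeps one symbol per ID, which can make two annotation sources look inconsistent.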

(2) Using the NetAffx annotation file from the Affymetrix Support section [1]. When I compare the ProbeIDs from the first line of the file with the ~33200 ProbeIDs from the experiment, the overlap is only 13%. The AffyIDs in the file start with the values 7896739, 7896741, 7896743 ...

(3) Using getSYMBOL(head(fit$genes$ID), "hugene10sttranscriptcluster.db") with library(annotate) and library(hugene10sttranscriptcluster.db). With this approach, 32% can be annotated, but the annotation does not seem to be consistent with (1).

(4) Same as (3), but with the library hugene10stprobeset.db instead of hugene10sttranscriptcluster.db. Only 0.4% can be annotated, because hugene10stprobeset.db is meant for exon-level annotation.


My question: Is there a way to annotate 100% of the AffyIDs with a gene symbol? And where can I find the annotation information for this?

Thank you in advance for your efforts!

annotation biomart affymetrix • 9.1k views
6.0 years ago by
munch300 wrote:

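I don't know why, but with this approach I can successfully annotate 65.8% of the dataset.

library(annotate)                        # provides lookUp()
library(hugene10sttranscriptcluster.db)  # annotation package named in annodb

annodb <- "hugene10sttranscriptcluster.db"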
ID     <- featureNames(eset)
Symbol <- as.character(lookUp(ID, annodb, "SYMBOL"))
Name   <- as.character(lookUp(ID, annodb, "GENENAME"))
Entrez <- as.character(lookUp(ID, annodb, "ENTREZID"))

11363 AffyIDs are left without annotation. Is this normal?

> length(which(Name=="NA"))
[1] 11363
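
As a side note, counting the string "NA" works here only because as.character() turns the missing entries of the lookUp() list into the literal text "NA". A slightly more robust count can be sketched with is.na(). The list below is a made-up stand-in for the lookUp() result, and the IDs and symbols are purely illustrative.

```r
# Toy stand-in for the list that lookUp() returns (IDs and symbols are made up)
sym_list <- list("7892501" = NA,
                 "7896739" = "GENE_A",
                 "7896741" = "GENE_B",
                 "7892502" = NA)

# Approach from the post: coerce to character and count the literal string "NA"
Symbol   <- as.character(sym_list)
n_string <- length(which(Symbol == "NA"))

# Alternative: keep real missing values and count them with is.na().
# Taking x[1] guards against IDs that map to more than one symbol.
Symbol_na <- vapply(sym_list, function(x) as.character(x[1]), character(1))
n_missing <- sum(is.na(Symbol_na))

# Percentage of IDs that did receive a symbol
pct_annotated <- 100 * mean(!is.na(Symbol_na))
```

Both counts agree on this toy input. The second form avoids relying on string coercion.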
5.9 years ago by
University of Southampton
michaelsbreen160 wrote:


A large proportion of these 11363 IDs without annotation are control probes (~4000 of them).

The remaining ~6000 IDs without annotation are probes with no Entrez ID annotation. This happens because the probes are poorly designed.

Starting from the 32K ProbeIDs, it is common to filter down to roughly 21K IDs that correspond to known transcripts.
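
The filtering step described above can be sketched in base R as follows. The expression matrix and the Entrez IDs are made-up toy values. With real data, `Entrez` would be the vector built with lookUp() earlier in the thread, named by probe ID.

```r
# Toy expression matrix: rows are transcript cluster IDs, columns are samples
expr <- matrix(1:8, nrow = 4,
               dimnames = list(c("7892501", "7896739", "7896741", "7892502"),
                               c("s1", "s2")))

# Toy Entrez annotation, named by ID; NA marks controls and unannotated IDs
Entrez <- c("7892501" = NA, "7896739" = "1001",
            "7896741" = "1002", "7892502" = NA)

# Keep only the rows that have an Entrez ID
keep           <- !is.na(Entrez[rownames(expr)])
expr_annotated <- expr[keep, , drop = FALSE]
```

Indexing `Entrez` by `rownames(expr)` keeps the annotation aligned with the matrix even if the two are stored in different orders.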


5.9 years ago by
United States
lkmklsmn870 wrote:

Hi, I am also having some trouble with the annotation of the Affymetrix HuGene ST 1.0 array. From what I understand, there is a difference between Probesets and TranscriptClusters. A Probeset is just a collection of probes, mainly designed to cover a specific exon, while a TranscriptCluster is a collection of Probesets. That is why the annotation file 'HuGene-1_0-st-v1.na33.2.hg19.probeset.csv' contains about 250k entries, while the annotation file 'HuGene-1_0-st-v1.na33.2.hg19.transcript.csv' contains the 33297 entries matching the dimensions of the AffyBatch object. Fortunately, the 7-digit identifiers (e.g. 7991762) distinguish between the two, so that each 7-digit identifier maps uniquely to either a Probeset or a TranscriptCluster.

The file 'HuGene-1_0-st-v1.na33.2.hg19.transcript.csv' contains a column titled 'gene_assignment', which holds information about the gene each TranscriptCluster is supposed to cover. Parsing this file will allow you to match about 2/3 of the TranscriptCluster IDs to gene symbols. This matching agrees to a very high extent with the annotation obtained from the Bioconductor R package 'hugene10sttranscriptcluster.db'.

Now here is my question: in this 'HuGene-1_0-st-v1.na33.2.hg19.transcript.csv' file, some symbols are annotated to more than one TranscriptCluster. This in itself is not surprising, since the TranscriptClusters could map to different isoforms. However, these TranscriptClusters annotated to the same gene symbol are also annotated to different chromosomes! How can one gene symbol map to multiple chromosomes?
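
For illustration, here is a rough sketch of the parsing step and of how symbols on more than one chromosome can be listed. It rests on several assumptions that should be checked against the actual file: that the columns are named transcript_cluster_id, seqname and gene_assignment, that entries within gene_assignment are separated by " /// " and fields by " // ", that the symbol is the second field, and that "---" marks an empty value. The rows and the helper `first_symbol` are invented for the example.

```r
# Toy rows imitating the transcript annotation file (all values are made up).
# With the real file, something like read.csv(path, comment.char = "#") may be
# needed to skip header comment lines; that is an assumption as well.
annot <- data.frame(
  transcript_cluster_id = c("7896739", "7896741", "7896743", "7892501"),
  seqname               = c("chr1", "chr1", "chrX", "---"),
  gene_assignment       = c(
    "NM_000001 // GENE_A // toy gene A // 1p36 // 1001 /// ENST000001 // GENE_A // toy gene A // 1p36 // 1001",
    "NM_000002 // GENE_B // toy gene B // 1p35 // 1002",
    "NM_000003 // GENE_B // toy gene B // Xq28 // 1002",
    "---"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: symbol of the first assignment, or NA if there is none
first_symbol <- function(x) {
  if (is.na(x) || x == "---") return(NA_character_)
  first_entry <- strsplit(x, " /// ", fixed = TRUE)[[1]][1]
  fields      <- strsplit(first_entry, " // ", fixed = TRUE)[[1]]
  if (length(fields) < 2) return(NA_character_)
  fields[2]
}

annot$symbol <- vapply(annot$gene_assignment, first_symbol,
                       character(1), USE.NAMES = FALSE)

# Number of distinct chromosomes per symbol (rows without a symbol are dropped)
chrom_by_symbol <- tapply(annot$seqname, annot$symbol,
                          function(s) length(unique(s)))
multi_chrom <- names(chrom_by_symbol)[chrom_by_symbol > 1]
```

Listing `multi_chrom` makes it easy to inspect the affected symbols one by one. Note that this sketch uses only the first assignment of each TranscriptCluster, so clusters with several different assigned genes are simplified.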
