Question

Annotation Of Probeset-Level Data Of Affymetrix Hugene-1_0-St-V1 Chip

5


11.0 years ago

munch ▴ 310

I am having trouble annotating the probeset-level data on this particular chip, HuGene-1_0-st-v1. The AffyIDs range from 7892501, 7892502, 7892503 ... to 8180413, 8180415, 8180417, 8180418.

Here are my unsuccessful attempts:

(1) Using biomaRt with the getBM function. With this approach, only 38% of the ~33200 probesets can be annotated.

# replace the affyID with gene symbol
library(biomaRt)
mart <- useMart(biomart = "ENSEMBL_MART_ENSEMBL", host = "www.ensembl.org", path = "/biomart/martservice", dataset = "hsapiens_gene_ensembl")

hgnc <- getBM(attributes = c("affy_hugene_1_0_st_v1", "hgnc_symbol","ensembl_gene_id","entrezgene","chromosome_name","start_position","end_position","band"), filters = "affy_hugene_1_0_st_v1", values=tab$ID, mart = mart)

# Now match the array data probesets with the genes data frame
m <- match(as.numeric(tab$ID), hgnc$affy_hugene_1_0_st_v1)
# And append e.g. the HGNC symbol to the array data frame
tab$hgnc <- hgnc[m, "hgnc_symbol"]
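One thing to watch in the matching step above: `match()` keeps only the first hit per ID, and an empty symbol string would count as "annotated". The following is a minimal base-R sketch of a more careful merge, using toy stand-ins for `tab` and for the `getBM()` result. The IDs and the symbols `GENE_A`, `GENE_B`, `GENE_C` are made up for illustration.

```r
# Toy stand-ins (hypothetical values) for the array table and a getBM() result
tab <- data.frame(ID = c("7892501", "7896739", "7896741", "8180418"),
                  stringsAsFactors = FALSE)
hgnc <- data.frame(affy_hugene_1_0_st_v1 = c(7896739, 7896741, 7896741),
                   hgnc_symbol = c("GENE_A", "GENE_B", "GENE_C"),
                   stringsAsFactors = FALSE)

# Treat empty symbols as missing
hgnc$hgnc_symbol[hgnc$hgnc_symbol == ""] <- NA

# Collapse all symbols reported for the same ID into one string
sym_by_id <- vapply(split(hgnc$hgnc_symbol, hgnc$affy_hugene_1_0_st_v1),
                    function(s) paste(unique(s[!is.na(s)]), collapse = ";"),
                    character(1))
sym_by_id[sym_by_id == ""] <- NA

# Look up by character ID; IDs absent from the result become NA
tab$hgnc <- unname(sym_by_id[tab$ID])

# Fraction of IDs that received at least one symbol
annotated_fraction <- mean(!is.na(tab$hgnc))
print(tab)
print(annotated_fraction)
```

Computing the fraction this way makes the percentages from the different approaches directly comparable.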

(2) Using the NetAffx annotation file from the Affymetrix Support section [1]. When I compare the ProbeIDs from the first line of the file with the ~33200 ProbeIDs from the experiment, the overlap is only 13%. The AffyIDs in that file start with the values 7896739, 7896741, 7896743 ....

(3) Using getSYMBOL(head(fit$genes$ID), "hugene10sttranscriptcluster.db") after loading library(annotate) and library(hugene10sttranscriptcluster.db). With this approach 32% can be annotated, but the annotation does not seem to be consistent with (1).

(4) Same as (3), but with the library hugene10stprobeset.db instead of hugene10sttranscriptcluster.db. Only 0.4% can be annotated, because hugene10stprobeset.db is meant for exon-level annotation.

[1] http://www.affymetrix.com/Auth/analysis/downloads/na33/wtgene-32_2/HuGene-1_0-st-v1.na33.2.hg19.probeset.csv.zip
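A quick way to check which annotation source actually shares an ID space with the experiment is to compute the overlap against each candidate before attempting any symbol lookup. This is a minimal sketch with made-up ID vectors. In practice `array_ids` would come from the expression data, and the two reference vectors from the ID column of each annotation file (those variable names are assumptions, not part of any package).

```r
# Hypothetical ID vectors for illustration only
array_ids      <- c("7892501", "7892502", "7896739", "7896741", "8180418")
probeset_ids   <- c("7896739", "7896740", "7896742")
transcript_ids <- c("7892501", "7892502", "7896739", "7896741", "8180418")

# Fraction of the experiment's IDs that are present in a reference ID set
overlap <- function(ids, ref) mean(ids %in% ref)

res <- c(probeset   = overlap(array_ids, probeset_ids),
         transcript = overlap(array_ids, transcript_ids))

# Report as percentages
print(round(100 * res, 1))
```

A low overlap, as in (2) and (4), suggests the IDs in the data are at a different level (for example transcript cluster rather than probeset) than the annotation source.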

My question: Is there a way to annotate 100% of the AffyIDs with a gene symbol? And where can I find the annotation information for this?

Thank you in advance for your efforts!

affymetrix annotation biomart • 14k views

updated 10.9 years ago by lkmklsmn ▴ 970 • written 11.0 years ago by munch ▴ 310

score 7 · Answer 1 · 2013-04-22

I don't know why, but with this approach I can successfully annotate 65.8% of the dataset.

library(annotate)
library(hugene10sttranscriptcluster.db)
annodb <- "hugene10sttranscriptcluster.db"
ID     <- featureNames(eset)
Symbol <- as.character(lookUp(ID, annodb, "SYMBOL"))
Name   <- as.character(lookUp(ID, annodb, "GENENAME"))
Entrez <- as.character(lookUp(ID, annodb, "ENTREZID"))

11363 AffyIDs are left without annotation. Is this normal?

> length(which(Name=="NA"))
[1] 11363

score 2 · Answer 2 · 2013-05-20

Hi,

A large proportion of these 11363 IDs without annotation are control probes (~4000 of them).

The remaining ~6000 probes without annotation are probes with no Entrez ID annotations. This happens because the probes are designed poorly.

It is common to filter the 32K ProbeIDs down to the roughly 21K that map to known transcripts.
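The filtering step described above can be sketched in base R as follows. The annotation table, its `Entrez` column, and the expression matrix are toy stand-ins with made-up values. Note that `as.character()` applied to a `lookUp()` result yields the string "NA" rather than a true missing value, so the sketch checks for both.

```r
# Toy annotation table (hypothetical values); "NA" as a string mimics
# what as.character(lookUp(...)) produces for unannotated IDs
anno <- data.frame(ID     = c("7892501", "7892502", "7896739", "7896741"),
                   Entrez = c(NA, "NA", "1001", "1002"),
                   stringsAsFactors = FALSE)

# Toy expression matrix with one row per ID and two samples
expr <- matrix(seq_len(8), nrow = 4,
               dimnames = list(anno$ID, c("s1", "s2")))

# Keep only rows that have a real Entrez ID
keep <- !is.na(anno$Entrez) & anno$Entrez != "NA"
expr_filtered <- expr[anno$ID[keep], , drop = FALSE]

print(dim(expr_filtered))
```

Whether control and unannotated probes should be dropped before or after normalization depends on the analysis, so treat this only as the mechanical part of the filter.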

Michael

score 2 · Answer 3 · 2013-05-28

Hi, I am also having some trouble with the annotation of the Affymetrix HuGene ST 1.0 array. From what I understand, there is a difference between Probesets and TranscriptClusters. A Probeset is just a collection of probes, mainly designed to cover a specific exon, while a TranscriptCluster is a collection of Probesets. That is why the annotation file 'HuGene-1_0-st-v1.na33.2.hg19.probeset.csv' contains about 250k entries, while the annotation file 'HuGene-1_0-st-v1.na33.2.hg19.transcript.csv' contains the 33297 entries matching the dimensions of the AffyBatch object. Fortunately, the 7-digit identifiers (e.g. 7991762) distinguish between the two, so that each 7-digit identifier uniquely maps to either a Probeset or a TranscriptCluster.

The file 'HuGene-1_0-st-v1.na33.2.hg19.transcript.csv' contains a column titled 'gene_assignment', which holds information about the gene this TranscriptCluster is supposed to cover. Parsing this file will allow you to match about 2/3 of these TranscriptCluster IDs to gene symbols. This matching agrees to a very high extent with the annotation obtained from the Bioconductor R package 'hugene10sttranscriptcluster.db'.

Now here is my question: in this 'HuGene-1_0-st-v1.na33.2.hg19.transcript.csv' file, some symbols are annotated to more than one TranscriptCluster. This in itself is not surprising, since the TranscriptClusters could map to different isoforms. However, these TranscriptClusters annotated to the same gene symbol are also annotated to different chromosomes! How can one gene symbol map to multiple chromosomes?
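For anyone who wants to reproduce the parsing described above, here is a minimal base-R sketch. It assumes that entries in 'gene_assignment' are separated by " /// ", that the fields within an entry are separated by " // " with the gene symbol in second position, that "---" marks an empty assignment, and that the chromosome is in a column named 'seqname'. Those layout details should be checked against the actual file. The rows below are made up and stand in for what `read.csv()` on the transcript file would return.

```r
# Hypothetical rows in the assumed gene_assignment layout
anno <- data.frame(
  transcript_cluster_id = c("7000001", "7000002", "7000003"),
  seqname = c("chr1", "chrX", "chr2"),
  gene_assignment = c(
    "ACC_1 // GENE_A // description a // 1p36 // 1001 /// ACC_2 // GENE_A // description a // 1p36 // 1001",
    "ACC_3 // GENE_A // description a // Xq28 // 1001",
    "---"),
  stringsAsFactors = FALSE)

# Extract the gene symbol from the first entry of a gene_assignment string
first_symbol <- function(x) {
  if (is.na(x) || x == "---") return(NA_character_)
  first  <- strsplit(x, " /// ", fixed = TRUE)[[1]][1]
  fields <- strsplit(first, " // ", fixed = TRUE)[[1]]
  if (length(fields) < 2) NA_character_ else fields[2]
}
anno$symbol <- vapply(anno$gene_assignment, first_symbol, character(1),
                      USE.NAMES = FALSE)

# Count distinct chromosomes per symbol and list symbols seen on more than one
chr_per_symbol <- tapply(anno$seqname, anno$symbol,
                         function(s) length(unique(s)))
multi_chr <- names(chr_per_symbol)[chr_per_symbol > 1]

print(anno[, c("transcript_cluster_id", "seqname", "symbol")])
print(multi_chr)
```

This only lists the affected symbols so they can be inspected by hand; it does not explain why a symbol shows up on several chromosomes.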