Human Gene 2.1 ST Array annotation - How to import .csv annotation file properly
20 months ago


I'll start by saying I am pretty much a beginner at this, and I'm largely on my own trying to learn data analysis using R. Currently I am working with an Affymetrix microarray dataset, specifically from their Human Gene 2.1 ST Array.

But apparently that was a very bad idea, as I'm facing a huge roadblock in trying to annotate the probe IDs properly. I've found annotation guides that cover many of their arrays but never specifically the Human Gene 2.1 ST, and it is very frustrating. Even the actual guide for the oligo package (the package needed to analyze this data) wasn't helpful, as it is badly formatted and parts of it are cropped at the margin, making it impossible to follow.

After some searching, I found a product page for the Human Gene 2.1 ST Array strip, which had some supplementary files that included annotation files in .csv format.

List of supplementary files; I downloaded the first one under "Support files".

Now I need to import that file into the R environment in a usable way, and that's where I'm stuck and could use some help. I know the right function to use would be "read.csv", but I don't know how to set it up so that the file is imported correctly rather than coming out completely broken.

R microarray human annotation • 1.3k views

Please see How to add images to a Biostars post to add your images properly. You need the direct link to the image, not the link to the webpage that has the image embedded (which is what you have used here)

20 months ago


You will want the NetAffx file from the following link: Human Gene ST Array Plates - Support Materials. The actual file that you'll need is called 'HuGene-2_1-st-v1 Transcript Cluster Annotations, CSV, Release 36' (you may have to sign up in order to download the file).

Once you get it, unpack it and then you should be able to read it in by simply doing:

annotation <- read.table(
  file = "HuGene-2_1-st-v1.na36.hg19.probeset.csv",
  header = TRUE,
  sep = ",",
  quote = "\"",
  dec = ".")
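
A quick sanity check after reading the file can save trouble later. The sketch below is only an illustration: it writes a tiny made-up CSV that mimics what I assume is the NetAffx layout (metadata lines starting with "#%", then a quoted header row) and reads it back with the same arguments. The file contents and the IDs in it are invented; the point is that read.table() skips the "#" lines by default and keeps the quoted fields intact.

```r
# Toy stand-in for a NetAffx CSV (contents are invented for illustration).
toy <- tempfile(fileext = ".csv")
writeLines(c(
  "#%toy-metadata-line=ignored",
  "\"probeset_id\",\"seqname\",\"gene_assignment\"",
  "\"16650001\",\"chr1\",\"NR_0001 // GENE_A // toy gene A // 1p36 // 101\"",
  "\"16650003\",\"---\",\"---\""),
  toy)

# Same arguments as above; comment.char = "#" is the read.table() default,
# so the metadata line is skipped before the header is parsed.
toy_annotation <- read.table(
  file = toy,
  header = TRUE,
  sep = ",",
  quote = "\"",
  dec = ".",
  stringsAsFactors = FALSE)

# Check the dimensions and column names before going further.
dim(toy_annotation)
colnames(toy_annotation)
```

On the real file, the same two checks (dim() and colnames()) should show tens of thousands of rows and columns such as probeset_id and gene_assignment.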

Control probesets

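You can identify control probeset IDs by searching the probeset_type column, e.g.:

idx <- grep(
  "control->affx|control->affx->bac_spike|control->affx->ercc|control->affx->polya_spike|control->bgp->antigenomic|normgene->intron|Reporter",
  annotation$probeset_type)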

Gene symbols

You can also isolate gene symbols:

gene <- lapply(
  strsplit(as.character(annotation$gene_assignment), split = " /// | // "),
  function(x) x[2])
gene <- unlist(lapply(
  gene,
  function(x) gsub("---;NA", NA, paste(x, collapse = ";"))))
gene[gene == "NA"] <- NA

After that, you should be able to match the annotation to your main expression data via the probeset_id and / or transcript_cluster_id columns.
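
As a rough sketch of that matching step: assuming your normalised expression matrix has the array IDs as row names (which is what I would expect from exprs() on an oligo::rma() result), match() lines the annotation up with the matrix rows. Everything below (expr, the IDs, the symbols) is toy data made up for illustration, not taken from the real array.

```r
# Toy expression matrix: rows are IDs, columns are samples (values invented).
expr <- matrix(
  c(7.1, 6.8, 5.2, 5.5, 9.0, 9.3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("103", "101", "999"), c("sample1", "sample2")))

# Toy annotation using the same ID column name as the NetAffx file.
toy_annotation <- data.frame(
  transcript_cluster_id = c("101", "102", "103"),
  symbol = c("GENE_A", NA, "GENE_C"),
  stringsAsFactors = FALSE)

# For each row of expr, find its row in the annotation (NA when absent).
idx <- match(rownames(expr), as.character(toy_annotation$transcript_cluster_id))

# Bind the matched symbols to the expression values, keeping expr's row order.
annotated <- data.frame(
  ID = rownames(expr),
  symbol = toy_annotation$symbol[idx],
  expr,
  stringsAsFactors = FALSE)

annotated
```

The important detail is the direction of match(): the expression row names go first, so the result has one entry per expression row and can be used to index the annotation directly.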



Sorry for the late response, but thank you so much!

It did indeed help me greatly by allowing me to import the file to the R workspace. Just a few questions:

  • You said to download 'HuGene-2_1-st-v1 Transcript Cluster Annotations, CSV, Release 36', but in the actual code you used the Probeset annotations. When I did it myself I used the Transcript Cluster file for both, but which one is actually correct?

  • Tied to the last question, the control probe code only works if I use the 'Probeset' annotation, is that right?

  • And finally, even though I did manage to import the file, I couldn't properly match the annotation to the expression data, so if you could shed any light on that, I'd appreciate it.

Thanks again for the help, I noticed looking through other questions around here that you help a lot of people!


Just doing my bit to help the community - bioinformatics is the Wild Wild West, after all, and a lot of folk are on their own.


Point 1

The annotation of the ST arrays can be annoying, I admit. You have to account for the fact that, after normalisation, some data-points will have either a probeset or transcript cluster ID. How are you normalising - I mean, what is your exact oligo::rma() command?

Point 2

I believe the control probes will only have a probeset ID, yes, but please double check. The ST array processing made me nervous, but so does everything in bioinformatics.

Point 3

Which ones do not match? Check again via both probeset and transcript cluster ID.
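
One way to see which IDs fail to match, sketched with made-up data: compare the IDs from the expression matrix against each ID column in turn, then against both together. Here expr_ids and the toy annotation are hypothetical stand-ins for the row names of your normalised matrix and the imported NetAffx table.

```r
# Hypothetical IDs taken from a normalised expression matrix.
expr_ids <- c("101", "102", "201", "999")

# Toy annotation: transcript cluster and probeset IDs differ for some rows.
toy_annotation <- data.frame(
  transcript_cluster_id = c("101", "102", "103"),
  probeset_id = c("101", "202", "201"),
  stringsAsFactors = FALSE)

in_tc <- expr_ids %in% toy_annotation$transcript_cluster_id
in_ps <- expr_ids %in% toy_annotation$probeset_id

# Cross-tabulate: how many IDs match each column, and how many match neither.
table(transcript_cluster = in_tc, probeset = in_ps)

# IDs that match neither column; these are the ones worth inspecting by hand.
unmatched <- expr_ids[!(in_tc | in_ps)]
unmatched
```

If most IDs match one column but a handful only match the other, that is the situation the merged lookup table is meant to handle.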


Here, this may help a bit further. This is from the automated pipeline that I use for ST array processing. In the final code chunk, I create a merged annotation lookup table that has everything duplicated for both probeset and transcript cluster ID, i.e., to ensure that everything that can possibly match is matched:

  chip <- annotation(data)
  annotation <- read.table(
    file = paste0("Analysis/netaffx/", chip, "_annot.csv"),
    header = TRUE,
    sep = ",",
    quote = "\"",
    dec = ".")

  # get IDs of control probes
  idx <- grep("control->affx|control->affx->bac_spike|control->affx->ercc|control->affx->polya_spike|control->bgp->antigenomic|normgene->intron|Reporter", annotation$probeset_type)
  controlIDs <- unique(c(annotation[idx,"probeset_id"], annotation[idx,"transcript_cluster_id"]))

  # remove control probes
  # (logical indexing is used so that nothing is dropped by accident when no control IDs are found)
  annotation <- annotation[!(annotation$transcript_cluster_id %in% controlIDs),]

  # extract gene symbol information and convert to Entrez
  gene <- lapply(strsplit(as.character(annotation$gene_assignment), split = " /// | // "), function(x) x[2])
  gene <- unlist(lapply(gene, function(x) gsub("---;NA", NA, paste(x, collapse = ";"))))
  gene[gene == "NA"] <- NA

  library(biomaRt)
  mart <- useMart("ENSEMBL_MART_ENSEMBL")
  mart <- useDataset("hsapiens_gene_ensembl", mart)
  #mart <- useDataset("mmusculus_gene_ensembl", mart)
  # NB - recent Ensembl releases name the Entrez attribute "entrezgene_id"
  annots <- getBM(
    mart = mart,
    attributes = c("hgnc_symbol", "entrezgene"),
    filters = "hgnc_symbol",
    values = gene)
  #annots <- getBM(
    #mart = mart,
    #attributes = c("mgi_symbol", "entrezgene"),
    #filters = "mgi_symbol",
    #values = gene)
  # keep one row per symbol, then align to the order of 'gene'
  annots <- annots[!duplicated(annots[,1]),]
  annots <- data.frame(
    annots[match(gene, annots$hgnc_symbol),], gene)
  #annots <- data.frame(
    #annots[match(gene, annots$mgi_symbol),], gene)

  # some probesets, after normalisation, are assigned a probeset ID, and not a transcript cluster ID
  # Thus, we create a merged annotation using both to account for these
  # NB - probeset == transcript cluster ID for control probes only; so, this is not an issue.
  annots <- rbind(
    data.frame(
      ID = annotation$transcript_cluster_id,
      Gene.Title = rep(NA, nrow(annots)),
      Gene.Symbol = annots$hgnc_symbol, #Gene.Symbol = annots$mgi_symbol,
      Gene.Symbol.Alt = gene,
      Entrez.Gene = annots$entrezgene,
      Chromosomal.Location = paste0(annotation$seqname, ":", annotation$start, "-", annotation$stop)),
    data.frame(
      ID = annotation$probeset_id,
      Gene.Title = rep(NA, nrow(annots)),
      Gene.Symbol = annots$hgnc_symbol, #Gene.Symbol = annots$mgi_symbol,
      Gene.Symbol.Alt = gene,
      Entrez.Gene = annots$entrezgene,
      Chromosomal.Location = paste0(annotation$seqname, ":", annotation$start, "-", annotation$stop)))

  annotation <- annots
