Question: Human Gene 2.1 ST Array annotation - How to import .csv annotation file properly
0
7 weeks ago, regisantonioli wrote:

Hello,

I'll start by saying I am pretty much a beginner at this, and I'm going solo trying to learn data analysis using R. Currently I am working with an Affymetrix microarray dataset, more specifically their Human Gene 2.1 ST Array.

But apparently that was a very bad idea, as I'm facing a huge roadblock in trying to annotate the probe IDs properly. I've found annotation guides that cover their other arrays but never specifically the Human Gene 2.1 ST, which is very frustrating. Even the actual guide for the oligo package (the package needed to analyze these arrays) wasn't helpful, as it is badly formatted and parts of it are even cropped at the margin, making it impossible to follow.

After some searching, I found a product page for the Human Gene 2.1 ST Array strip, and it had some supplementary files that included annotation files in .csv format.

List of supplementary files, I downloaded the first one under "Support files"

Now I need to import that file into the R environment in a usable way, and that's where I'm stuck and could use some help. I know the right function to use would be "read.csv", but I don't know how to set it up so that the file is imported correctly rather than coming out completely broken.

annotation human microarray R • 230 views
1

Please see How to add images to a Biostars post to add your images properly. You need the direct link to the image, not the link to the webpage that has the image embedded (which is what you have used here)

Reply written 7 weeks ago by RamRS
2
7 weeks ago, Kevin Blighe wrote:

Hey,

You will want the NetAffx file from the following link: Human Gene ST Array Plates - Support Materials. The actual file that you'll need is called 'HuGene-2_1-st-v1 Transcript Cluster Annotations, CSV, Release 36' (you may have to sign up in order to download the file).

Once you get it, unpack it and then you should be able to read it in by simply doing:

annotation <- read.table(
  file = "HuGene-2_1-st-v1.na36.hg19.probeset.csv",
  header = TRUE,
  sep = ",",
  quote = "\"",
  dec = ".")
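
If you would rather stick with read.csv, here is a minimal sketch of the same import. It assumes the NetAffx file starts with a block of '#%' metadata lines before the header row, which read.table skips by default (comment.char = "#") but read.csv does not (comment.char = ""). The mock file, its contents and the name annotation_demo are made up for illustration only:

```r
# Build a tiny mock file imitating the assumed layout:
# '#%' metadata lines first, then a quoted, comma-separated table.
mock_file <- tempfile(fileext = ".csv")
writeLines(c(
  "#%chip_type=HuGene-2_1-st-v1",
  "#%netaffx-annotation-netaffx-build=36",
  '"probeset_id","seqname","gene_assignment"',
  '"16650001","chr1","NR_046018 // DDX11L1 // DEAD/H box helicase 11 like 1 // 1p36.33 // 100287102"',
  '"16650003","chr1","---"'
), mock_file)

annotation_demo <- read.csv(
  file = mock_file,
  comment.char = "#",        # skip the metadata lines; read.csv would otherwise parse them as data
  stringsAsFactors = FALSE)  # keep text columns as character rather than factor

str(annotation_demo)
```

With the real file, swap mock_file for the path to the unpacked .csv. If the import still looks broken, checking the first few lines of the file with readLines(path, n = 30) usually shows why.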

Control probesets

You can identify control probeset IDs by searching the probeset_type column, e.g.:

idx <- grep(
  "control->affx|control->affx->bac_spike|control->affx->ercc|control->affx->polya_spike|control->bgp->antigenomic|normgene->intron|Reporter",
  annotation$probeset_type)
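
Before relying on the pattern, it may help to see which categories your file actually contains. A small sketch on made-up values (probeset_type_demo is a hypothetical stand-in for annotation$probeset_type, and the pattern is shortened for the demo):

```r
# Hypothetical probeset_type values, standing in for annotation$probeset_type
probeset_type_demo <- c(
  "main", "control->affx", "normgene->intron", "main", "control->bgp->antigenomic")

# Tabulate the categories that are present
print(table(probeset_type_demo))

# Logical flag per row: TRUE for control-like probesets
is_control <- grepl(
  "control->affx|control->bgp->antigenomic|normgene->intron|Reporter",
  probeset_type_demo)

# Keep only the non-control entries
main_only <- probeset_type_demo[!is_control]
```

Filtering with a logical vector like !is_control stays safe even when nothing matches, whereas negative indexing with -which(...) drops every row if there are no hits.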

Gene symbols

You can also isolate gene symbols:

gene <- lapply(
  strsplit(as.character(annotation$gene_assignment), split = " /// | // "),
  function(x) x[2])
gene <- unlist(
  lapply(
    gene,
    function(x) gsub("---;NA", NA, paste(x, collapse = ";"))))
gene[gene == "NA"] <- NA
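
To make it clearer what the split is doing, here is a worked sketch on two made-up gene_assignment strings. It assumes the fields within one assignment are separated by " // " and that multiple assignments are separated by " /// ", so the second token is the first gene symbol:

```r
# Two hypothetical gene_assignment entries: one annotated, one empty ("---")
gene_assignment_demo <- c(
  paste(
    "NR_046018 // DDX11L1 // DEAD/H box helicase 11 like 1 // 1p36.33 // 100287102",
    "ENST00000456328 // DDX11L1 // DEAD/H box helicase 11 like 1 // 1p36.33 // 100287102",
    sep = " /// "),
  "---")

# Split on either separator; each element becomes a vector of tokens
fields <- strsplit(gene_assignment_demo, split = " /// | // ")

# Token 2 is the first gene symbol; entries with a single token yield NA
first_symbol <- vapply(fields, function(x) x[2], character(1))
print(first_symbol)
```

Note that this only keeps the first symbol per row; rows assigned to several genes will lose the others.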

After that, you should be able to match the annotation to your main expression data via the probeset_id and / or transcript_cluster_id columns.
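
As a rough sketch of that matching step, assuming the row names of your expression matrix are the IDs produced by normalisation. Here expr_demo and annot_lookup are made-up stand-ins for your expression data and the imported annotation:

```r
# Hypothetical expression matrix: rows are IDs, columns are samples
expr_demo <- matrix(
  c(7.1, 5.3, 9.8, 7.4, 5.1, 9.9),
  nrow = 3,
  dimnames = list(c("16650001", "16650003", "16650005"), c("s1", "s2")))

# Hypothetical annotation table, in a different order and missing one ID
annot_lookup <- data.frame(
  transcript_cluster_id = c(16650005L, 16650001L),
  symbol = c("GENE_B", "GENE_A"),
  stringsAsFactors = FALSE)

# match() returns, for each expression row, the matching annotation row (or NA)
row_idx <- match(rownames(expr_demo), as.character(annot_lookup$transcript_cluster_id))

annotated <- data.frame(
  ID = rownames(expr_demo),
  symbol = annot_lookup$symbol[row_idx],
  expr_demo,
  stringsAsFactors = FALSE)
print(annotated)
```

Comparing the IDs as character on both sides avoids surprises when one side was read in as integer or factor.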

Kevin


Sorry for the late response, but thank you so much!

It did indeed help me greatly by allowing me to import the file to the R workspace. Just a few questions:

  • You said to download 'HuGene-2_1-st-v1 Transcript Cluster Annotations, CSV, Release 36', but in the actual code you used the Probeset annotations. When I did it myself I used the Transcript Cluster file for both, but which one is actually correct?

  • Tied to the last question, the control probe code only works if I use the 'Probeset' annotation, is that right?

  • And finally, even though I did manage to import the file, I couldn't properly match the annotation to the expression data, so if you could shed any light on that, I'd appreciate it.

Thanks again for the help, I noticed looking through other questions around here that you help a lot of people!

Reply written 7 weeks ago by regisantonioli

Just doing my bit to help the community - bioinformatics is the Wild Wild West, after all, and a lot of folk are on their own.


Point 1

The annotation of the ST arrays can be annoying, I admit. You have to account for the fact that, after normalisation, some data-points will have either a probeset or transcript cluster ID. How are you normalising - I mean, what is your exact oligo::rma() command?
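
For reference, a typical call might look like the sketch below. The "CEL" directory is hypothetical, and the block only runs if oligo is installed and CEL files are found; reading the files also needs the matching platform design package. As far as I know, the target argument of oligo::rma() is what decides the summarisation level for Gene ST arrays:

```r
# Hypothetical directory holding the .CEL files; adjust to your own layout
cel_files <- list.files("CEL", pattern = "\\.CEL$", full.names = TRUE, ignore.case = TRUE)

if (requireNamespace("oligo", quietly = TRUE) && length(cel_files) > 0) {
  raw <- oligo::read.celfiles(cel_files)

  # target = "core" summarises to transcript clusters (gene level);
  # target = "probeset" summarises to individual probesets instead
  eset <- oligo::rma(raw, target = "core")

  # These row names are the IDs that have to be matched to the annotation file
  print(head(rownames(Biobase::exprs(eset))))
}
```

If you summarised with target = "core", the Transcript Cluster annotation file is the natural one to match against; with target = "probeset", use the Probeset file.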

Point 2

I believe the control probes will only have a probeset ID, yes, but please double check. The ST array processing made me nervous, but so does everything in bioinformatics.

Point 3

Which are the ones that do not match? - check again via probeset and transcript cluster ID.
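
A quick way to list them, sketched on made-up values (expr_ids stands in for rownames() of your normalised data, and annot_ids for the imported annotation table):

```r
# Hypothetical IDs from the expression data
expr_ids <- c("16650001", "16650003", "99999999")

# Hypothetical annotation table with both ID columns
annot_ids <- data.frame(
  probeset_id = c(16650001L, 16650002L),
  transcript_cluster_id = c(16650001L, 16650003L))

# Check each expression ID against both columns
in_transcript <- expr_ids %in% as.character(annot_ids$transcript_cluster_id)
in_probeset <- expr_ids %in% as.character(annot_ids$probeset_id)

# IDs found in neither column
unmatched <- expr_ids[!(in_transcript | in_probeset)]
print(unmatched)
```

If most IDs end up unmatched, the usual suspects are an annotation file at the wrong level (probeset versus transcript cluster) or IDs compared as different types.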

-------------------------------------

Here, this may help a bit further. This is from the automated pipeline that I use for ST array processing. In the final code chunk, I create a merged annotation lookup table that has everything duplicated for both probeset and transcript cluster ID, i.e., to ensure that everything that can possibly match is matched:

  chip <- annotation(data)
  annotation <- read.table(file = paste0("Analysis/netaffx/", chip, "_annot.csv"), header = TRUE, sep = ",",
    quote = "\"", dec = ".")

  # get IDs of control probes
  idx <- grep("control->affx|control->affx->bac_spike|control->affx->ercc|control->affx->polya_spike|control->bgp->antigenomic|normgene->intron|Reporter", annotation$probeset_type)
  # as.character() guards against factor columns being combined as integer codes
  controlIDs <- unique(c(as.character(annotation[idx,"probeset_id"]), as.character(annotation[idx,"transcript_cluster_id"])))

  # remove control probes (logical indexing; -which() would drop every row if no controls were found)
  annotation <- annotation[!(annotation$transcript_cluster_id %in% controlIDs),]

  # extract gene symbol information and convert to Entrez
  gene <- lapply(strsplit(as.character(annotation$gene_assignment), split = " /// | // "), function(x) x[2])
  gene <- unlist(lapply(gene, function(x) gsub("---;NA", NA, paste(x, collapse = ";"))))
  gene[gene == "NA"] <- NA

  library(biomaRt)
  mart <- useMart("ENSEMBL_MART_ENSEMBL")
  mart <- useDataset("hsapiens_gene_ensembl", mart)
  #mart <- useDataset("mmusculus_gene_ensembl", mart)
  # NB - newer Ensembl releases name the Entrez attribute "entrezgene_id"; adjust here and below if needed
  annots <- getBM(
    mart=mart,
    attributes=c("hgnc_symbol", "entrezgene"),
    filters="hgnc_symbol",
    values = gene,
    uniqueRows=TRUE)
  #annots <- getBM(
    #mart=mart,
    #attributes=c("mgi_symbol", "entrezgene"),
    #filter="mgi_symbol",
    #values = gene,
    #uniqueRows=TRUE)
  annots <- annots[!duplicated(annots[,1]),]
  annots <- data.frame(
    annots[match(gene, annots$hgnc_symbol),], gene)
  #annots <- data.frame(
    #annots[match(gene, annots$mgi_symbol),], gene)

  # some probesets, after normalisation, are assigned a probeset ID, and not a transcript cluster ID
  # Thus, we create a merged annotation using both to account for these
  # NB - probeset == transcript cluster ID for control probes only; so, this is not an issue.
  annots <- rbind(
    data.frame(
      ID = annotation$transcript_cluster_id,
      Gene.Title = rep(NA, nrow(annots)),
      Gene.Symbol = annots$hgnc_symbol, #Gene.Symbol = annots$mgi_symbol,
      Gene.Symbol.Alt = gene,
      Entrez.Gene = annots$entrezgene,
      Chromosomal.Location = paste0(annotation$seqname, ":", annotation$start, "-", annotation$stop)),
    data.frame(
      ID = annotation$probeset_id,
      Gene.Title = rep(NA, nrow(annots)),
      Gene.Symbol = annots$hgnc_symbol, #Gene.Symbol = annots$mgi_symbol,
      Gene.Symbol.Alt = gene,
      Entrez.Gene = annots$entrezgene,
      Chromosomal.Location = paste0(annotation$seqname, ":", annotation$start, "-", annotation$stop)))

  annotation <- annots

Reply written 7 weeks ago by Kevin Blighe