Question: [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array [transcript (gene) version]
1
gravatar for vinnu260
3 months ago by
vinnu26010
vinnu26010 wrote:

Hi, im working on Exon arrays i.e.. [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array [transcript (gene) version] - GPL5175. I did RMA with target="core" option using oligo package. i got 22011 probe clusters, which im trying to annotate. Is the procedure correct for normalization ? Which package should i use to annotate this data ? Thanks in advance.

ADD COMMENTlink modified 3 months ago by Kevin Blighe21k • written 3 months ago by vinnu26010
1
gravatar for Kevin Blighe
3 months ago by
Kevin Blighe21k
University College London Cancer Institute
Kevin Blighe21k wrote:

Yes, well, target="core" summarises the expression values into transcript (i.e. gene)-level values; whereas, target="probeset" will leave the expression values at the level of probe-sets. One or may probe-sets may target an individual exon, for example.

For annotation, I do not recommend any package. I recommend that you do a manual curation by obtaining the data straight from the source:

  1. Go here
  2. Under section Current NetAffx Annotation Files, download the file HuGene-1_0-st-v1 Transcript Cluster Annotations, CSV, Release 36 (79 MB, 7/6/16)
  3. Unzip the archive, look over the file (even in Excel) and get a feeling of what's in it (it is very comprehensive)
  4. Do the manual annotation within R by reading in the annotation and matching your transcript IDs to the IDs in the annotation

Not everything should be automated, of course, and annotation packages can be a particular nuisance.

If it helps, I have put some code specifically for this here in an un-responded old thread on Biostars: HuGene-1_1-st accessory files for microarray analysis

Kevin

ADD COMMENTlink modified 8 days ago • written 3 months ago by Kevin Blighe21k

Thanks for the response.

Can you please tell me the difference between 1) [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array transcript (gene) version vs [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array probe set (exon) version ? Im using the transcript version though both are available. 2) HuGene-1_0-st-v1 Transcript Cluster Annotations vs HuGene-1_0-st-v2 Transcript Cluster Annotations ? I tried using the v2 version of the same and ended up with clusters mapping to mutiple genes including miRNAs and few thousands of control clusters.

After annotating with HuGene-1_0-st-v2 Transcript Cluster Annotations file and removing clusters mapping multiple genes/control probes i ended up with 15500 genes out of 22011 transcript clusters. I expected them to map one to one at transcript level. Does this make sense or am i doing it wrong ?

The thread you mentioned talks about [HuGene-1_1-st] Affymetrix Human Gene 1.1 ST Array transcript (gene) version. Is there any other way of annotating transcript version of this exon arrays(GPL5175 and GPL5188) ? This article(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5263238/) uses GPL5175 and GPL5188 arrays but they got 26,493 transcriptswhich is way more than what im having(15500 transcripts). Can you help me understand how the array yielded 26000 transcripts ?

Thank you for the help.

ADD REPLYlink written 3 months ago by vinnu26010

Can you please tell me the difference between 1) [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array transcript (gene) version vs [HuEx-1_0-st] Affymetrix Human Exon 1.0 ST Array probe set (exon) version ?

Just like the differences between target="core" and target="probeset", there are also different respective annotations, one at the transcript level and the one with exon-level annotation.

The other questions that you have are precisely why the use of an automated annotation tool/package should be avoided. These packages will make these filtering decisions for you, even if they are not what you want for your experiment, exactly.

What I did was spent a few hours reviewing the annotation CSV and then creating a sub-set with just those annotations in which I was interested. So, out went the control genes, predicted (but not validated genes), and then a whole bunch of other junk that has no relevance.

After then normalising the data within R, I filtered out any transcripts that were not in my custom-made annotation file. I still had around 36,000 transcripts, mixed between coding and non-coding.

The name of the file that you've downloaded should be HuGene-2_1-st-v1.na36.hg19.transcript.csv, right?

ADD REPLYlink written 3 months ago by Kevin Blighe21k

Sorry for the wrong name. It was HuEx-1_0-st-v2 Transcript Cluster Annotations, CSV, Release 36 (95 MB, 7/6/16) (http://www.affymetrix.com/support/technical/byproduct.affx?product=huexon-st) i.e.. GPL5175. Can you share the R code you are using to preprocess ? Thank You.

ADD REPLYlink written 3 months ago by vinnu26010

Do you still need this?

ADD REPLYlink written 8 days ago by Kevin Blighe21k

Yeah if an answer is available

ADD REPLYlink written 8 days ago by vinnu26010

Here you go Chief:

Okay, first create a file like this, which lists your samples and phenotyes:

Targets.txt

FileName                                    SampleID   Group
SampleFiles/1_CS0911a_(HuGene-2_0-st).CEL   CS0911a    Disease
SampleFiles/10_CS0812d_(HuGene-2_0-st).CEL  CS0812d    Disease
SampleFiles/11_CS0812e_(HuGene-2_0-st).CEL  CS0812e    Control
SampleFiles/12_CS0812f_(HuGene-2_0-st).CEL  CS0812f    Control
SampleFiles/13_CS0801a_(HuGene-2_0-st).CEL  CS0801a    Control
SampleFiles/14_CS0801b_(HuGene-2_0-st).CEL  CS0801b    Disease
SampleFiles/15_CS0801c_(HuGene-2_0-st).CEL  CS0801c    Control

Then, in R:

library("limma")
library("oligo")

#Disable scientific notation
options(scipen=999)

#Read in the data into a dataframe
#readTargets will by default look for the 'FileName' column in the specified file
targetinfo <- readTargets("Targets.txt", sep="\t")
CELFiles <- list.celfiles("SampleFiles/", full.names = TRUE)

NB - Double check that the order of files in CELFiles matches the order in Targets.txt

#Read in raw intensity data
project <- read.celfiles(CELFiles)

#Background correct, normalize, and calculate gene expression
#summarise expression over genes
project.bgcorrect.norm.avg <- rma(project, background=TRUE, normalize=TRUE, target="core")

#Perform some diagnostics on the arrays
#Generate raw chip images to diagnose spatial artefacts for each array and a merged boxplot of intensities
pdf("Output/ChipImageQC.pdf")
    image(project)
dev.off()

pdf("Output/BoxPlotQC.pdf")
    par(mar=c(5,5,5,5), cex=1, cex.axis=0.8, mfrow=c(2,1))

    boxplot(project, which="all", transfo=log2, main="Raw chip fluorescent intensities", las=2)
    boxplot(project.bgcorrect.norm.avg, transfo=log2, main="Background-corrected, RMA normalised, log2 expression values\nAll probes" las=2)
dev.off()

NB - again, do more checks - never assume anything in Bioinformatics. Treble check that the order of files in CELFiles matches the order in Targets.txt, and that both of these match the sample order in the exprs(project.bgcorrect.norm.avg)

Now, perform differential expression:

#Create the study design and comparison model
design <- paste(targetinfo$Group, sep="")
design <- factor(design)
design

comparisonmodel <- model.matrix(~0+design)
colnames(comparisonmodel) <- levels(design)
comparisonmodel

#Fit the linear model on the study's data
#lmfit uses generalized least squares (GLS), which is better for heteroscedastic data (i.e., data where the variances of sub-populations are different)
#The model looks at each probe across the defined sample groups and 'fits' a value, which is then used to gauge statistical inferences
project.fitmodel <- lmFit(project.bgcorrect.norm.avg, comparisonmodel)

#Applying the empirical Bayes method to the fitted values acts as an extra normalization step and aims to bring the different probe-wise variances to common values.
#This ultimately reduces the degrees of freedom of variances (i.e., less individual, different, variances).
project.fitmodel.eBayes <- eBayes(project.fitmodel)
names(project.fitmodel.eBayes)

#Make individual contrasts and fit the new model
Q1_contrastmodel <- makeContrasts(Q1="Disease-Control", levels=comparisonmodel)
Q1_contrastmodel.fitmodel <- contrasts.fit(project.fitmodel.eBayes, Q1_contrastmodel)
Q1_contrastmodel.fitmodel.eBayes <- eBayes(Q1_contrastmodel.fitmodel)

Then view the results with the toptable() function of limma.

You can also draw PCA bi-plots, MA plots, and volcano plots. There's plenty of examples on how to do these on the World Wide Web

ADD REPLYlink modified 3 days ago • written 8 days ago by Kevin Blighe21k

Thank you. I did the same but I'm facing problem with annotation of probes. Kindly check previous comments.

ADD REPLYlink written 8 days ago by vinnu26010

Just download the CSV file from Affymetrix and then do the annotation manually in R. Check the rownames of your exprs() object in R, and use that information to determine on which column in the annotation file you need to match.

Take a look at the match() and which() functions

ADD REPLYlink written 8 days ago by Kevin Blighe21k

Here you go, kind Sir. This is performed after we normalise the data and produce our ExpressionSet object. The rownames of the ExpressionSet object match with the first column of the annotation CSV file.

... ... ...

#Background correct, normalize, and calculate gene expression
project.bgcorrect.norm.avg <- rma(project, background=TRUE, normalize=TRUE, target="core")

#Read in the transcript annotation related to this chip
#The annotation was edited by Kevin Blighe in order to remove all controls probes - we do not want these to be included in modeling/differential expression
#Also removed all non-coding transcripts
Annotation <- read.csv("Annotation/HuGene-2_1-st-v1.na35.hg19.transcript_KB_CodingOnly.csv", header=TRUE, skip=21)

#Subset the expression matrix to eliminate control probes (subset is performed using annotation as lookup)
project.bgcorrect.norm.avg <- exprs(project.bgcorrect.norm.avg)[which(rownames(project.bgcorrect.norm.avg) %in% Annotation[,1]),]

#Replace the old rowname IDs with column 8 of the annotation (gene name)
rownames(project.bgcorrect.norm.avg) <- Annotation[match(rownames(project.bgcorrect.norm.avg), Annotation[,1]), 8]

#Write out the normalised expression values
write.table(project.bgcorrect.norm.avg, "NormalisedCounts.GeneSummarised.tsv", sep="\t", quote=FALSE)

As you can see, I edited the annotation file that I download from Affymetrix's website.

ADD REPLYlink modified 8 days ago • written 8 days ago by Kevin Blighe21k

Ill try to do the same with human 1st array. Thanks

ADD REPLYlink written 7 days ago by vinnu26010
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1195 users visited in the last hour