Hello world !
I am working on microarray data (Hugene 2.0 st from affymetrix) and I was willing to get probe-level informations.
For one probeset (fsetid), I need all the different probes (fid) and their chromosome, start, stop, sequence etc.
I know I can get this on NetAffx but as I have to check multiple probesets, I wanted to use the library oligo in association with pd.hugene.2.0.st
Here is my code tested on a 'test-list' of 6 .CEL files:
## Library loading
library(oligo)
library(dplyr)
librarypd.hugene.2.0.st)
# .CEL importation and read
celFiles <- list.celfiles('./', full.names=TRUE)
batch = read.celfiles(celFiles)
# Expression matrix tmp
tmp=batch@assayData$exprs
dim(tmp)
# Connexion pd.hugene.2.0.st@getdb() and normalization
conn = pd.hugene.2.0.st@getdb()
norm = rma(batch, background=TRUE, normalize=TRUE, target="core")
# Data importation
tableNames=dbListTables(conn)
tablesList=list()
for (tableName in tableNames){
tablesList[[tableName]]=dbGetQuery(conn,paste('SELECT * FROM',tableName))
print(tableName)
str(tablesList[[tableName]])
}
# Simplification of the tablesList dataframe
testList <- tablesList$featureSet
testList2 <- tablesList$pmfeature
testListComplet <- dplyr::full_join(testList, testList2, by = "fsetid")
# Test on one probeset randomly selected : probeset 16730541
ind = testListComplet$fid[which(testListComplet$fsetid == 16730541)]
tmp[ind,]
probeset_test <- subset(testListComplet, fsetid == 16730541)
probeset_test
Here is the probeset_test result I get :
fid fsetid strand start stop transcript_cluster_id exon_id crosshyb_type level chrom type fid atom x y
153940 16730541 0 102218019 102218094 16730540 5033332 1 NA 11 1 944681 153940 48 586
153941 16730541 0 102218019 102218094 16730540 5033332 1 NA 11 1 2165196 153941 279 1343
So, according to pd.hugene.2.0.st, for probeset 16730541, I have 2 probes : 153940 & 153941 with a common start (102218019) and stop (102218094). But when I check on the NetAffx website ( https://www.affymetrix.com/analysis/netaffx/exon/wtgene_probe_set.affx?pk=712:16730541 ), here are the two probes I get :
atcagcggcgccgacaaggagatac chr11:102218019-102218043 (+)
cagcaaacacggaagctgcgcggct chr11:102218070-102218094 (+)
Did I do something wrong ? Did I misunderstand something ? Could you please enlighten me ?!
I am sorry if this is a stupid question but I could not figure it out by myself :(
Thank you so much in advance.
Here is my R info :
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.0
year 2017
month 04
day 21
svn rev 72570
language R
version.string R version 3.4.0 (2017-04-21)
nickname You Stupid Darkness
Oh, I did not noticed that !! Thank you !!
But, for the purpose of using R code-based analysis, do you know how I could get the right start and stop corresponding to each probe ? (instead of checking on NetAffx every time)
When I have something like 2 probes per probeset, it is easy to proceed by NetAffx but sometimes, I have way more so it gets pretty overwhelming !
I haven't used this array. But it seems to me that the info you are looking for is not in
pmfeature, but inpmSequence. Do you have pmSequence in your tablesList?For some clues, see Manipulation of probe-level data at https://github.com/benilton/oligoOld/wiki/Getting-the-grips-with-the-oligo-Package
Also see https://www.bioconductor.org/packages/devel/bioc/vignettes/oligo/inst/doc/oug.pdf
After multiple fails, I finally got every piece of info I needed (with the help of the github protocol and the pdf you mentioned)
So here is the code with all info :
With this result obtained :
Thank you very much for your help and patience !
Kind Regards
Claire