Hello world !
I am working on microarray data (Hugene 2.0 st from affymetrix) and I was willing to get probe-level informations.
For one probeset (fsetid), I need all the different probes (fid) and their chromosome, start, stop, sequence etc.
I know I can get this on NetAffx but as I have to check multiple probesets, I wanted to use the library oligo in association with pd.hugene.2.0.st
Here is my code tested on a 'test-list' of 6 .CEL files:
## Library loading
library(oligo)
library(dplyr)
librarypd.hugene.2.0.st)
# .CEL importation and read
celFiles <- list.celfiles('./', full.names=TRUE)
batch = read.celfiles(celFiles)
# Expression matrix tmp
tmp=batch@assayData$exprs
dim(tmp)
# Connexion pd.hugene.2.0.st@getdb() and normalization
conn = pd.hugene.2.0.st@getdb()
norm = rma(batch, background=TRUE, normalize=TRUE, target="core")
# Data importation
tableNames=dbListTables(conn)
tablesList=list()
for (tableName in tableNames){
tablesList[[tableName]]=dbGetQuery(conn,paste('SELECT * FROM',tableName))
print(tableName)
str(tablesList[[tableName]])
}
# Simplification of the tablesList dataframe
testList <- tablesList$featureSet
testList2 <- tablesList$pmfeature
testListComplet <- dplyr::full_join(testList, testList2, by = "fsetid")
# Test on one probeset randomly selected : probeset 16730541
ind = testListComplet$fid[which(testListComplet$fsetid == 16730541)]
tmp[ind,]
probeset_test <- subset(testListComplet, fsetid == 16730541)
probeset_test
Here is the probeset_test result I get :
fid fsetid strand start stop transcript_cluster_id exon_id crosshyb_type level chrom type fid atom x y
153940 16730541 0 102218019 102218094 16730540 5033332 1 NA 11 1 944681 153940 48 586
153941 16730541 0 102218019 102218094 16730540 5033332 1 NA 11 1 2165196 153941 279 1343
So, according to pd.hugene.2.0.st, for probeset 16730541, I have 2 probes : 153940 & 153941 with a common start (102218019) and stop (102218094). But when I check on the NetAffx website ( https://www.affymetrix.com/analysis/netaffx/exon/wtgene_probe_set.affx?pk=712:16730541 ), here are the two probes I get :
atcagcggcgccgacaaggagatac chr11:102218019-102218043 (+)
cagcaaacacggaagctgcgcggct chr11:102218070-102218094 (+)
Did I do something wrong ? Did I misunderstand something ? Could you please enlighten me ?!
I am sorry if this is a stupid question but I could not figure it out by myself :(
Thank you so much in advance.
Here is my R info :
platform x86_64-apple-darwin15.6.0
arch x86_64
os darwin15.6.0
system x86_64, darwin15.6.0
status
major 3
minor 4.0
year 2017
month 04
day 21
svn rev 72570
language R
version.string R version 3.4.0 (2017-04-21)
nickname You Stupid Darkness
Oh, I did not noticed that !! Thank you !!
But, for the purpose of using R code-based analysis, do you know how I could get the right start and stop corresponding to each probe ? (instead of checking on NetAffx every time)
When I have something like 2 probes per probeset, it is easy to proceed by NetAffx but sometimes, I have way more so it gets pretty overwhelming !
I haven't used this array. But it seems to me that the info you are looking for is not in
pmfeature
, but inpmSequence
. Do you have pmSequence in your tablesList?For some clues, see Manipulation of probe-level data at https://github.com/benilton/oligoOld/wiki/Getting-the-grips-with-the-oligo-Package
Also see https://www.bioconductor.org/packages/devel/bioc/vignettes/oligo/inst/doc/oug.pdf
After multiple fails, I finally got every piece of info I needed (with the help of the github protocol and the pdf you mentioned)
So here is the code with all info :
With this result obtained :
Thank you very much for your help and patience !
Kind Regards
Claire