Does anyone have the van't Veer et al. (2002) breast cancer data set?
3.4 years ago
lur_murad • 0

I am trying to download the van't Veer et al. (2002) breast cancer dataset using the breastCancerNKI package from Bioconductor. This is my code:

library("breastCancerNKI")
library("affy")
library("Hmisc")
# Load the data
data(nki)
data <- exprs(nki)
dim(data)
# Annotate with gene symbols
genesymbol <- fData(nki)$HUGO.gene.symbol
rownames(data) <- genesymbol
# Reduce the data set: drop probes without a gene symbol
data <- data[!is.na(rownames(data)), ]

In abstract(nki), I notice that there should be "151 had lymph-node-negative disease, and 144 had lymph-node-positive disease." However, exprs(nki) contains 337 samples. It is unclear which of them are the 144 + 151 = 295 patients. Does anyone have the data with sample labels, so that I can identify each sample's phenotype? Thanks in advance.

microarray breastcancer • 1.6k views

Have you considered contacting the authors of the paper?


There is no email address in the paper!


Thank you for your help @shussainather


This issue appears to go all the way back to 2012 and was never fixed: https://github.com/ramhiser/datamicroarray/issues/5

You may also want to follow up on GitHub.


I think that it can be explained by the fact that the dataset includes patients from 2 studies, some of whom may have been used in both. It's just not clear, though.

On their Bioconductor page, they state that the Van de Vijver dataset was [edit:] 117 patients, which I found through this command:

table(pData(nki)[,3])

 NKI NKI2 
 117  220

Perhaps those other 220 are extra samples used in the second study, by van't Veer. However, it doesn't really add up, because in the abstract() they also state that all patients were under 53 years old, yet there are patients in their 60s in the dataset.
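One way to check how the counts add up is to cross-tabulate the series against another phenotype column and to list the patients above the stated age limit. Here is a minimal sketch. The helper `summarise_series` is hypothetical, the column names `series`, `node` and `age` are assumptions about what pData(nki) contains, and the toy data frame only stands in for the real phenotype table so that the snippet runs on its own.

```r
# Hypothetical helper: count samples per series, split by another phenotype column.
# `pheno` is expected to look like pData(nki); the column names are assumptions.
summarise_series <- function(pheno, by = "node", series_col = "series") {
  stopifnot(is.data.frame(pheno),
            series_col %in% names(pheno),
            by %in% names(pheno))
  # useNA = "ifany" keeps samples with a missing label visible in the table
  table(pheno[[series_col]], pheno[[by]],
        useNA = "ifany", dnn = c(series_col, by))
}

# Toy stand-in for pData(nki) (made-up values, NOT the real data)
toy <- data.frame(
  series = c("NKI", "NKI", "NKI2", "NKI2", "NKI2"),
  node   = c(0, NA, 0, 1, 1),
  age    = c(44, 61, 39, 52, 48)
)

counts  <- summarise_series(toy)
# Samples older than the age limit quoted in the abstract
over_53 <- toy[!is.na(toy$age) & toy$age > 53, ]

print(counts)
print(over_53)

# With the real data (requires Biobase and breastCancerNKI):
# summarise_series(pData(nki))
```

If the real table has differently named columns, pass them through the `by` and `series_col` arguments.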


Thank you, Kevin. I have read more than one article mentioning that the data contain 217 Normal vs 78 Cancer samples. I tried this:

annotation <- pData(nki)

I got the annotation table, but it is not clear how the samples are labelled (treatment 0, 1, 2)!


I think that it is indeed the column #3 ('series') in the pData that is the key. I just found this comment by someone who worked on it (I assume):

Benjamin Haibe-Kains: It was a complicated process of curation and unfortunately we did not keep track of all the changes. However, you can easily identify those samples thanks to the "series" slot in the phenodata: NKI and NKI2 stand for the first and second papers respectively. We removed samples from the first series in priority. I hope this helps. --- Benjamin Haibe-Kains Computational Biology and Functional Genomics Laboratory Center for Cancer Computational Biology Harvard School of Public Health, Dana-Farber

[source: http://grokbase.com/p/r/bioconductor/11asxgmh4j/bioc-breastcancernki-patients-question]
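Going by that comment, the samples from each paper could be pulled apart using the series column. A rough sketch follows. The helper `subset_by_series` is hypothetical, the column name `series` and the labels `NKI` / `NKI2` are taken from the table() output earlier in the thread, and the toy matrix only stands in for exprs(nki).

```r
# Hypothetical helper: keep only the samples (columns) belonging to one series.
# `expr` is a genes x samples matrix, `pheno` has one row per sample, same order.
subset_by_series <- function(expr, pheno, keep = "NKI2", series_col = "series") {
  stopifnot(ncol(expr) == nrow(pheno), series_col %in% names(pheno))
  # Treat a missing series label as "not in this series"
  idx <- !is.na(pheno[[series_col]]) & pheno[[series_col]] == keep
  list(expr  = expr[, idx, drop = FALSE],
       pheno = pheno[idx, , drop = FALSE])
}

# Toy stand-ins for exprs(nki) and pData(nki) (made-up values, NOT the real data)
toy_expr <- matrix(seq_len(12), nrow = 3,
                   dimnames = list(paste0("gene", 1:3), paste0("s", 1:4)))
toy_pheno <- data.frame(series = c("NKI", "NKI2", "NKI2", "NKI"),
                        row.names = colnames(toy_expr))

second <- subset_by_series(toy_expr, toy_pheno, keep = "NKI2")
print(colnames(second$expr))

# With the real data (requires Biobase and breastCancerNKI):
# second <- subset_by_series(exprs(nki), pData(nki), keep = "NKI2")
# or, keeping the ExpressionSet intact:
# nki2 <- nki[, pData(nki)$series == "NKI2"]
```

This only separates the two series as labelled in the package. It does not by itself recover the 295 patients from the abstract, since the curation removed some samples.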

Sounds a bit suspicious to me. How difficult can it be to just make a study's data available to the public? One can only assume that there are other issues relating to the protection of this data that no-one wants to admit. This has happened before when 'important' data has been published.
