Does anyone have the van't Veer et al. (2002) breast cancer data set?
3.4 years ago

I am trying to load the van't Veer et al. (2002) breast cancer dataset using the breastCancerNKI package from Bioconductor. This is my code:

library("breastCancerNKI")
library("affy")
library("Hmisc")
data(nki)
data <- exprs(nki)
dim(data)
# Annotate with gene symbol
genesymbol <- nki@featureData@data$HUGO.gene.symbol
row.names(data) <- genesymbol
# Reduce the data set: drop probes that have no gene symbol
data <- data[!is.na(row.names(data)), ]


In abstract(nki), I notice the statement that "151 had lymph-node-negative disease, and 144 had lymph-node-positive disease." However, exprs(nki) contains 337 samples, and it is unclear which of them are the 144 + 151 = 295 samples described. Does anyone have the data with sample labels, so that I can determine each sample's phenotype? Thanks in advance.

microarray breastcancer • 1.6k views

Have you considered contacting the authors of the paper?


There is no email address in the paper!


Thank you for your help @shussainather


This issue appears to go all the way back to 2012 and was never fixed: https://github.com/ramhiser/datamicroarray/issues/5

You may also want to follow up on GitHub.


I think it can be explained by the fact that the dataset includes patients from 2 studies, some of whom may have been used in both. It's just not clear, though.

On their Bioconductor page, they state that the Van de Vijver dataset was [edit:] 117 patients, which I found through this command:

table(pData(nki)[,3])

NKI NKI2
117  220


Perhaps those other 220 are extra samples used in the second study, by van't Veer. However, it still doesn't quite add up, because abstract() also states that all patients were under 53 years old, yet there are patients in their 60s in the dataset.
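
One way to check both points (how the node-negative and node-positive samples are distributed, and whether any series contains patients over 53) is to cross-tabulate the phenotype table by series. The sketch below is only an illustration: it uses a small invented data frame in place of pData(nki), and the column names `node` and `age` and the 0/1 coding of `node` are assumptions that should be checked against colnames(pData(nki)). Only the `series` column is confirmed in this thread.

```r
# Toy stand-in for pData(nki); all values are invented for illustration.
# With the real package one would use: pheno <- pData(nki)
pheno <- data.frame(
  series = c("NKI", "NKI", "NKI2", "NKI2", "NKI2"),
  node   = c(0, 0, 1, 0, 1),   # assumed coding: 0 = node-negative, 1 = node-positive
  age    = c(44, 61, 38, 52, 49),
  stringsAsFactors = FALSE
)

# Samples per series and lymph-node status, keeping missing values visible
node_by_series <- table(pheno$series, pheno$node, useNA = "ifany")
print(node_by_series)

# Age range within each series, to compare against the "under 53" statement
age_range <- tapply(pheno$age, pheno$series, range, na.rm = TRUE)
print(age_range)
```

If the real counts for one series match 151 and 144, that would indicate which samples the abstract describes; if they do not, the curation described elsewhere in this thread is the likely reason.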


Thank you, Kevin. I have read more than one article mentioning that the data contain 217 Normal vs. 78 Cancer samples. I tried this:

annotation <- pData(nki)


I got the annotation table, but it is not clear how the samples are labelled (for example, treatment is coded 0, 1, 2)!
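
A quick survey of the annotation table can at least show which columns are coded and how many samples fall under each code, even if it cannot say what the codes mean. This is a minimal sketch on an invented data frame standing in for pData(nki); the `series` and `treatment` columns are mentioned in this thread, while `age` and all values are assumptions for illustration.

```r
# Toy stand-in for pData(nki); all values are invented for illustration.
# With the real package one would use: annotation <- pData(nki)
annotation <- data.frame(
  series    = c("NKI", "NKI2", "NKI2", "NKI"),
  treatment = c(0, 1, 2, NA),
  age       = c(44, 38, 52, 61)
)

# Number of distinct non-missing values in each column;
# columns with only a few distinct values are likely coded labels
n_distinct <- sapply(annotation, function(x) length(unique(x[!is.na(x)])))
print(n_distinct)

# Count samples per treatment code, keeping NA visible
treatment_counts <- table(annotation$treatment, useNA = "ifany")
print(treatment_counts)
```

The meaning of each code still has to come from the package documentation or the curators.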


I think that it is indeed the column #3 ('series') in the pData that is the key. I just found this comment by someone who worked on it (I assume):

Benjamin Haibe-Kains: It was a complicated process of curation and unfortunately we did not keep track of all the changes. However, you can easily identify those samples thanks to the "series" slot in the phenodata: NKI and NKI2 stand for the first and second papers respectively. We removed samples from the first series in priority. I hope this helps. --- Benjamin Haibe-Kains Computational Biology and Functional Genomics Laboratory Center for Cancer Computational Biology Harvard School of Public Health, Dana-Farber

[source: http://grokbase.com/p/r/bioconductor/11asxgmh4j/bioc-breastcancernki-patients-question]
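
Following that comment, the samples from one paper can be pulled out by filtering on the "series" column. The sketch below shows the idea on invented stand-ins for exprs(nki) and pData(nki) so that it runs without the package; the sample and probe names are hypothetical. With the real ExpressionSet, `nki[, keep]` subsets the expression matrix and the phenotype table together.

```r
# Toy stand-ins for exprs(nki) and pData(nki); all values are invented.
expr <- matrix(
  1:12, nrow = 3,
  dimnames = list(paste0("probe", 1:3), paste0("s", 1:4))
)
pheno <- data.frame(
  series = c("NKI", "NKI2", "NKI2", "NKI"),
  row.names = colnames(expr),
  stringsAsFactors = FALSE
)

# Logical index of samples from the second series; NA is treated as "not selected"
keep <- !is.na(pheno$series) & pheno$series == "NKI2"

# Subset columns of the expression matrix and rows of the phenotype table together
expr_nki2  <- expr[, keep, drop = FALSE]
pheno_nki2 <- pheno[keep, , drop = FALSE]

# The two objects should still describe the same samples in the same order
stopifnot(identical(colnames(expr_nki2), rownames(pheno_nki2)))
print(dim(expr_nki2))
```

Using a logical index rather than `-which(...)` avoids accidentally dropping every sample when nothing matches.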

Sounds a bit suspicious to me. How difficult can it be to just make a study's data available to the public? One can only assume that there are other issues relating to the protection of this data that no-one wants to admit. This has happened before when 'important' data has been published.