Question: Does anyone have the van't Veer et al. (2002) breast cancer data set?
lur_murad (UK) wrote, 22 days ago:

I am trying to load the van't Veer et al. (2002) breast cancer dataset using the breastCancerNKI package from Bioconductor. This is my code:

library("breastCancerNKI")
library("affy")  # attaches Biobase, which provides exprs(), fData() and pData()
library("Hmisc")
# Load the data
data(nki)
data <- exprs(nki)
dim(data)
# Annotate with gene symbol
genesymbol <- fData(nki)$HUGO.gene.symbol
row.names(data) <- genesymbol
# Reduce the data set: drop probes that have no gene symbol
data <- data[!is.na(row.names(data)), ]

In abstract(nki), I notice that there should be "151 had lymph-node-negative disease, and 144 had lymph-node-positive disease." However, exprs(nki) contains 337 samples, and it is unclear which of them are the 144 + 151 = 295 patients. Does anyone have the data with the sample labels, so that I can tell each sample's phenotype? Thanks in advance.

Tags: microarray, breastcancer

Have you considered contacting the authors of the paper?

written 22 days ago by Hussain Ather

There is no email address in the paper!

written 22 days ago by lur_murad

Van't Veer's email address is here: http://cancer.ucsf.edu/people/profiles/vantveer_laura.3358

written 22 days ago by Hussain Ather

Thank you for your help @shussainather

written 22 days ago by lur_murad

This issue appears to go all the way back to 2012 and was never fixed: https://github.com/ramhiser/datamicroarray/issues/5

You may also want to follow up on GitHub.

written 22 days ago by Kevin Blighe

I think it can be explained by the fact that the dataset includes patients from 2 studies, some of whom may have been used in both studies. It's just not clear, though.

On their Bioconductor page, they state that the Van de Vijver dataset was 117 patients (edited), a count that I also found through this command:

table(pData(nki)[,3])

 NKI NKI2 
 117  220

Perhaps those other 220 are extra samples used in the second study, by van't Veer. However, it still doesn't quite add up: the abstract() output also states that all patients were under 53 years old, yet there are patients in their 60s in the dataset.
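The age discrepancy can be checked per series rather than by eye. The following is only a sketch: `age_range_by_series` is a hypothetical helper, and the column name `age` is an assumption about the phenotype table (only `series`, column 3, is confirmed in this thread). The call on the real data is guarded so the block still runs if the packages are not installed.

```r
# Hypothetical helper: summarise the age range within each series of a
# pData()-style data frame. The column names are assumptions; adjust them
# after checking names(pData(nki)).
age_range_by_series <- function(pheno, series_col = "series", age_col = "age") {
  stopifnot(series_col %in% names(pheno), age_col %in% names(pheno))
  # One vector of ages per series label (e.g. "NKI", "NKI2")
  groups <- split(pheno[[age_col]], pheno[[series_col]])
  # One row per series: sample count, age range, and how many are 53 or older
  t(sapply(groups, function(a) {
    c(n = length(a),
      min_age = min(a, na.rm = TRUE),
      max_age = max(a, na.rm = TRUE),
      n_53_or_older = sum(a >= 53, na.rm = TRUE))
  }))
}

# Run on the real data only if the Bioconductor packages are available
if (requireNamespace("breastCancerNKI", quietly = TRUE) &&
    requireNamespace("Biobase", quietly = TRUE)) {
  data("nki", package = "breastCancerNKI")
  print(age_range_by_series(Biobase::pData(nki)))
}
```

If the patients aged 53 or older all fall into one series, that would suggest the abstract describes only the other series.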

modified 22 days ago • written 22 days ago by Kevin Blighe

Thank you, Kevin. I have read more than one article mentioning that the data contain 217 Normal vs 78 Cancer samples. I tried this:

annotation <- pData(nki)

I got the annotation table, but it is not clear how they label the samples (treatment 0 1 2)!
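One way to see which codes each phenotype column actually uses is to tabulate the columns one by one. This is a sketch only: `tabulate_columns` is a hypothetical helper, and apart from `series` and `treatment` (both mentioned in this thread) the column names passed to it are assumptions. The tables show the codes and their counts but not what the codes mean; for that, the package's help page (`?nki`) is the place to look.

```r
# Hypothetical helper: frequency table (including NAs) for each requested
# column that actually exists in a pData()-style data frame.
tabulate_columns <- function(pheno, cols) {
  present <- intersect(cols, names(pheno))  # silently skip absent columns
  tabs <- lapply(present, function(col) table(pheno[[col]], useNA = "ifany"))
  stats::setNames(tabs, present)
}

# Run on the real data only if the Bioconductor packages are available
if (requireNamespace("breastCancerNKI", quietly = TRUE) &&
    requireNamespace("Biobase", quietly = TRUE)) {
  data("nki", package = "breastCancerNKI")
  annotation <- Biobase::pData(nki)
  print(names(annotation))  # check which columns exist first
  # "node" and "er" are assumed names; they are skipped if not present
  print(tabulate_columns(annotation, c("series", "treatment", "node", "er")))
}
```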

written 22 days ago by lur_murad

I think that column #3 ('series') in the pData is indeed the key. I just found this comment by someone who worked on it (I assume):

Benjamin Haibe-Kains: "It was a complicated process of curation and unfortunately we did not keep track of all the changes. However, you can easily identify those samples thanks to the "series" slot in the phenodata: NKI and NKI2 stand for the first and second papers respectively. We removed samples from the first series in priority. I hope this helps."

--- Benjamin Haibe-Kains, Computational Biology and Functional Genomics Laboratory, Center for Cancer Computational Biology, Harvard School of Public Health, Dana-Farber

[source: http://grokbase.com/p/r/bioconductor/11asxgmh4j/bioc-breastcancernki-patients-question]
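Following that comment, the samples can be split by the `series` column. The sketch below assumes the labels are exactly "NKI" and "NKI2", as in the table() output earlier in the thread; `series_index` is a hypothetical helper. It does not recover the exact 295 patients of the abstract, since the curation history was not kept.

```r
# Hypothetical helper: column indices of the samples belonging to the
# requested series in a pData()-style data frame.
series_index <- function(pheno, which_series, series_col = "series") {
  stopifnot(series_col %in% names(pheno))
  which(pheno[[series_col]] %in% which_series)
}

# Run on the real data only if the Bioconductor packages are available
if (requireNamespace("breastCancerNKI", quietly = TRUE) &&
    requireNamespace("Biobase", quietly = TRUE)) {
  data("nki", package = "breastCancerNKI")
  pheno <- Biobase::pData(nki)
  # Subsetting an ExpressionSet by column keeps expression values and
  # phenotype rows in sync
  nki_first  <- nki[, series_index(pheno, "NKI")]
  nki_second <- nki[, series_index(pheno, "NKI2")]
  print(dim(Biobase::exprs(nki_first)))
  print(dim(Biobase::exprs(nki_second)))
}
```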

Sounds a bit suspicious to me. How difficult can it be to just make a study's data available to the public? One can only assume that there are other issues relating to the protection of this data that no-one wants to admit. This has happened before when 'important' data has been published.

modified 22 days ago • written 22 days ago by Kevin Blighe