Question: Query About Data Of Breast Cancer Dataset
9.2 years ago by
Dear all,

I am having trouble working with a classic breast cancer dataset from the following paper.

A gene-expression signature as a predictor of survival in breast cancer

I downloaded data from here. But when I visualize the probe signal, the data do not appear to have been normalized: the values cluster around zero, in contrast to typical data with a mean around 7 (after log2 transformation). I have noticed that this array is not a standard chip; the authors developed it themselves. So I wonder what kind of data processing should be performed to make the analysis meaningful.

BTW, where could I find documentation for this array data, other than the readme files that accompany the array matrix on the download site? For example, I have trouble understanding the header of the array data.

"Substance Gene Log Ratio Log Ratio Error p-Value Intensity Flag"

I am not quite sure whether it is OK to use Log Ratio as the signal value. As for Log Ratio Error, p-Value, and Intensity, what do they mean?

Thanks a lot, and I am looking forward to your reply.

Zhe Liu

9.2 years ago by
National Institutes of Health, Bethesda, MD
These data are from a two-color array. You can do some reading on the subject, but two-color arrays were (and to a certain extent still are) a common array design. The Log Ratio column is the one to use as the signal; it usually represents the log2 ratio of the tumor intensity at each spot to that of a "reference" sample. A log ratio of zero means equal signal in both channels, which is why these values center on zero rather than around 7 as single-channel log2 intensities do. Two-color array normalization methods include median-centering and loess normalization, though there are others. Having a mean/median for each array at or near zero suggests that some normalization may already have been done.
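
To make the idea concrete, here is a minimal Python sketch of how a per-spot log2 ratio is formed and how median-centering shifts one array's values so they sit around zero. The intensities and the function names (log2_ratios, median_center) are made up for illustration; this is not the authors' pipeline, and the downloaded Log Ratio column may already have been processed in a similar or different way.

```python
import math
from statistics import median


def log2_ratios(tumor, reference):
    """Per-spot log2(tumor / reference) for one array; intensities must be positive."""
    return [math.log2(t / r) for t, r in zip(tumor, reference)]


def median_center(log_ratios):
    """Subtract the array's median so its log ratios are centered on zero."""
    m = median(log_ratios)
    return [x - m for x in log_ratios]


# Made-up intensities for four spots on one hypothetical array.
tumor = [800.0, 200.0, 400.0, 1600.0]
reference = [200.0, 200.0, 200.0, 200.0]

raw = log2_ratios(tumor, reference)   # [2.0, 0.0, 1.0, 3.0]
centered = median_center(raw)         # median is 1.5, so [0.5, -1.5, -0.5, 1.5]
print(centered)
```

If the per-array median of the downloaded Log Ratio values is already at or very close to zero, a step like this has probably been applied, and centering again would change little.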

If you have detailed questions about these data (after reading the paper), a good place to start is to email the corresponding author.

Hi Sean, thanks a lot for your quick reply!

