Question

Help with analyzing Agilent microarray data

9.6 years ago

krishnakarthik.vemuri ▴ 10

Hello all,

I asked this question here a few weeks ago, but because I provided only limited information about my dataset, I got only a semi-useful answer. Here is my question in more detail.

I have the result of an Agilent microarray assay, possibly a two-channel microarray, in Excel format. This is an example of how the dataset looks:

CLID                                    NAME   GWEIGHT    Pat1A      Pat1B    Pat2A    Pat2B
AGI_HUM1_OLIGO_A_23_P100001                    1          0.331      1.144    -1.165   -0.952
AGI_HUM1_OLIGO_A_23_P100011                    1          -0.254     -0.068   -0.091   0.511
AGI_HUM1_OLIGO_A_23_P10002                     1
AGI_HUM1_OLIGO_A_23_P100022                    1          3.503      2.595    3.612    3.776
AGI_HUM1_OLIGO_A_23_P100033                    1
AGI_HUM1_OLIGO_A_23_P100056                    1          0.565      0.102    1.449    1.718
AGI_HUM1_OLIGO_A_23_P100059                    1
AGI_HUM1_OLIGO_A_23_P100065                    1
AGI_HUM1_OLIGO_A_23_P100074                    1          -0.236     -0.219   0.709    0.792

The experiment is a whole-genome analysis of paired human samples. The only other piece of information I have is that the data have been log-transformed and normalized.

I have been planning to use Bioconductor to perform the analysis, but I am at a loss as to how to go about it. As far as I can tell, both the limma and agilp packages take as input the red- and green-channel output from the Agilent Feature Extraction software, and I don't have those files.

I am reading these data as a composite of the output from the red and green channels for each sample. Is that correct? Or is it a one-channel array?
I am still assuming that Bioconductor is the right way to go, but how do I get the data into R in a form that one of its packages' functions can use? Or, if another package should be used, can you suggest one?
Another problem I have been having is annotating the probes. I am not sure where to get the annotation data. I have looked at the annotation packages in Bioconductor, but again I am not sure which one to use.
A fourth problem: as a first step, I read the data into base R as a text file and can view it as a data frame. I tried to perform a row-wise paired t-test using the multtest command in the genefilter package, but that crashed R every time. I want to use the sapply function from the plyr package. Any ideas how to do it? I have also tried a do loop to perform the t-test on each row of observations, but I ran into problems when some of the values in a row were NAs.
As an aside, I have been able to identify this kind of data file as a pre-clustered file, which only adds to my confusion. I have no idea what that kind of file means, what information it conveys, how it should be analyzed, or what packages to use for it.

I know this is a long post with a lot of questions, but I have tried to be as detailed as possible so that I can get some useful answers. I would really appreciate anyone taking the time to answer these questions. If you can suggest useful literature or reading material, I would be grateful for that too.

Cheers,
Krishna

agilent microarray bioconductor • 5.3k views

updated 2.4 years ago by Ram • written 9.6 years ago by krishnakarthik.vemuri

Ram · Answer 1 · 2014-09-07

Before answering your questions, I have a couple of general comments. First, if these data are integral to your work, it is important to do the very tedious task of sorting out how the data were generated and what they represent. This will often require emails to folks who may not respond for several days or even weeks, even after repeated attempts. However, for your results to be trustworthy, you really need to understand the data provenance as fully as possible. Guessing and assuming is sometimes forced by circumstances, but it is certainly inferior to really knowing. Second, some of your questions above hint at limited R experience (but I could be very wrong on this). I always recommend that folks using R find a local person with whom to discuss problems or ideas.

Now, to answer your questions directly:

The only way to truly answer your question is to ask the producer of the data. That said, if the median of each sample is near zero, then the data were likely two-channel: normalized log ratios are centered near zero, whereas single-channel log intensities typically are not. In the end, one- or two-channel data (assuming a common reference) will be analyzed using identical code in limma.
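As a rough illustration of the median check, here is a minimal base R sketch. The values are just the non-empty rows from the snippet in the question (plus one all-missing row), so the numbers themselves say nothing reliable about the full dataset; `expr` is a hypothetical name for the numeric matrix.

```r
# Values copied from the rows shown in the question; in practice `expr`
# would hold the full numeric matrix (probes in rows, samples in columns).
expr <- rbind(
  c( 0.331,  1.144, -1.165, -0.952),
  c(-0.254, -0.068, -0.091,  0.511),
  c(    NA,     NA,     NA,     NA),   # a probe with no measurements
  c( 3.503,  2.595,  3.612,  3.776),
  c( 0.565,  0.102,  1.449,  1.718),
  c(-0.236, -0.219,  0.709,  0.792)
)
colnames(expr) <- c("Pat1A", "Pat1B", "Pat2A", "Pat2B")

# Per-sample median and range, ignoring missing values.
sample_medians <- apply(expr, 2, median, na.rm = TRUE)
sample_ranges  <- apply(expr, 2, range,  na.rm = TRUE)

print(round(sample_medians, 3))
print(sample_ranges)
```

On the real data, medians close to zero with values on both sides of zero would be consistent with log ratios, but this is only a hint and not a substitute for asking the data producer.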
limma will happily work with a plain numeric matrix. Once you have read the data in and converted them to a matrix (probes in rows, samples in columns), you should be ready to go.
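A minimal sketch of getting from the spreadsheet to a matrix, assuming the sheet has been exported as comma-separated text with the column layout shown in the question. The real file name and separator are unknown, so a few rows are inlined here; `csv_text`, `raw`, and `expr` are hypothetical names.

```r
# Hypothetical CSV export of the spreadsheet; with a real file you would
# pass its path to read.csv() instead of using the `text` argument.
csv_text <- "CLID,NAME,GWEIGHT,Pat1A,Pat1B,Pat2A,Pat2B
AGI_HUM1_OLIGO_A_23_P100001,,1,0.331,1.144,-1.165,-0.952
AGI_HUM1_OLIGO_A_23_P100011,,1,-0.254,-0.068,-0.091,0.511
AGI_HUM1_OLIGO_A_23_P10002,,1,,,,
AGI_HUM1_OLIGO_A_23_P100022,,1,3.503,2.595,3.612,3.776"

raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Everything except the three descriptive columns is treated as a sample.
sample_cols <- setdiff(names(raw), c("CLID", "NAME", "GWEIGHT"))

# Numeric matrix with probe IDs as row names; empty cells become NA.
expr <- as.matrix(raw[, sample_cols])
rownames(expr) <- raw$CLID
storage.mode(expr) <- "numeric"

str(expr)
```

The resulting `expr` is the kind of object that matrix-based functions expect; whether any extra header rows need to be skipped depends on the actual file.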
Again, you'll have to sort out what array platform was used to produce the data and go from there. If this is a commercial array (and it appears it is), then finding the catalog name/number is the first place to start. Asking the person who provided the data is probably a good way to go with this one, also.
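Whichever annotation source turns out to match the platform, its probe identifiers may well not carry the `AGI_HUM1_OLIGO_` prefix seen in the CLID column. The sketch below strips that prefix and looks the IDs up in an annotation table. The `annot` data frame, its column names, and the gene labels are placeholders invented for illustration, not real annotation.

```r
# Two CLID values taken from the question.
clid <- c("AGI_HUM1_OLIGO_A_23_P100001", "AGI_HUM1_OLIGO_A_23_P10002")

# Assumption: the text after the "AGI_HUM1_OLIGO_" prefix is the probe ID
# used by the annotation source.
probe_id <- sub("^AGI_HUM1_OLIGO_", "", clid)
print(probe_id)

# Hypothetical annotation table; a real one would come from the array
# vendor or an annotation package once the platform is known.
annot <- data.frame(
  ProbeID    = c("A_23_P10002", "A_23_P100001"),
  GeneSymbol = c("placeholderX", "placeholderY"),
  stringsAsFactors = FALSE
)

# match() keeps the order of probe_id and returns NA for unknown probes.
symbols <- annot$GeneSymbol[match(probe_id, annot$ProbeID)]
print(data.frame(clid, probe_id, symbols))
```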
You can 1) use functionality that is robust to NAs, 2) exclude all rows with NAs, or 3) impute the missing values. I would suggest using limma for testing, as it is designed with small sample sizes in mind and is fast. As I mentioned above, converting the data to a matrix is the easiest way forward.
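Here is a sketch of options 1 and 2, followed by a paired limma fit. Everything in it is assumed for illustration: the data are simulated, and the layout (four patients, conditions "A" and "B", columns ordered Pat1A, Pat1B, Pat2A, ...) is a guess based on the column names in the question. The limma part only runs if the package is installed.

```r
set.seed(1)

# Simulated stand-in for the real matrix: 200 probes, 4 patients x 2 conditions.
n_probes  <- 200
patient   <- factor(rep(paste0("Pat", 1:4), each = 2))
condition <- factor(rep(c("A", "B"), times = 4))
expr <- matrix(rnorm(n_probes * 8), nrow = n_probes,
               dimnames = list(paste0("probe", seq_len(n_probes)),
                               paste0(patient, condition)))
expr[1:10, ] <- NA   # probes with no data at all, as in the question
expr[11, 1]  <- NA   # a probe with a single missing value

# Option 2: drop every row that contains at least one NA.
expr_complete <- expr[complete.cases(expr), , drop = FALSE]

# Option 1: a row-wise paired t-test that tolerates NAs.
# A paired t-test is a one-sample t-test on the within-patient differences.
paired_p <- function(x, min_pairs = 3) {
  d <- x[condition == "B"] - x[condition == "A"]
  d <- d[!is.na(d)]                      # keep only complete pairs
  if (length(d) < min_pairs || sd(d) == 0) return(NA_real_)
  t.test(d)$p.value
}
p_raw <- apply(expr, 1, paired_p)
p_adj <- p.adjust(p_raw, method = "BH")  # NAs are left as NA
print(summary(p_raw))

# Paired analysis in limma: patient is a blocking factor in the design.
if (requireNamespace("limma", quietly = TRUE)) {
  design <- model.matrix(~ patient + condition)
  fit <- limma::eBayes(limma::lmFit(expr_complete, design))
  print(limma::topTable(fit, coef = "conditionB", number = 5))
}
```

With only a handful of pairs, the plain t-test has very little power, which is one reason the moderated statistics from limma are usually preferred for this kind of data.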
I am not sure how we can help you on this one as we don't know what a "pre-clustered" file is. Again, it is best to contact the source of the data for the details.