I am currently working on an old project that was never published, and we are trying to resurrect it by combining it with other data.
The project consists of approximately 100 microarrays from muscle and liver, measured in the same individuals. One step is to determine which probes can be considered expressed, and then which of those are common to both tissues, before going ahead with the other steps.
I am not used to working on microarrays but I have developed the following pipeline:
Starting from the raw .CEL files, I loaded them into an affy object using the ReadAffy function in Bioconductor. Then I calculated the Absent/Present calls using the mas5calls function.
Calling probeset the object returned by ReadAffy (all the .CEL files plus the phenotypic information), the code is:
library(affy)
probeset.mas5calls <- mas5calls(probeset)            # detection calls computed on the raw data
exprs_matrix.mas5calls <- exprs(probeset.mas5calls)  # matrix of P/M/A calls
That gives me a matrix with probesets as rows and samples as columns, with a P/M/A call for each probe in each sample: P = Present (probe expressed), M = Marginal (probe barely expressed), and A = Absent (probe not expressed).
This expression estimate is computed on the raw data: no prior normalisation or filtering, just the raw .CEL files loaded into an affy object.
Then I filtered the probes, assuming that if a probe has an A (Absent) call in any sample, the gene is not expressed and should be discarded.
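To make that filter concrete, here is a minimal sketch in base R. It runs on a made-up call matrix that stands in for exprs_matrix.mas5calls; the probeset and sample names are invented for illustration and the variable names (calls, has_absent, kept_strict) are not from my real script.

```r
# Toy P/M/A matrix standing in for exprs(mas5calls(probeset)).
# Probeset and sample names are hypothetical.
calls <- matrix(
  c("P", "P", "P", "P",
    "P", "A", "P", "P",
    "A", "A", "A", "M",
    "P", "M", "P", "P"),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("probeset_", 1:4), paste0("sample_", 1:4))
)

# Strict filter: discard a probeset if any sample has an "A" call.
has_absent <- rowSums(calls == "A") > 0
kept_strict <- rownames(calls)[!has_absent]
print(kept_strict)
```

Note that under this rule a probeset with an M call but no A call is still kept, which may or may not be what is wanted.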
I have two questions:
Is this approach correct for estimating which probes are expressed in a set of .CEL files, i.e. starting from the raw .CELs and running the mas5calls function on them? Should I do any normalisation or filtering before that? Are there other ways to do it? I assume so, from what I have seen online, but this was the most straightforward way I found.
When filtering on P/M/A, i.e. keeping a probe only when it has a P call in at least n samples, which threshold should I use? Obviously, with the most stringent option (only probes that have a P in all the samples) I get fewer probes than if I require a P in just 50% of the samples, or 60/70/80%. What would be a proper threshold? I have been told 80-90%, or even 100%, but I am not sure about that...
Is there any other function or way to calculate that?
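For the threshold question, this is roughly how I would compare cut-offs, as a sketch only. The helper filter_by_present is a name I made up, the two call matrices are toy data (four samples per tissue, so the fractions 0.5, 0.75 and 1 are exact), and the thresholds are just examples, not a recommendation.

```r
# Hypothetical helper: keep probesets called "P" in at least
# min_frac of the samples (columns) of a P/M/A call matrix.
filter_by_present <- function(calls, min_frac) {
  frac_present <- rowMeans(calls == "P")
  rownames(calls)[frac_present >= min_frac]
}

# Toy call matrices, one per tissue, with the same probesets as rows.
ids <- paste0("probeset_", 1:4)
muscle <- matrix(
  c("P", "P", "P", "P",
    "P", "P", "P", "A",
    "P", "P", "A", "A",
    "A", "A", "M", "A"),
  nrow = 4, byrow = TRUE, dimnames = list(ids, paste0("muscle_", 1:4))
)
liver <- matrix(
  c("P", "P", "P", "P",
    "P", "P", "P", "P",
    "P", "A", "A", "A",
    "P", "P", "P", "P"),
  nrow = 4, byrow = TRUE, dimnames = list(ids, paste0("liver_", 1:4))
)

# Number of probesets passing the filter in both tissues, per threshold.
thresholds <- c(0.5, 0.75, 1)
n_common <- sapply(thresholds, function(th) {
  length(intersect(filter_by_present(muscle, th),
                   filter_by_present(liver, th)))
})
names(n_common) <- thresholds
print(n_common)
```

Here the filter is applied within each tissue separately before intersecting, so a probeset expressed in only one tissue is not penalised by the other tissue's samples until the final comparison.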