Hello, I'm interested in replicating the analysis from the paper "Pancreatic cancer circulating tumour cells express a cell motility gene signature that predicts survival after surgery." I'm having trouble understanding the probe filtering, which is described like this:
Gene expression analysis
For 46,467 probes, fewer than 4 out of 6 CTC samples gave present detection calls and therefore these probes were omitted from further analysis (Additional file 1). We retained a final set of 8,152 probes for statistical analysis
I'm a biologist starting my bioinformatics career, so I don't fully grasp how the SimpleAffy QC graph in Additional file 1 explains their choice.
I hope this isn't too much to ask, but any help would be welcome. If you could share the criteria and script you would use to end up with the 8,152 probes for the analysis, that would help me a lot.
PS: I already wrote to the authors but the emails provided are no longer in use.
Kind regards,
Maxi
The additional file's content is of poor quality and not entirely related to the statement that you have quoted, even if the authors are linking them together. If I were a reviewer, I would have flagged this up because they do not explicitly state the criteria for filtering out these probes. Based on the wording, though, you could assume that they filtered out the 46,467 probes that had fewer than 4 present calls across the 6 CTC samples, i.e., probes whose expression levels fell below the background threshold in 3 or more of those samples.
SimpleAffy is a package, by the way: https://www.bioconductor.org/packages/release/bioc/html/simpleaffy.html
Note that the MAS5 normalisation method returns a present/absent flag for probes, but these authors have not stated whether they used MAS5 or not (?)
In the section "Microarray data analysis":
"To decide whether a signal was significantly above background, the MAS 5.0 algorithm was applied to calculate probe set detection calls. "
Cool, that makes sense, in that case. MAS5 implementation is different depending on whether you are using the oligo or affy package, though.
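To make the assumed rule concrete, here is a minimal sketch in Python of filtering on detection calls. The probe names and calls are made up for illustration, and the cut-off (keep a probe only if it is called present in at least 4 of the 6 samples) is my reading of the quoted sentence, not something the authors confirmed. Treating marginal ("M") calls as not present is also an assumption.

```python
# Hypothetical MAS5-style detection calls: probe -> one call per CTC sample.
# "P" = present, "M" = marginal, "A" = absent.
calls = {
    "probe_1": ["P", "P", "P", "P", "A", "A"],
    "probe_2": ["P", "P", "P", "A", "A", "A"],
    "probe_3": ["A", "M", "A", "A", "A", "A"],
    "probe_4": ["P", "P", "P", "P", "P", "P"],
}


def filter_by_present_calls(calls, min_present=4):
    """Keep probes called present ("P") in at least `min_present` samples.

    Assumption: marginal ("M") calls do not count as present.
    """
    return {
        probe: sample_calls
        for probe, sample_calls in calls.items()
        if sum(call == "P" for call in sample_calls) >= min_present
    }


kept = filter_by_present_calls(calls)
print(sorted(kept))
```

With real data, the calls table would come from whichever MAS5 detection-call implementation you use, and the number of probes kept will depend on that implementation and its settings.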
Hi Kevin, thanks for your answer. I think I found the kind of filtering used in the paper explained in this one: https://f1000research.com/articles/5-1384/v2
It is under the section "Filtering based on intensity".
I cut down from ~46k probes to ~27k, but using thresholds that I only assumed were correct. What do you think?
Thanks for your time!
I am familiar with that paper and it is well known. I see no issue in following it and then just citing it in your own work.
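For reference, the general idea of intensity-based filtering can be sketched in Python as below. This is only an illustration under assumptions: the expression values, the threshold of 4.0 and the minimum of 3 arrays are hypothetical, and the rule (keep a probe if its intensity exceeds the threshold in at least as many arrays as the smallest experimental group) is my understanding of that section, so check it against the workflow itself.

```python
# Hypothetical log2 expression values: probe -> one value per array.
expr = {
    "probe_1": [7.2, 6.9, 7.5, 3.1, 2.8, 3.0],
    "probe_2": [2.5, 3.0, 2.9, 3.2, 2.7, 3.1],
    "probe_3": [8.1, 8.4, 7.9, 8.0, 8.3, 8.2],
    "probe_4": [4.5, 3.9, 3.8, 3.7, 4.2, 3.6],
}


def filter_by_intensity(expr, threshold, min_arrays):
    """Keep probes whose value exceeds `threshold` in at least `min_arrays` arrays."""
    return {
        probe: values
        for probe, values in expr.items()
        if sum(value > threshold for value in values) >= min_arrays
    }


# Assumed settings: threshold chosen by eye from the intensity histogram,
# min_arrays set to the size of the smallest experimental group.
kept = filter_by_intensity(expr, threshold=4.0, min_arrays=3)
print(sorted(kept))
```

The result is sensitive to both settings, so it is worth reporting the threshold and the minimum number of arrays alongside the final probe count.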