Hello,
I'm currently trying to analyze an RNA-seq data set of about 10500 samples (including biological replicates) and initially noticed some strange outputs in my differential expression output file. I am using the CIRIquant pipeline with the built-in differential expression function which uses the edgeR module to calculate DE circular RNA. There were only six DE circRNA, which I initially assumed to be because of the larger sample size. However, when scrolling through the output, I saw that over half (646,400 or 61%) of the circRNAs listed had a logFC = 0, PValue = 1, and FDR = 1.
After contacting the developer, they suggested that this is due to many circRNAs being expressed in only a few samples, so that the average expression is zero in both the experimental and control group, and at the same time, that the large number of circRNAs means that it is hard for the DE circRNAs to pass FDR correction. They suggested filtering the DE results based on occurrence and expression level to reduce the total number of circRNA, and then re-calculate the FDR values.
I am trying to follow their advice and use edgeR for a secondary filtering, but was unable to use the filterByExpr() function as my data has already been processed within the DE portion of the CIRIquant pipeline and I cannot find a way to retrieve count data or library sizes. I was also looking at using rowSums() and specifying a minimum logCPM to use as a filter but I don't know if this is the best method or if there is a standard minimum logCPM to use.
Does anybody have any suggestions on how to filter already-processed DE results or what tools/functions I should be using?
Thank you for your time & any advice!
Thank you for your reply, the description of independent filtering was very insightful.
I tried plotting a histogram of the average logCPM, hoping for a bimodal distribution that I could use for a cut-off point. However, the scale is such that I am not able to see much of any potentially well-expressed circRNA. I am a newer user to R and haven't figured out if there is a way to manipulate the scale appropriately, but I can see manually looking at the average logCPM values that I do have a small population approaching an average logCPM of zero.
Here is what the histogram looks like:
Can you explain how to tune filters to set a target number of circRNA/genes remaining in a population? Is it as simple as sorting by descending average logCPM and removing all genes below a certain row number (10k-20k)?
Thank you again!
Assuming an
n_gene x n_sample
expression matrix, you should be able to do something likeavgCPM <- apply(expr, 1, mean); exprFilt <- expr[rank(-avgCPM) <= 10000,]