Question

Secondary Filtering of DE Results to Eliminate Lowly-Expressed Transcripts

0

Entering edit mode

2.7 years ago

mvanhorn • 0

Hello,

I'm currently trying to analyze an RNA-seq data set of about 10500 samples (including biological replicates) and initially noticed some strange outputs in my differential expression output file. I am using the CIRIquant pipeline with the built-in differential expression function which uses the edgeR module to calculate DE circular RNA. There were only six DE circRNA, which I initially assumed to be because of the larger sample size. However, when scrolling through the output, I saw that over half (646,400 or 61%) of the circRNAs listed had a logFC = 0, PValue = 1, and FDR = 1.

After contacting the developer, they suggested that this is due to many circRNAs being expressed in only a few samples, so that the average expression is zero in both the experimental and control group, and at the same time, that the large number of circRNAs means that it is hard for the DE circRNAs to pass FDR correction. They suggested filtering the DE results based on occurrence and expression level to reduce the total number of circRNA, and then re-calculate the FDR values.

I am trying to follow their advice and use edgeR for a secondary filtering, but was unable to use the filterByExpr() function as my data has already been processed within the DE portion of the CIRIquant pipeline and I cannot find a way to retrieve count data or library sizes. I was also looking at using rowSums() and specifying a minimum logCPM to use as a filter but I don't know if this is the best method or if there is a standard minimum logCPM to use.

Does anybody have any suggestions on how to filter already-processed DE results or what tools/functions I should be using?

Thank you for your time & any advice!

RNA-seq edgeR • 1.3k views

ADD COMMENT • link updated 2.7 years ago by LChart 5.2k • written 2.7 years ago by mvanhorn • 0

score 1 · Answer 1 · 2023-03-07

1

Entering edit mode

2.7 years ago

LChart 5.2k

Yes; in point of fact there is a common methodology termed independent filtering that can be applied in this setting. You can choose any statistic that is independent of the class labels, such as the average logCPM, or the variance, or both, or neither. You can set these thresholds a priory or set a target number of genes passing the filters a prioiri and adjust the thresholds to match. It then suffices to correct for multiple tests only for genes passing your thresholds; and the type-1 error is still controlled.

What you can not do is adjust the thresholds so that they maximize the number of FDR-significant genes; this will necessarily inflate the type-1 error.

If you plot a histogram of mean expression of circRNA, what does it look ilke? I would imagine as few as 5% are well expressed -- you have over 1,000,000 circRNA quantified; and i doubt that circRNA are expressed at a level comparable to mRNA of which typically 12K-20K are robustly quantifiable. As such I'd tune my filters to cut down to 10k or 20k circRNA; set the FDR for the remaining to NA; and re-compute the FDR on my chosen set.

ADD COMMENT • link 2.7 years ago by LChart 5.2k

0

Entering edit mode

Thank you for your reply, the description of independent filtering was very insightful.

I tried plotting a histogram of the average logCPM, hoping for a bimodal distribution that I could use for a cut-off point. However, the scale is such that I am not able to see much of any potentially well-expressed circRNA. I am a newer user to R and haven't figured out if there is a way to manipulate the scale appropriately, but I can see manually looking at the average logCPM values that I do have a small population approaching an average logCPM of zero.

Here is what the histogram looks like: enter image description here

Can you explain how to tune filters to set a target number of circRNA/genes remaining in a population? Is it as simple as sorting by descending average logCPM and removing all genes below a certain row number (10k-20k)?

Thank you again!

ADD REPLY • link 2.7 years ago by mvanhorn • 0

0

Entering edit mode

Assuming an n_gene x n_sample expression matrix, you should be able to do something like avgCPM <- apply(expr, 1, mean); exprFilt <- expr[rank(-avgCPM) <= 10000,]

ADD REPLY • link 2.7 years ago by LChart 5.2k