Question: RNA-sequencing dataset analyzed twice: good and poor estimate of false-positive rates
GreenDiamond70 wrote (16 days ago):

I am trying to find an example where an RNA-sequencing dataset was analyzed in one manner that led to a poor false-positive rate estimate, and then analyzed in another manner that led to a good false-positive rate estimate. Ideally, both analysis methods yield DEGs; only in the first case, these DEGs are less trustworthy than in the second case because of the poorer false-positive rate estimation.

I am not concerned about what package was used for the analysis or what type of RNA-sequencing dataset.

In case this is a strange question to ask... I am hoping to examine visually what DEGs "look like" when they come from an analysis that has a poor false-positive rate estimation versus a good false-positive rate estimation.

If anyone is aware of any such example code that could produce such items, I would be very grateful to be pointed in that direction! Thank you.
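To make this concrete, here is a toy sketch of what I mean by estimating a false-positive rate, using simulated counts in base R (all parameters invented for illustration, not taken from any of the tools discussed here). Since no gene has a true group difference, every gene called DE is a false positive by construction:

```r
# Sketch: empirical false-positive rate on a null dataset.
# Every gene is simulated with NO group difference, so any "DEG"
# called at p < 0.05 is a false positive. Parameters are illustrative.
set.seed(1)
n.genes <- 2000
n.per.group <- 5
# negative-binomial counts, identical distribution in both groups (null)
counts <- matrix(rnbinom(n.genes * 2 * n.per.group, mu = 100, size = 2),
                 nrow = n.genes)
group <- rep(c("A", "B"), each = n.per.group)

# per-gene t-test on log2(count + 1)
p.log <- apply(counts, 1, function(x)
  t.test(log2(x + 1)[group == "A"], log2(x + 1)[group == "B"])$p.value)

# empirical false-positive rate = fraction of null genes called at 0.05
fpr <- mean(p.log < 0.05)
fpr  # should sit near the nominal 0.05 if the test's error rate is honest
```

An analysis whose empirical rate on such null data sits far from the nominal level is the "poor estimation" case I am after.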

rna-seq false-positive
vchris_ngs (Milan, Italy) wrote (16 days ago):

I would like to highlight certain things that might help in your search:

  1. When you speak of trustworthy DEGs, they either have to be confirmed by qPCR or NanoString, or be gold-standard DEGs. You could then search for benchmarking papers that apply different aligners, quantification techniques, and DE analysis tools to the same set of datasets to find the TPs and FPs among the DEGs.
  2. Having said that, one cannot be sure unless most of these DEGs are validated; that makes for a stronger claim, as does the DEGs coming from a study in which they have been widely used as a gold standard.
  3. So I will point you to some benchmark studies, and you can take a look and see whether they help.

Link 1

Link 2

They might help as a start. Let me know if they work for you. Thanks.


Thank you very much for these links and words of advice!

As I said in my original post: "I am hoping to examine visually what DEGs 'look like' when they come from an analysis that has a poor false-positive rate estimation versus a good false-positive rate estimation." I should have added a sentence afterward to make clearer what I am hoping to obtain in the end:

1) The read counts for the DEGs from the analysis with evidence of a high false-positive rate.
2) The read counts for the DEGs from the analysis with evidence of a low false-positive rate.

I enjoyed reading those two links you sent! The paper seemed promising because there are indeed big differences in false positives between, say, lr and limma.voom (as seen in their Table 1). However, I have been trying to obtain the DEGs for any one set of parameters (say n=3 and p=10^-4). I used their code and followed its syntax closely, but I am getting strange results. For instance, when I run limma.voom with their parameters, every gene comes out as a DEG!

The RNAontheBENCH package was really neat to read about! I started by running their deNanoString() function on the data exactly as in their vignette, with every combination of options (edgeR/DESeq etc. with upperquartile/TMM etc.). However, almost all of them gave a fairly high AUC (around 0.85). I am trying to find an example of really terrible DEG calls.

So I then tried running their deSpikein() function on the data exactly as in their vignette, again with every combination of options (edgeR/DESeq etc. with upperquartile/TMM etc.). This time the AUC values of the ROC curves in the generated plots ranged from 0.371 to 0.893, so I was pretty excited. I am now stuck trying to obtain the information I need (listed as (1) and (2) above).
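As an aside, my understanding of the AUC being reported is the usual rank-based ROC AUC: the probability that a randomly chosen true positive (e.g. a truly differential spike-in) scores higher than a randomly chosen true negative. The small base-R helper below reproduces that definition, though I cannot be sure it matches how RNAontheBENCH computes its AUCs internally:

```r
# Rank-based ROC AUC from a score (e.g. -log10 p) and truth labels.
# Uses the Wilcoxon/Mann-Whitney identity; midranks handle ties.
# This is my own reconstruction, not code from RNAontheBENCH.
auc <- function(score, is.positive) {
  r <- rank(score)
  n1 <- sum(is.positive)
  n0 <- sum(!is.positive)
  (sum(r[is.positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# toy check: perfectly separating scores give AUC = 1
auc(score = c(0.1, 0.2, 0.9, 0.8),
    is.positive = c(FALSE, FALSE, TRUE, TRUE))  # -> 1
```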

Essentially, I simply ran these commands:

library(RNAontheBENCH)  # package providing deSpikein() and the example data
data(exampledata)
# high-AUC analysis: limma-voom with TMM normalization
resHigh <- deSpikein(exampleGeneLevel, method="voom", norm="TMM", quantification="Tophat-featureCounts")
# low-AUC analysis: ordinary t-test with linear normalization
resLow <- deSpikein(exampleGeneLevel, method="t", norm="linear", quantification="Tophat-featureCounts")

resHigh and resLow both list a "log2FC" and a "p" for 92 ERCC spike-ins. resHigh (which had AUC=0.893) has lower p-values than resLow (which had AUC=0.371). I am now trying to simulate a typical experiment, except that this time I know in advance that one analysis will lead to low-quality DEGs and the other to high-quality DEGs. So I just hope to obtain the rows of the DEGs from exampleGeneLevel (15583 rows) for both the "voom"/"TMM" method and the "t"/"linear" method. This leaves me with two problems:

1) I do not know what the treatment groups are (if there are any) for exampleGeneLevel. To obtain what I am looking for, I am assuming there may be 2 treatment groups of 6 replicates among the 12 samples of exampleGeneLevel. I tried reading the original paper and looking on NCBI, but I cannot figure out which samples belong to which group.

2) If there are indeed two groups in exampleGeneLevel, it seems I may have to write this code on my own. I could follow the limma-voom vignette on the exampleGeneLevel data and make sure to use "TMM" normalization. I am not too sure how to run the "t" method on the exampleGeneLevel data, to be honest. Either way, the problem is that my code could differ from what the package did to calculate the AUC values; ideally, I should keep it as similar as possible. So I am wondering whether I might be missing something in the package that already provides this code?
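To show the kind of code I might write for the "t"/"linear" arm, here is a base-R sketch. It assumes (all guesses!) that "linear" means rescaling each sample to a common library size, that "t" means an ordinary per-gene t-test, and that the 12 samples split into two groups of 6; none of this is confirmed against the package's source:

```r
# Plain "t"/"linear" analysis on a count matrix, ASSUMING "linear" means
# rescaling each sample to the mean library size and "t" means an ordinary
# per-gene t-test. These are guesses about deSpikein's internals.
run.t.linear <- function(counts, group) {
  sf <- colSums(counts) / mean(colSums(counts))  # per-sample size factors
  norm <- sweep(counts, 2, sf, "/")              # linear rescaling
  p <- apply(norm, 1, function(x)
    t.test(x[group == 1], x[group == 2])$p.value)
  log2FC <- log2((rowMeans(norm[, group == 2]) + 1) /
                 (rowMeans(norm[, group == 1]) + 1))
  data.frame(log2FC = log2FC, p = p)
}

# hypothetical usage on exampleGeneLevel (the 6-vs-6 grouping is a guess):
# group <- rep(1:2, each = 6)
# res <- run.t.linear(as.matrix(exampleGeneLevel), group)
# deg.rows <- which(res$p < 0.05)  # rows of the candidate DEGs

# self-contained demo on simulated counts so the sketch runs as-is
set.seed(2)
demo <- matrix(rnbinom(200 * 12, mu = 50, size = 2), nrow = 200)
res <- run.t.linear(demo, rep(1:2, each = 6))
str(res)
```

The "voom"/"TMM" arm would instead go through edgeR's calcNormFactors() and limma's voom()/lmFit()/eBayes(), as in the limma vignette, though again this may differ from what the package does internally.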

This became a long post, and I apologize for that. If you have any further advice, I would greatly appreciate hearing it. Thank you again for these references.

written 7 days ago by GreenDiamond70
Powered by Biostar version 2.3.0