Question

How to do a benchmark analysis of RNA-Seq tools?

1

4.9 years ago

vin.darb ▴ 300

Hi !

I'm working on an RNA-Seq data analysis with the following experimental design:

sample          phenotype  treatment
3 x Col         control    normal
3 x Col_HS      control    heat-shock
3 x mutant      mutant     normal
3 x mutant_HS   mutant     heat-shock

The main objective is to discover the effect of the mutation on transcription; the control vs heat-shock analysis serves as a quality control for the analyses, because the effects of heat stress have been well studied. However, I don't know which approach to use to validate my analysis. I retrieved a list of 12 genes experimentally validated as differentially expressed under heat shock, and I checked whether I found them in my experiment.

I find 9 of the 12, but as I have more than 1000 differentially expressed genes, I don't know whether that is good or not.
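One rough way to put the "9 of 12" figure in context is a hypergeometric test: if the differentially expressed (DE) list were a random draw from all tested genes, how likely would it be to recover at least 9 of the 12 validated genes? The sketch below is only an illustration. The total of 20,000 tested genes and the round figure of 1,000 DE genes are assumptions (the post only says "more than 1000"), and the function name `hypergeom_sf` is made up for this example.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k): chance of at least k hits when n reference genes are
    drawn without replacement from N tested genes, K of which are DE."""
    total = comb(N, n)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

# Assumed numbers (not from the post): 20,000 tested genes, 1,000 called DE.
# From the post: 12 validated heat-shock genes, 9 of them recovered.
N_TESTED, N_DE, N_KNOWN, N_HITS = 20000, 1000, 12, 9

p = hypergeom_sf(N_HITS, N_TESTED, N_DE, N_KNOWN)
expected_hits = N_KNOWN * N_DE / N_TESTED
recall = N_HITS / N_KNOWN

print(f"recall on validated genes: {recall:.2f}")
print(f"hits expected by chance:   {expected_hits:.1f}")
print(f"P(at least {N_HITS} hits by chance) = {p:.3g}")
```

A small p-value here only says the overlap is far above chance. It says nothing about the other DE genes, and 12 reference genes is a very small set for estimating sensitivity.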

My question is: how do you judge the relevance of an analysis? I am trying different tools and would like to compare them in order to choose the best one. How can I do this?

Thanks

transcription splicing RNA-Seq • 1.6k views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 4.9 years ago by vin.darb ▴ 300

1

However, I don't know which approach to use to validate my analysis.

Shouldn't the validation be done by an independent experimental method? With data analysis you are only generating hypotheses to test.

ADD REPLY • link 4.9 years ago by GenoMax 141k

0

Thanks for all the answers. In fact, I am testing different methods to detect alternatively spliced genes between two conditions (DEXSeq, rMATS, RATS, ...), and I saw that several tools (such as polyester) can generate simulated reads from parameters specific to an RNA-seq dataset (sequencing coverage, read length, etc.). I wanted to try this and then calculate the true positive rate, the false positive rate, etc. This would give me validation on a simulated dataset, in addition to the experimentally validated genes that I recover in my gene list. Do you think it's a good approach?
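As a minimal sketch of the scoring step described above: once the simulation provides a known set of truly spliced genes, each tool's calls can be compared against it. The gene IDs, tool labels and the helper name `confusion_rates` below are hypothetical, and the sketch assumes every called gene belongs to the set of tested genes.

```python
def confusion_rates(called, truth, universe):
    """Compare one tool's calls with the simulated ground truth.

    called   -- genes the tool reports as differentially spliced
    truth    -- genes simulated as differentially spliced
    universe -- all genes that were tested
    """
    called, truth, universe = set(called), set(truth), set(universe)
    tp = len(called & truth)             # true positives
    fp = len(called - truth)             # false positives
    fn = len(truth - called)             # false negatives
    tn = len(universe - called - truth)  # true negatives
    nan = float("nan")
    return {
        "TPR": tp / (tp + fn) if tp + fn else nan,
        "FPR": fp / (fp + tn) if fp + tn else nan,
        "FDR": fp / (tp + fp) if tp + fp else nan,
    }

# Toy data, purely illustrative.
universe = [f"gene{i}" for i in range(1, 11)]
truth = {"gene1", "gene2", "gene3", "gene4"}
calls = {
    "toolA": {"gene1", "gene2", "gene5"},
    "toolB": {"gene1", "gene2", "gene3", "gene6", "gene7"},
}

for tool, called in calls.items():
    print(tool, confusion_rates(called, truth, universe))
```

Reporting the false discovery rate next to the true positive rate matters here: tuning parameters only to maximize the true positive rate would favour settings that call almost everything.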

ADD REPLY • link 4.9 years ago by vin.darb ▴ 300

2

Did you search the literature? For example:

Comparison of RNA-seq and microarray platforms for splice event detection using a cross-platform algorithm

Comparison of Alternative Splicing Junction Detection Tools Using RNA-Seq Data

edit: by the way, I agree with genomax and believe proper validation means confirming these results with different methods on (preferably) a different dataset. You can probably decrease the number of false positives by taking the intersection of the results of different tools on one dataset, but this is not validation.
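The intersection idea can be sketched in a few lines, assuming each tool's significant genes have already been exported as a set of gene IDs. The gene IDs and the helper name `genes_called_by_at_least` are hypothetical; the tool names are only labels taken from the thread.

```python
from collections import Counter

def genes_called_by_at_least(calls, min_tools):
    """Return the genes reported by at least `min_tools` of the tools."""
    counts = Counter(gene for genes in calls.values() for gene in set(genes))
    return {gene for gene, c in counts.items() if c >= min_tools}

# Illustrative call sets, not real results.
calls = {
    "DEXSeq": {"geneA", "geneB", "geneC"},
    "rMATS": {"geneB", "geneC", "geneD"},
    "RATS": {"geneC", "geneE"},
}

consensus = genes_called_by_at_least(calls, min_tools=len(calls))  # strict intersection
majority = genes_called_by_at_least(calls, min_tools=2)            # at least 2 of 3 tools

print("all tools:", sorted(consensus))
print("2+ tools: ", sorted(majority))
```

A strict intersection trades sensitivity for specificity, so a majority vote is sometimes a gentler compromise. Either way, agreement between tools run on the same reads is not independent confirmation.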

ADD REPLY • link 4.9 years ago by h.mon 35k

2

Good point - I didn't previously notice the "splicing" tag, so I revised my answer about gene expression programs.

I also personally like QoRTs+JunctionSeq (and sometimes SGSeq for specific genes). While JunctionSeq has a gene-level metric (and overall gene plots), I think it is mostly comparing exons and junctions (with different dispersions). So, describing "exon" or "junction" counts (rather than "gene" counts) may help get answers more specifically related to splicing analysis.

However, I believe those are a little harder to benchmark, and I think the splicing analysis may require extra work to assess your results (and I think there is extra variability, such that having more replicates may be relatively more important). Nevertheless, sometimes even Sashimi plots in IGV can be useful for a gene of interest, even without any differential exon / splicing analysis.

ADD REPLY • link 4.9 years ago by Charles Warden 8.2k

0

I have not used JunctionSeq; I'll go have a look at it, thanks.

Some experiments will be done to confirm some genes of interest (as h.mon and genomax said). I have already used IGV to visualise some AS genes, but the reason I wanted to benchmark the tools is that, in parallel with the analyses, I must write a report for my master's degree explaining my strategies and why I used certain parameters, thresholds, etc.

So my first idea was to use artificially generated reads (with more replicates, for example), so that the differentially spliced genes in that dataset are known, and then to run the tools while varying the parameters to maximize the true positive rate.

But if you tell me that alternative splicing studies are difficult to benchmark, maybe I will not waste my time, aha.

Thanks for all the advice anyway.

ADD REPLY • link 4.9 years ago by vin.darb ▴ 300

0

In a sense, the "benchmark" is trying to find the best way to represent your data (and validate your own results). So, in that sense, I think it is important.

However, notes in your paper's methods should probably be OK (compared to spending time for a separate benchmark paper, which may or may not represent the best thing for somebody else to do).

Best of luck!

ADD REPLY • link 4.9 years ago by Charles Warden 8.2k

score 2 · Answer 1 · 2019-05-30

Please use the search function, and especially Google, to find benchmarking papers and blogs. See for example this blog from Mike Love, developer of DESeq2, on a comparison of edgeR and DESeq2. Do not try to do custom comparisons unless you have expert knowledge of the underlying statistics and of the way the tools handle data internally; there are simply too many pitfalls that can skew the comparison. I recommend looking into the three most popular tools (edgeR, DESeq2, limma). Go through the papers and manuals, check which assumptions they make about the data, and eventually choose the one you feel most comfortable with. I personally (currently) use this pipeline: read quantification with salmon, gene-level summarization with tximport, and DEG analysis with edgeR. All three tools have outstanding documentation, and the developers are responsive on BioC or here on Biostars.

score 1 · Answer 2 · 2019-05-30

For gene expression, I agree with the recommendation to try edgeR / DESeq2 / limma-voom.

However, I think you frequently need to assess the bioinformatics methods, such that you should be running benchmark comparisons for each project. If you have large numbers of differentially expressed genes, it might even be worthwhile to see if using a standard statistical analysis (ANOVA, linear regression via lm(), etc.) on log2(FPKM + 0.1) values helps (more so than increasing the FDR cutoff).
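To make the "standard statistical analysis" suggestion concrete, here is a minimal single-gene sketch of an ordinary least-squares fit of log2(FPKM + 0.1) on a two-level group variable, written in plain Python rather than R's lm(). The FPKM values and the function name `lm_two_groups` are invented for illustration. The sketch stops at the t-statistic; a p-value would come from a t distribution with the returned degrees of freedom, using a statistics package of your choice.

```python
from math import log2, sqrt

def lm_two_groups(fpkm_a, fpkm_b, pseudocount=0.1):
    """OLS fit of log2(FPKM + pseudocount) ~ group, with group a = 0, b = 1.

    Returns (slope, t_statistic, degrees_of_freedom). With a 0/1 covariate
    the slope is the difference of the mean log2 values (b minus a).
    """
    y = [log2(v + pseudocount) for v in list(fpkm_a) + list(fpkm_b)]
    x = [0.0] * len(fpkm_a) + [1.0] * len(fpkm_b)
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares and standard error of the slope.
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = sqrt(rss / (n - 2) / sxx)
    return slope, slope / se_slope, n - 2

# Invented FPKM values for one gene: 3 control and 3 heat-shock replicates.
control = [10.0, 12.0, 11.0]
heat_shock = [40.0, 35.0, 50.0]

slope, t_stat, df = lm_two_groups(control, heat_shock)
print(f"log2 fold change = {slope:.2f}, t = {t_stat:.2f}, df = {df}")
```

With only three replicates per group the per-gene variance estimate is noisy, which is one reason the dedicated tools share information across genes when estimating dispersion.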

I think having an independently calculated expression value (such as the most direct FPKM calculation, with either aligned reads or quantified reads) is a good way to assess your results.
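For reference, the direct FPKM calculation is the fragment count scaled by gene length in kilobases and by library size in millions of mapped fragments. A minimal sketch, with made-up counts, lengths and library size, and a hypothetical helper named `fpkm`:

```python
def fpkm(count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of gene per million mapped fragments."""
    return count * 1e9 / (gene_length_bp * total_mapped_fragments)

# Made-up example values for one sample.
counts = {"geneA": 500, "geneB": 50}        # fragments assigned to each gene
lengths = {"geneA": 2000, "geneB": 1000}    # gene length in bp
library_size = 20_000_000                   # total mapped fragments in the sample

values = {gene: fpkm(counts[gene], lengths[gene], library_size) for gene in counts}
for gene, value in values.items():
    print(f"{gene}: FPKM = {value:.2f}")
```

Which length to use (total exonic length, longest transcript, effective length) and which reads count as "mapped" are choices that change the values, so they are worth noting in the methods.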

I don't currently have a paper to show that you can't lock down your methods ahead of time, but I think this is the last response that I gave to a similar question (in terms of assessing your results, using biological knowledge and basic principles):

A: RNA-seq dispersion estimation

I also have these acknowledgements as a placeholder, but please be aware that I unfortunately can't support use of that code (in part because I have to go in and modify the code for each project, and in part because I need to work on figuring out a sustainable workload for myself, where I can focus on fewer projects more in-depth). Nevertheless, I think at least some papers show that you needed to use different p-value methods (even within one paper), so perhaps that can help. If the papers aren't directly linked in the acknowledgements, you can also view my Google Scholar profile for recent publications.