I have RNA-Seq data from human tissue samples, and I am interested in finding fusion transcripts specific to the disease case. There are many tools out there that report fusion transcripts, and I have come across the following two reviews (by the same group) that compare the different tools -
The problem is that there seems to be no indication of which tool is the best, and many of them perform very well on one dataset and then quite poorly on another. Also, there is VERY low consensus/overlap in the output of the different tools.
So, I wanted to know if anyone who has worked on this can recommend a good tool for this purpose or has any general insight about fusion transcript detection.
I agree with the previous comment from ATpoint.
Nonetheless, I have used STAR-Fusion and here are some points to keep in mind -
You need good sequencing depth in your RNA-seq library to detect fusions confidently. As a rough estimate, 60M or more reads of 2x100 bp paired-end data is a good starting point. If your data is ribo-depleted total RNA (and not poly-A selected), then a much larger library will probably be needed.
Complement the tool you have selected, such as STAR-Fusion, with another tool that uses a different strategy: for example JAFFA, which can use a hybrid approach of mapping plus assembly, or pizzly, which is based on pseudo-alignment.
If a fusion candidate has good read-depth support, there is a fair chance that other tools will pick it up as well, but this is not guaranteed.
That brings me to the last point: have a hypothesis when looking at the results. If a fusion is detected in gene X and you suspect that the fusion is 'activating' the gene, then you would expect the main protein domain of gene X to be retained in the fusion. Also, if the fusion junction uses known splice site(s), there is a better chance of it being biologically relevant.
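To make the last point concrete, here is a minimal Python sketch of the two checks: whether a domain of gene X survives the breakpoint, and whether the breakpoint sits on an annotated exon boundary. Everything in it is an assumption for illustration: the `Domain` class, the transcript-coordinate convention, the function names, and the numbers are made up and do not come from STAR-Fusion or any other tool's output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    name: str
    start: int  # first transcript position of the domain (1-based, inclusive)
    end: int    # last transcript position of the domain (1-based, inclusive)


def domain_retained(domain, breakpoint, partner_side):
    """Return True if the whole domain lies on the retained part of the transcript.

    Assumed convention (hypothetical):
      - for a 5' partner, `breakpoint` is the last retained position
      - for a 3' partner, `breakpoint` is the first retained position
    """
    if partner_side == "5prime":
        return domain.end <= breakpoint
    if partner_side == "3prime":
        return domain.start >= breakpoint
    raise ValueError("partner_side must be '5prime' or '3prime'")


def at_known_splice_site(breakpoint, exon_boundaries):
    """Return True if the breakpoint coincides with an annotated exon boundary."""
    return breakpoint in exon_boundaries


# Made-up example: gene X is the 3' partner and carries a kinase domain.
kinase = Domain("kinase", start=1200, end=2000)
exon_boundaries = {450, 900, 1100, 1500}  # hypothetical annotated boundaries

keeps_domain = domain_retained(kinase, 1100, "3prime")    # domain starts after the breakpoint
loses_domain = domain_retained(kinase, 1500, "3prime")    # breakpoint cuts into the domain
on_splice_site = at_known_splice_site(1100, exon_boundaries)

print(keeps_domain, loses_domain, on_splice_site)
```

In practice the domain coordinates and exon boundaries would come from your annotation of choice; the point is only that both checks reduce to simple coordinate comparisons once you have a breakpoint.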
<sarcasm> Welcome to the beautiful world of bioinformatics. </sarcasm> But seriously, this is a common problem in bioinformatics. The big problem is that datasets are complex and there is no gold-standard benchmark that can capture all the edge cases one might encounter in different datasets. Different software may produce very different results depending on the dataset and its quality. If you can, try different software, collect promising candidates, and validate them in the lab. From what I know, STAR-Fusion is often used (https://github.com/STAR-Fusion/STAR-Fusion/wiki), but I have no hands-on experience with it.
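One simple way to collect candidates across several tools is to tally how many tools report each gene pair. The sketch below assumes you have already parsed each tool's output into a list of `(gene, gene)` tuples yourself; the tool names, gene names, and the `tally_support` helper are hypothetical placeholders, and treating the pair as unordered is a simplifying assumption that ignores 5'/3' orientation.

```python
from collections import defaultdict


def tally_support(calls_by_tool):
    """Map each gene pair to the set of tools reporting it, most supported first."""
    support = defaultdict(set)
    for tool, pairs in calls_by_tool.items():
        for gene_a, gene_b in pairs:
            # Normalise case and ignore partner order so that
            # ("GENEA", "GENEB") and ("geneb", "genea") count as the same candidate.
            key = tuple(sorted((gene_a.upper(), gene_b.upper())))
            support[key].add(tool)
    # Sort by number of supporting tools (descending), then by gene names.
    return sorted(support.items(), key=lambda item: (-len(item[1]), item[0]))


# Made-up calls from three hypothetical tools.
calls = {
    "tool_1": [("GENEA", "GENEB"), ("GENEC", "GENED")],
    "tool_2": [("GENEB", "GENEA")],
    "tool_3": [("genea", "geneb"), ("GENEE", "GENEF")],
}

ranked = tally_support(calls)
for pair, tools in ranked:
    print("--".join(pair), len(tools), sorted(tools))
```

Candidates supported by only one tool are not necessarily wrong, given how low the overlap between tools tends to be, so the tally is best used to order candidates for lab validation rather than to discard them.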
Thanks for the comment and for the pointer to STAR-Fusion.