I'm hoping to get some clarity on the best approach for analyzing my RNA-seq data. I've read dozens of posts and articles on this subject, but I feel they've left me more confused than when I started!
The background for my project is that I have RNA-seq data for the same rat tissue type in males and females. What I would like to do is look at the differences in which transcripts are present in males versus females. That is, I'd like to see whether known transcripts are present in each sample, along with any novel transcripts/splice variants, and then compare my male and female tissues to see whether they contain different splice variants.
I understand that there are loads of different algorithms out there, each with its own advantages and disadvantages, and that each can be run in different modes. I have read that it's a good idea to run both de novo and genome-guided assemblies to get the most reliable results. (Combining de novo and genome guided transcriptome assembly for expression analysis?)
For genome-guided assembly, I have so far run Cufflinks, supplying BAM files previously aligned to the genome with Subread along with the known transcriptome GTF file for rat_rn6. This seems to give me what I'm looking for: a GTF file that documents both known transcripts and seemingly novel transcripts/splice variants. However, I understand that Cufflinks is not the best algorithm, so I have also run StringTie, again providing my aligned BAM and the rat-rn6.gtf transcriptome file. This also gives me an output GTF file, but it does not seem to contain the known transcripts as the Cufflinks output does. Instead I get lots of small fragments per gene, which is unlike what is shown in the StringTie paper. See the attached image for a comparison between the Cufflinks and StringTie results. I also tried Scripture, but when I run the whole genome through the latest version it gives me errors that I can't fix.
For de novo assembly, I am currently running Velvet/Oases (it takes a long time) and was considering also running IDBA-Tran and SOAPdenovo-Trans. For these I am supplying only my raw FASTQ reads. I am unable to use Trinity because I have no access to Linux or a server and because Galaxy Trinity doesn't seem to work properly.
So my (somewhat general) questions are:
- Am I applying the right algorithms to answer my question? Is running 2 x genome-guided (Cufflinks and StringTie) and 3 x de novo (Velvet/Oases, IDBA, SOAP) enough, or too much?
- Once I have the results from all these different algorithms, how can I interpret them to tell which known and novel transcripts are present in my samples, and which of those are shared between males and females?
- Is there an alternative to StringTie that runs in a similar genome-guided fashion to Cufflinks, or is there a way to get StringTie to output what I want? This is so that I have a couple of examples of transcripts obtained in a genome-guided fashion to set against the de novo ones.
- Are there any examples of workflows or pipelines out there that do what I'm trying to do? I have a well-annotated genome and an annotated transcriptome, and I want to know whether my samples contain these known transcripts as well as extras.
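To make the second question concrete, this is roughly the comparison I have in mind. It is only a minimal sketch, assuming each assembler gives me a standard 9-column GTF with `exon` records carrying a `transcript_id` attribute, and assuming that two transcripts count as "the same" when they share an identical intron chain on the same chromosome and strand. The function names (`intron_chains`, `compare`) are my own and do not come from any of the tools above, and single-exon transcripts are simply ignored here.

```python
import re
from collections import defaultdict

# Matches GTF attributes of the form: key "value"
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def intron_chains(gtf_lines):
    """Map transcript_id -> (chrom, strand, tuple of (intron_start, intron_end)).

    Only multi-exon transcripts are returned, since a single-exon
    transcript has no intron chain to compare.
    """
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        tid = dict(ATTR_RE.findall(cols[8])).get("transcript_id")
        if tid is None:
            continue
        exons[(cols[0], cols[6], tid)].append((int(cols[3]), int(cols[4])))

    chains = {}
    for (chrom, strand, tid), ex in exons.items():
        ex.sort()
        # An intron spans from the base after one exon to the base before the next.
        introns = tuple(
            (prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ex, ex[1:])
        )
        if introns:
            chains[tid] = (chrom, strand, introns)
    return chains


def compare(known, male, female):
    """Classify intron chains as known/novel and shared/sex-specific."""
    known_set = set(known.values())
    m, f = set(male.values()), set(female.values())
    shared = m & f
    return {
        "known_in_both": shared & known_set,
        "novel_in_both": shared - known_set,
        "male_only": m - f,
        "female_only": f - m,
    }
```

In practice I would call `intron_chains(open(path))` on the reference annotation and on the male and female assembly GTFs (whatever those files end up being called) and pass the three results to `compare`. Matching on intron chains rather than exact exon coordinates is a deliberate choice, so that small differences in the first and last exon boundaries do not make a known transcript look novel.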
I'm a bit overwhelmed and confused. A lot of posts/answers relate to analyses with no reference genome, which is not my case.