I'm hoping to get some clarity on the best approach for analyzing my RNA-seq data. I've have read dozens of posts and articles on this subject but I feel it's made me more confused than when I started!
The background for my project is that I have RNA-seq data for the same rat tissue type in males and females. What I would like to do is look at the difference between what transcripts are present between males and females. Ie I'd like to see if known transcripts are present as well as any novel transcripts/splice variants in each sample and then compare them between my male and female tissues to see if I get different splice variants etc between the tissues.
I understand that there are loads of different algorithms out there each with their own advantages and disadvantages and each can be run in different modes. I have read that it's a good idea to run both de novo and genome-guided assemblies to get the most reliable results. (Combining de novo and genome guided transcriptome assembly for expression analysis?)
For genome-guided assembly, so far I have run Cufflinks (for this I supplied bam files previously aligned to the genome with subread as well as the known transcriptome gtf file for rat_rn6). This seems to give me what I'm looking for in that it gives me a gtf file that documents both known transcripts as well as seemingly novel transcripts/splice variants. However, I understand that Cufflinks is not the best algorithm. For that reason I have also run Stringtie again providing my aligned bam and the rat-rn6.gtf transcriptome file. This also gives me an output gtf file but this does not seem to output known transcripts as Cufflinks does, I seem to just get loads of small fragments per gene which is unlike what is shown in the stringtie paper. See image attached for comparison between Cufflinks and Stringtie results. I tried Scripture too but it seems to give me errors which I can't seem to fix when trying to run the whole genome through the latest version of the algorithm. For de novo assembly, I am currently in the process of running Velvet/Oases as an example of de novo assembly (takes a long time) and was considering running IDBA-Tran and SOAPdenovo-Trans also. For these I am only supplying my raw fastq reads. I am unable to use Trinity due to lack of linux or a server and because Galaxy Trinity doesn't seem to work properly.
So my (somewhat general) questions are:
- Am I applying the right algorithms to get my answer? Is running 2 x genome-guided (Cufflinks and Stringtie) and 3 x de novo (Velvet/Oases, IDBA, SOAP) enough or too much for my question?
- Once I have the results from all these different algorithms, how can I interpret them to tell me whether I have known and novel common transcripts in my samples?
- Is there an alternative to Stringtie that runs in a similar genome-guided fashion to Cufflinks or is there a way to get Stringtie to output what I want? This is so I have a couple of examples of transcripts obtained in a genome-guided fashion vs de novo.
- Are there any examples of workflows or pipelines out there that address what I'm trying to address? I have a well annotated genome, I have an annotated transcriptome and I want to know if my samples contain these known transcripts as well as extras.
I'm a bit overwhelmed and confused. A lot of posts/answers are related to analyses with no reference genome. Not my case.
I recommend you this paper: Informatics for RNA Sequencing: A Web Resource for Analysis on the Cloud http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004393
In relation to your questions:
I work with RNA-seq Data and have never used these approaches. working with linux in a server really makes everything easier. I have aways aligned with STAR or Trinity (De Novo), quantified using Kallisto/RSEM and analysed using Bioconductor tools (R).
I have see people using velvet for genome but never saw for transcriptomics. And I am not a fan of Cufflinks.
Thank you for the reply. I took a look at the paper you referenced but it seems to be a little vague/skim over the bit that I'm most interested in. There also seems to be a lot of emphasis on Cufflinks in there. I'm actually at this stage not too concerned about the quantification of the different transcripts, at this stage I'd just like to visualize the data in a genome browser (as i have just a couple of genes that we're really interested in to focus in on). Just to clarify, I assume you use STAR for genome-guided and Trinity for De novo assembly of transcripts? And then do you combine the results of the 2 somehow to get a final list of transcripts that you have confidence are real? With your approach, do you get out of the end a set of transcripts that are known and also some that are novel? Presumably you have to take your De novo transcripts and map them back on to known transcripts somehow?
I would recommend IGV or Golden Helix Genome Browse, you can load your BAM file in both, the problm with golden helix is that you cant add your own reference, so if you do not work with mouse or human is better use IGV. Using the UCSC genome browse, the best way is to producing bigWig files, put it in a http server and load the tracks (this can be little overwhelming). There is also a good R package for visualization named gviz.
I only used STAR for mouse since the genome is very well annotated. I use trinity for some neglected disease without good genome reference.
If you want to map the trinity output back into your reference you can use BLAT.
I use StringTie without the
-eoption to enable known as well as novel transcript assembly. And it works well in picking out both. At least the known ones I can confirm. And for a cell line we had knocked out a gene by CRISPR, StringTie could assemble the affected transcript isoform. The cmd and the assembled isoforms -
(input is STAR aligned)
Top blue is the canonical form, green is the control cell line and the next two are effect of knock-down
Thank you for the reply. Do you have any experience using the newer version of Stringtie? It's just I seem to have used exactly the same parameters/commands as you and it gives me that output you see in the image above. I don't know why I'm not getting results like you have shown. I'm confused!
I have used the ver as noted in the code and older as well. All worked well. Haven't used newer ver. Maybe you could check the BAM file and see if you see spliced-alignment in IGV. A RNA-seq file aligned by a splice aware aligner, when loaded into IGV would show islands of reads linked by connecting bridges between the exons for any given gene. If you see this in your BAM on IGV then the transcript assembler should work. Also if you choose the Sashimi plot option in IGV you should see connecting loops (indicating split alignment).
So you hit the nail on the head. There was a lack of communication in our lab. I thought that my reads had been aligned (by my colleague) with Subread using the "subjunc" function. Turns out it hadn't and therefore didn't have exon junction info. Once aligned properly and checked as you suggested via IGV, it ran perfectly with stringtie and I can see the results I was after. Feel like a bit of a dunce but thank you for helping me get to the bottom of my problem!
hi Amitm, kindly check this error.i did not get it and solve it .
./stringtie G1_sorted.bam -B -o G1.gtf -G Triticum_aestivum.IWGSC.42.gtf -p 4 -C G1.refs.gtf -A G1.abund.tab -WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! Please make sure the -G annotation file uses the same naming convention for the genome sequences
Kindly set this command for my sample.
as it is written in the error message, check if -G annotation file uses the same naming convention for the genome sequences. If your bam files are aligned to a genome with
chrprefix then your GTF should also have
chr, it usually happens when the reads are aligned to a genome fasta from UCSC and the GTF is from ENSEMBL.