Tools To Do the Alternative Splicing Analysis
4
6
5.8 years ago
52Teth ▴ 70

I want to do alternative splicing analysis with a set of transcript sequences. The reference genome sequence is also known. Are there any tools available to do this?

RNA-Seq AS • 9.9k views
0

Can you elaborate on what you mean by "alternative splice analysis"? Generally, there are tools like Cufflinks/Cuffdiff for assembling transcripts from short reads and for differential analysis.

0

Hello, rMATS is a tool I am using; you can have a look.

20
2.9 years ago

Since you posted this with the tag RNA-seq, I will assume that you have RNA-seq data. Since you mention transcripts, I will also assume you have transcript-level quantifications (otherwise, you can read more about how to obtain such transcript-level quantification here). I will also assume that you have at least two conditions (since otherwise there is rarely a reason for doing RNA-seq).

Assuming I am right, there are 3 types of splicing analysis you can do:

Type 1: Exon based analysis:

The idea is simply that you analyze one exon at a time and see whether it is differentially used (compared to all other exons in that gene). According to, amongst others, this article, the best tool for doing this is DEXSeq (Bioconductor page, article link), which also allows visualisation of the changes. This is a powerful way of analysing the data, but the results can be hard to interpret.

Type 2: Splicing based analysis:

The idea with this type of analysis is to look at each splice event (exon skipping, alternative donor/acceptor, etc.) one at a time and see if there are systematic changes between conditions. The better tools for this are rMATS (as also mentioned by @MatthewP) and SUPPA2. This type of analysis is easier to interpret from a splicing perspective but harder to draw biological conclusions from.
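
To make the event-level idea concrete, here is a minimal sketch in plain Python (not rMATS or SUPPA2 themselves) of the percent-spliced-in (PSI) value that event-based tools commonly report for an exon-skipping event. The read counts and the helper names `psi` and `mean_psi` are made up for illustration, and the sketch leaves out the length normalisation and statistical testing that the real tools perform.

```python
def psi(inclusion_reads, skipping_reads):
    """Fraction of reads supporting exon inclusion (percent spliced in)."""
    total = inclusion_reads + skipping_reads
    if total == 0:
        return None  # event not covered by any read, PSI is undefined
    return inclusion_reads / total


def mean_psi(replicates):
    """Average PSI over replicates, skipping replicates with no coverage."""
    values = [psi(inc, skip) for inc, skip in replicates]
    values = [v for v in values if v is not None]
    return sum(values) / len(values)


# Hypothetical junction read counts for one skipped-exon event,
# given as (inclusion, skipping) pairs with one pair per replicate.
control = [(80, 20), (75, 25), (85, 15)]
treated = [(40, 60), (45, 55), (35, 65)]

delta_psi = mean_psi(treated) - mean_psi(control)
print(f"control PSI = {mean_psi(control):.2f}")
print(f"treated PSI = {mean_psi(treated):.2f}")
print(f"delta PSI   = {delta_psi:.2f}")
```

A large change in PSI between conditions says that the exon is included less (or more) often, which is easy to read as a splicing change but says little by itself about the resulting transcripts.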

There is also an extension of this analysis type which looks at groups of splice events and detects changes within each group. This can be done with tools such as LeafCutter (github, article; as also mentioned by @Prakash) or MAJIQ (more info here). These tools typically give more power but are even harder to interpret (beyond concluding that something about splicing changed).

Type 3: Transcript based analysis:

The idea with this type of analysis is to utilize the transcript-level quantification of the RNA-seq data to detect changes in which transcripts are used in the two conditions (isoform switches). Although this is harder from a computational point of view (fewer events are found), the biological interpretation is a lot easier because you know the entire transcript. According to this paper, DEXSeq (adapted to transcript expression) again seems to be the best tool to find such changes.

For the biological interpretation of such isoform switches I have created an R package called IsoformSwitchAnalyzeR. IsoformSwitchAnalyzeR enables identification and analysis of alternative splicing as well as isoform switches with predicted functional consequences (such as gain/loss of protein domains) from quantification by Kallisto, Salmon, Cufflinks/Cuffdiff, RSEM, etc.

IsoformSwitchAnalyzeR allows analysis and visualization of individual genes as well as genome-wide analysis of changes in both splicing and isoform switch consequences. You can see examples of the analysis types available here. For more info, you can see how the analysis of switch consequences can be used in this article, and for more on the genome-wide analysis take a look at this paper.

Cheers

Kristoffer

2

Kristoffer, can I ask why you don't mention Sleuth and Ballgown which, as far as I understand, were developed for a transcript based differential expression analysis starting with transcript quantifications? Thanks.

4

Hi Chris

I did not mention those tools because differential transcript expression (DTE) is not well suited to analysing splicing. The problem is that DTE tools do not take the expression of the parent gene into account. An easy example is a gene with two isoforms where both are equally upregulated: both will be DTE, but nothing has changed with regard to alternative splicing. Therefore isoform switches (the transcript-based analysis) are better suited for splicing analysis.
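
The two-isoform example can be sketched with a few lines of plain Python. This is only an illustration with made-up TPM values, not the statistics used by any of the tools mentioned; `isoform_fractions` and `describe` are hypothetical helper names, and "dIF" is used here simply to mean the difference in isoform fraction between conditions.

```python
def isoform_fractions(expression):
    """Return each isoform's share of its parent gene's total expression."""
    gene_total = sum(expression.values())  # assumes the gene is expressed (total > 0)
    return {iso: value / gene_total for iso, value in expression.items()}


# Hypothetical TPM values for one gene with two isoforms.
control = {"iso_A": 10.0, "iso_B": 30.0}
both_up = {"iso_A": 20.0, "iso_B": 60.0}   # both isoforms doubled
switched = {"iso_A": 60.0, "iso_B": 20.0}  # gene total doubled too, but usage flipped


def describe(label, condition):
    """Print fold change and change in isoform fraction relative to control."""
    if_control = isoform_fractions(control)
    if_condition = isoform_fractions(condition)
    for iso in control:
        fold_change = condition[iso] / control[iso]
        d_if = if_condition[iso] - if_control[iso]
        print(f"{label} {iso}: fold change = {fold_change:.2f}, dIF = {d_if:+.2f}")


describe("both_up ", both_up)
describe("switched", switched)
```

In the `both_up` case every isoform has a fold change of 2 while its share of the gene is unchanged, so a DTE test would flag both isoforms even though splicing is the same. Only the `switched` case shows a change in isoform usage.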

And btw, with regard to ballgown, please see this answer (TL;DR: don't use it :-) ).

0

I see. That is also interesting about ballgown.

I like your package btw, but the vignette and manual contain many spelling errors that you should correct.

0

Thank you so much for this! You are an angel! One question, how do we import our data into R if we do transcript quantification on a cluster? The resulting transcript quantification is very large and can't be imported outside of the cluster and R analysis is much easier outside of a cluster.

0

I would just use the save() or saveRDS() function to save the R object(s), transfer them and then use load() or readRDS() to open them on your local computer.

0

Hi Kristoffer,

A question: are these analyses possible for a non-model organism? I have a genome with a gff3 file but no transcriptome, so I did a de novo assembly using Trinity. I mapped the Trinity transcriptome to the genome using GMAP to get a gff3 file. My quantification was made using kallisto with an index built from the Trinity transcriptome.

Is the transcript-based analysis possible using this data? And what annotation file should I use: the one from my genome or the one from my transcriptome?

I tried to run the analysis using IsoformSwitchAnalyzeR (version 1.11.3) with my Trinity transcriptome, but I am getting an error (I have already tried all the recommendations mentioned in the error message; nothing worked):

• importIsoformExpression() works fine to load the kallisto data.
• I converted my Trinity gff3 file to a gtf.
• I ran importRdata() as follows:

aSwitchList <- importRdata(
    isoformCountMatrix   = kallisto_quant$counts,
    isoformRepExpression = kallisto_quant$abundance,
    designMatrix         = design_mat,
    isoformExonAnnoation = "trinity_new.gtf",
    isoformNtFasta       = "../map_deg_to_genome/Trinity-GG.fasta"
)

This gives me the following error:

Step 1 of 6: Checking data...
Step 2 of 6: Obtaining annotation...
importing GTF (this may take a while)
Error in importRdata(isoformCountMatrix = kallisto_quant$counts, isoformRepExpression = kallisto_quant$abundance,  :
The annotation and quantification (count/abundance matrix and isoform annotation) seems to be different (Jaccard similarity < 0.925).
Either isforoms found in the annotation are not quantifed or vise versa.
Specifically:
214064 isoforms were quantified.
193805 isoforms are annotated.
Only 193805 overlap.
20259 isoforms quantifed isoforms had no corresponding annotation

This combination cannot be analyzed since it will cause discrepancies between quantification and annotation thereby skewing all analysis.

If there is no overlap (as in zero or close) there are two options:
1) The files do not fit together (e.g. different databases, versions, etc) (no fix except using propperly paired files).
2) It is somthing to do with how the isoform ids are stored in the different files. This problem might be solvable using some of the 'ignoreAfterBar', 'ignoreAfterSpace' or 'ignoreAfterPeriod' arguments.
Examples from expression matrix are : TRINITY_GG_991_c2534_g1_i1, TRINITY_GG_1608_c197_g1_i3, TRINITY_GG_1283_c240_g1_i4
Examples of annoation are : TRINITY_GG_1111_c613_g1_i2, TRINITY_GG_1949_c62_g1_i2, TRINITY_GG_10_c1064_g1_i1
Examples of isoforms which were only found im the quantification are  : TRINITY_GG_1361_c593_g1_i6, TRINITY_GG_936_c25_g1_i3, TRINITY_GG_1822_c46_g1_i1


If there is a large overlap but still far from complete there are 3 possibilites:
1) The files do not fit together (e.g different databases versions etc.) (no fix except using propperly paired files).
2) If you are using Ensembl data you have supplied the GTF without phaplotyps. You need to supply the <ensembl_version>.chr_patch_hapl_scaff.gtf file - NOT the <ensembl_version>.chr.gtf
3) One file could contain non-chanonical chromosomes while the other do not (might be solved using the 'removeNonConvensionalChr' argument.)
4) It is somthing to do with how a subset of the isoform ids are stored in the different files. This problem might be solvable using some of t

Please let me know if this type of analysis is possible with my data or if I am doing something wrong that is giving me the error.
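
For readers puzzling over the numbers in the error message, here is a small plain-Python sketch of the check it describes. It assumes the package uses the standard set-based Jaccard similarity (intersection size divided by union size). The counts are copied from the message above, while the three isoform IDs in the toy example and the `jaccard` helper are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of IDs: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Counts copied from the error message.
n_quantified = 214064
n_annotated = 193805
n_overlap = 193805

# Size of the union follows from inclusion-exclusion.
n_union = n_quantified + n_annotated - n_overlap
similarity = n_overlap / n_union
print(f"Jaccard similarity = {similarity:.3f} (cutoff in the message: 0.925)")

# Toy example (hypothetical IDs) of listing quantified isoforms without annotation.
quantified_ids = ["TRINITY_GG_1_c0_g1_i1", "TRINITY_GG_1_c0_g1_i2", "TRINITY_GG_2_c0_g1_i1"]
annotated_ids = ["TRINITY_GG_1_c0_g1_i1", "TRINITY_GG_2_c0_g1_i1"]
missing = sorted(set(quantified_ids) - set(annotated_ids))
print("quantified but not annotated:", missing)
print(f"toy Jaccard = {jaccard(quantified_ids, annotated_ids):.3f}")
```

Every annotated isoform was quantified, so the whole shortfall comes from the 20259 quantified isoforms that are absent from the GTF. One possible explanation, not confirmed in this thread, is that those Trinity transcripts were not aligned by GMAP and therefore never made it into the converted annotation.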

Thank you!

3
2.9 years ago
Prakash ★ 2.1k

I would like to add one more recently released tool: LeafCutter ("Annotation-free quantification of RNA splicing using LeafCutter").

1
2.9 years ago
chris86 ▴ 370

You have a choice whether or not to build new transcript models; this depends on whether you want to try to find new transcripts. Otherwise you can just use the latest annotations in the Ensembl GTF file. Once you have worked out which way to go, there are a few different pipelines you can use. Some tools do transcript-level DE, others do exon-level DE; both come at the splicing question from a slightly different direction. Here are some workflows:

HISAT2, StringTie, and ballgown # generates new transcript models and does transcript-level DE
HISAT2, StringTie, and DEXSeq # generates new transcript models and does exon-level DE
HISAT2 and DEXSeq # original transcript models and exon-level DE
Salmon/Kallisto and sleuth # original transcript models and transcript-level DE (could generate a new transcriptome here as well)

1
2.9 years ago

I think there are a lot of good suggestions, but here is one more option that I don't believe has been mentioned yet:

QoRTs (quantification) + JunctionSeq (visualization and differential splicing, derived from DEXSeq)

While I think this is arguably a less popular option, it is actually my preferred choice for initial analysis. You need replicates, but I think that is even more important for junction/exon analysis than for gene-level analysis.

0

That seems pretty useful (nice extension of the exon-based approach I described above) - and the differential analysis of both junction and exons uses DEXSeq as the backbone? Does it support counts from other tools than QoRTs?

1

Dispersions are estimated separately for exons and junctions.

I've only tested it with QoRTs (they were developed by the same person).

So, I think the quantification answer is "no," but it would be better to check with the developer:

https://github.com/hartleys/JunctionSeq/issues

There is also another site worth checking out for documentation: http://hartleys.github.io/JunctionSeq/index.html