What do the headers of the rMATS output files mean?
4
2
4.4 years ago
atcggcta ▴ 50

Hello!

I'll start by saying I'm quite new to MATS and to much of bioinformatics in general, so please excuse any ignorance I may be showing.

I have a question concerning the columns in the MATS output. There are a few headers that I don't understand the meaning of. This is an example of the headers in my output of SE detection.

ID | GeneID | geneSymbol | chr | strand | exonStart_0base | exonEnd | upstreamES | upstreamEE | downstreamES | downstreamEE | ID | IC_SAMPLE_1 | SC_SAMPLE_1 | IC_SAMPLE_2 | SC_SAMPLE_2 | IncFormLen | SkipFormLen | PValue | FDR | IncLevel1 | IncLevel2 | IncLevelDifference


I'm having trouble figuring out what many of these headers mean. Going left to right, I understand everything up to "upstreamES"; from there on I'm confused. I've tried looking for a description of this table online and have found no useful results (just people with the same question), so even just pointing me in the right direction would be appreciated.

So my question is:

What do the headers in columns 8 through 23 mean?

OR

Where can I look to find descriptions of what the MATS output means?

MATS Alternative splicing next-gen STAR • 11k views
26
4.1 years ago
caggtaagtat ★ 1.5k

Part 1/2

Hi,

Although it's probably too late, and I'm also just starting with rMATS, I think I can try to explain those columns.

One row in the SE file describes one exon that was skipped at least once in one of the two samples (either in condition 1 or in condition 2).

upstreamES and upstreamEE

The column upstreamES stands for upstreamExonStart and the column upstreamEE stands for upstreamExonEnd; the same applies to the downstream exon. These columns hold the chromosomal positions of the upstream end (ExonStart) and the downstream end (ExonEnd) of the flanking exons.

Figure: positions of the different exon borders. The exon in the middle is the exon described by a row of the SE file of the rMATS output.
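To make the relationship between the six coordinate columns concrete, here is a minimal sketch. It assumes that all start columns are 0-based and all end columns are 1-based (as the name exonStart_0base suggests), so that a length is simply end minus start, and that the columns are listed in genomic order. The function name and the example numbers are made up for illustration.

```python
def describe_se_event(row):
    """Summarise the coordinates of one SE row (dict of column name -> value)."""
    up_start, up_end = int(row["upstreamES"]), int(row["upstreamEE"])
    ex_start, ex_end = int(row["exonStart_0base"]), int(row["exonEnd"])
    down_start, down_end = int(row["downstreamES"]), int(row["downstreamEE"])
    return {
        # Assumption: 0-based start, 1-based end, so length = end - start.
        "skipped_exon_length": ex_end - ex_start,
        "upstream_intron_length": ex_start - up_end,
        "downstream_intron_length": down_start - ex_end,
        # Assumption: upstream exon, skipped exon, downstream exon in genomic order.
        "in_genomic_order": up_start < up_end <= ex_start < ex_end <= down_start < down_end,
    }

# Hypothetical row, not taken from a real rMATS run.
example_row = {
    "upstreamES": "1000", "upstreamEE": "1100",
    "exonStart_0base": "1500", "exonEnd": "1600",
    "downstreamES": "2000", "downstreamEE": "2150",
}
summary = describe_se_event(example_row)
print(summary)
```

If in_genomic_order comes out False for a row, it is worth checking whether the row really comes from the SE file, since other event types use different coordinate columns.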

IC_SAMPLE_1 and SC_SAMPLE_1

The column IC_SAMPLE_1 holds the number of reads (of sample 1) that were assigned to inclusion events, i.e. reads indicating that the respective exon would be present in the final processed mRNA transcript after splicing.

The column SC_SAMPLE_1 holds the number of reads (of sample 1) that were assigned to skipping events, i.e. reads indicating that the respective exon was skipped.

If you have replicates of your sample 1, the respective read counts will be separated by commas in those columns. This is described in the rMATS paper with the following image.

Image 1 of the rMATS paper (http://www.pnas.org/content/111/51/E5593.full.pdf?with-ds=yes), where I marks reads that are counted as inclusion events and S marks reads that are counted as skipping events.
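Since replicate counts share one cell, the cell has to be split before doing any arithmetic. A minimal sketch (the helper name parse_replicates is made up, and it assumes the cells contain only integers and commas):

```python
def parse_replicates(cell):
    """Turn a count cell such as '2031,1593,1990' into a list of integers."""
    return [int(value) for value in cell.split(",") if value != ""]

inclusion_counts = parse_replicates("2031,1593,1990")  # three replicates
single_replicate = parse_replicates("33")               # no replicates
print(inclusion_counts, single_replicate)
```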

IncFormLen and SkipFormLen

According to the "SI materials and methods" part of the rMATS paper (http://www.pnas.org/content/suppl/2014/10/14/1415762111.DCSupplemental/pnas.1415762111.sapp.pdf), these two columns are only used to normalize the isoform-specific read counts, meaning they are used to calculate the columns IncLevel1 and IncLevel2, like this:

ψ = (I/LI) / (I/LI + S/LS)

where: ψ = inclusion level (IncLevel), I = number of reads mapped to the exon inclusion isoform (IC_SAMPLE_1), S = number of reads mapped to the exon skipping isoform (SC_SAMPLE_1), LI = effective length of the exon inclusion isoform (IncFormLen), LS = effective length of the exon skipping isoform (SkipFormLen)

The columns are calculated like this:
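The formula for ψ can be written as a short function. This is only a sketch of the equation as given in this post, for a single replicate; the function name is made up, and returning None when there are no reads at all is my own choice, not necessarily what rMATS does.

```python
def inclusion_level(inc_count, skip_count, inc_form_len, skip_form_len):
    """psi = (I/LI) / (I/LI + S/LS) for one replicate."""
    inc = inc_count / inc_form_len      # length-normalized inclusion reads
    skip = skip_count / skip_form_len   # length-normalized skipping reads
    if inc + skip == 0:
        return None  # no reads support either isoform (assumed handling)
    return inc / (inc + skip)

# Hypothetical counts: 100 inclusion reads, 50 skipping reads.
psi = inclusion_level(100, 50, inc_form_len=172, skip_form_len=86)
print(psi)
```

With these numbers the inclusion isoform has twice the reads but also twice the effective length, so after normalization both isoforms contribute equally.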

LI = 2( j - r + 1 )

LS = j - r + 1

where: j = junction length, r = read length of your rMATS experiment, LI = effective length of the exon inclusion isoform (IncFormLen), LS = effective length of the exon skipping isoform (SkipFormLen)

What's the junction length j though? According to Shen Shihao, it's "the overall junction region covered by reads across junctions, affected by read length, anchor length (by default 8 bps in both upstream and downstream exons) and exon length."

j is calculated as (read length - anchor) * 2

"If the exon is shorter than (read length - anchor), the junction length will be reduced."

An example calculation of the columns IncFormLen and SkipFormLen can be found in the following Google Group conversation: https://groups.google.com/forum/#!topic/rmats-user-group/d7rzUBKXF1U
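Putting the three formulas from this section together gives the sketch below. It only covers the case described here: the exon is at least (read length - anchor) long, and the definitions are the ones quoted in this post, which may differ between rMATS versions. The function name is made up.

```python
def effective_lengths(read_length, anchor=8):
    """Return (j, IncFormLen, SkipFormLen) following the formulas in this post.

    Assumes the skipped exon is at least (read_length - anchor) long, so the
    junction length is not reduced.
    """
    junction_length = (read_length - anchor) * 2           # j
    skip_form_len = junction_length - read_length + 1      # LS = j - r + 1
    inc_form_len = 2 * skip_form_len                       # LI = 2(j - r + 1)
    return junction_length, inc_form_len, skip_form_len

# Example with a read length of 101 and the default anchor of 8.
j, li, ls = effective_lengths(101)
print(j, li, ls)
```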

PValue and FDR

The PValue column is described in the rMATS paper like this:

"rMATS uses a likelihood-ratio test to calculate the P value that the difference in the mean ψ values between two sample groups exceeds a given threshold"

The documentation of rMATS states that the statistics module of rMATS calculates the P-value (PValue) and the false discovery rate (FDR) that the difference in the isoform ratio of a gene between two conditions exceeds a given user-defined threshold.

In other words, a row with, e.g., a PValue of 0.0001 and an FDR close to zero indicates a statistically highly significant difference between the columns IncLevel1 and IncLevel2 of that row.

Be aware that in some cases Excel or, e.g., RStudio will show you an entry of zero when the respective value in those columns is lower than 2.2e-16.
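If you filter the table in a script rather than in a spreadsheet, very small values are parsed as ordinary floating-point numbers. A minimal sketch using only the Python standard library; the function name, the cutoff of 0.05, and the three-row excerpt are made up for illustration, and real files have many more columns.

```python
import csv
import io

def events_below_fdr(handle, max_fdr=0.05):
    """Return the rows of a tab-separated rMATS table with FDR <= max_fdr."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["FDR"]) <= max_fdr]

# Hypothetical excerpt, not real rMATS output.
table = "\n".join([
    "\t".join(["geneSymbol", "PValue", "FDR"]),
    "\t".join(["GENE_A", "1e-20", "3e-17"]),
    "\t".join(["GENE_B", "0.2", "0.6"]),
    "\t".join(["GENE_C", "0.0001", "0.004"]),
])
hits = events_below_fdr(io.StringIO(table))
print([row["geneSymbol"] for row in hits])
```

For a real file, the io.StringIO object would be replaced by an opened file handle.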

2

In a recent post I found in their Google group, it seems they made some changes to the calculation of the effective length since 3.2.X, in order to cope with reads spanning multiple exons. The new definition seems to be read_length - 1. https://groups.google.com/forum/#!searchin/rmats-user-group/read$20length$203.2.X%7Csort:date/rmats-user-group/DeTfsq3Llbw/J7vUBe2DAAAJ

0

I am using rMATS 4.0.2 and the header of the results has changed. Does this explanation also apply to the updated version? I just want to understand these column names: longExonStart_0base, longExonEnd, shortES, shortEE, flankingES, flankingEE.

According to the explanation above, the skipped exon should lie between upstreamES and downstreamES, but I didn't find this pattern, as in this example: ENSG00000141480 ARRB2 chr17 + 4715150 4715336 4715199 4715336 4715012 4715043

0

This question was about the SE file, which contains information about differences in exon skipping. I only know the columns you named from the ASS files, which concern alternative splice site usage. By definition, alternative splice site usage leads to either a longer or a shorter exon, depending on the position of the alternatively used splice site.

14
4.1 years ago
caggtaagtat ★ 1.5k

Part 2/2

IncLevel1 and IncLevel2

As already mentioned above, IncLevel1 and IncLevel2 are calculated like this:

ψ = (I/LI) / (I/LI + S/LS)

where: ψ = inclusion level (IncLevel1), I = number of reads mapped to the exon inclusion isoform (IC_SAMPLE_1), S = number of reads mapped to the exon skipping isoform (SC_SAMPLE_1), LI = effective length of the exon inclusion isoform (IncFormLen), LS = effective length of the exon skipping isoform (SkipFormLen)

Looking at the equation above, we see that the column IncLevel1 tells us how often, on average, the respective exon was included in the final mRNA transcripts in sample 1.

The entry in IncLevel can be seen as the fraction (between 0 and 1) of normalized read counts in the neighborhood of the respective exon that indicate a splicing event in which the exon was included in the final processed mRNA transcript.

If the entry is 1, the exon was never skipped. Actually, you should not be able to find any exon in your SE output file of rMATS that has an entry of 1 in both columns IncLevel1 and IncLevel2. However, I think that due to rounding errors you can still find a few of those exons. Nevertheless, by definition of the SE output file, there are definitely no exons that have zero reads in both columns SC_SAMPLE_1 and SC_SAMPLE_2.

Example of a low (purple) and high (red) entry in IncLevel, from the documentation of the rmats2sashimiplot tool

IncLevelDifference

The column IncLevelDifference just holds the difference between the IncLevel columns. It's calculated by:

IncLevelDifference = IncLevel1 - IncLevel2

Therefore, if the result is greater than 0, the entry in IncLevel2 is lower than in IncLevel1 and if it is lower than zero, the entry in IncLevel1 is lower than in IncLevel2.
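With replicates, IncLevel1 and IncLevel2 each contain several comma-separated values, so the subtraction needs one number per group. The sketch below assumes the replicates are averaged first (that is my understanding of how rMATS reports this column) and that missing replicates appear as NA; both are assumptions, and the function names are made up.

```python
def mean_inc_level(cell):
    """Average the comma-separated inclusion levels in one cell, skipping NA."""
    values = [float(v) for v in cell.split(",") if v not in ("", "NA")]
    return sum(values) / len(values) if values else None

def inc_level_difference(inc_level1, inc_level2):
    """mean(IncLevel1) - mean(IncLevel2), or None if a group has no values."""
    mean1, mean2 = mean_inc_level(inc_level1), mean_inc_level(inc_level2)
    if mean1 is None or mean2 is None:
        return None
    return mean1 - mean2

# Hypothetical cells: the exon is included more often in sample 1.
diff = inc_level_difference("0.9,0.8,0.7", "0.3,0.2,0.1")
print(round(diff, 3))
```

A positive result here means higher inclusion in sample 1, matching the sign convention described above.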

OK but what's with the other files?

The columns of the other rMATS output files are not very different, and their meaning is mostly self-explanatory.

However, in the file of alternatively used 5' splice sites, for example, you can find the columns IC_SAMPLE_1 and SC_SAMPLE_1 as well. In this case, those columns of course do not refer to the inclusion or skipping of an exon, but to the inclusion or skipping of a splice donor (5'ss). Therefore, the column IC_SAMPLE_1 states the read counts where the respective 5'ss was used, and the column SC_SAMPLE_1 states the read counts where the respective 5'ss was not used.

List of sources

User guide to rMATS: http://rnaseq-mats.sourceforge.net/user_guide.htm

rMATS paper + SI materials and methods:

http://www.pnas.org/content/111/51/E5593.full.pdf?with-ds=yes

http://www.pnas.org/content/suppl/2014/10/14/1415762111.DCSupplemental/pnas.1415762111.sapp.pdf

All information is subject to change!

Since I have just started using rMATS, I may have misunderstood something. Any correction or addition you can contribute will be gratefully appreciated!

1

Perfect explanation! It cannot get more accurate. It could get more confusing (but helpful) if you started explaining the other events and how they are calculated.

1

Wow! That is extremely helpful and detailed! I did end up doing more research on this myself, and everything here seems to agree with my understanding. If I find any more useful information I will post it here.

Thank you 1000 times over!

0
4.1 years ago
Rose ▴ 10

The explanation was really helpful, but I have another question. I ran rMATS on two samples with triplicates and got the outputs. In one of the output files, SE.MATS.JunctionCountOnly.txt, some of the values in IJC_SAMPLE_1, SJC_SAMPLE_1, IJC_SAMPLE_2 and SJC_SAMPLE_2 were in the form given below.

IJC_SAMPLE_1 | SJC_SAMPLE_1 | IJC_SAMPLE_2 | SJC_SAMPLE_2
205165220 | 2031,1593,1990 | 33,34,26 | 1510,1453,1357
1919,2572,2478 | 211264279 | 2879,2485,2185 | 113,91,83
105128143 | 679791704 | 30,35,35 | 752854767
289335392 | 135162162 | 275275280 | 36,23,26

Some of them are separated by commas, meaning that they are the inclusion and skipping junction counts for the triplicates. But what about the single values like 211264279?

0

This could be an issue with the program you opened the file with. Is it still there if you open the file with a plain text editor instead of, e.g., Excel?
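One possible explanation (an assumption, since I have not seen the file): a spreadsheet program may read a cell like 205,165,220 as a single number with thousands separators and display it as 205165220, whereas a cell like 2031,1593,1990 does not look like such a number and stays as text. Reading the file as plain text avoids this. A minimal sketch with the Python standard library; the two-column excerpt is a guess at what the raw file might contain, not real data.

```python
import csv
import io

# Hypothetical raw content of the tab-separated file.
raw = "IJC_SAMPLE_1\tSJC_SAMPLE_1\n205,165,220\t2031,1593,1990\n"

rows = list(csv.DictReader(io.StringIO(raw), delimiter="\t"))
# Each cell is still a plain string, so the commas are preserved.
ijc = [int(x) for x in rows[0]["IJC_SAMPLE_1"].split(",")]
sjc = [int(x) for x in rows[0]["SJC_SAMPLE_1"].split(",")]
print(ijc, sjc)
```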

0
3.0 years ago
drskm7 • 0

Why is the term NOVEL used for SE/MXE/A3SS/A5SS/IR, and where does this novel list come from in the rMATS results?