Question: what do the headers of the rMATS output files mean?
atcggcta wrote (13 months ago):

Hello!

I'll start by saying I'm quite new to MATS and to much of bioinformatics in general, so please excuse any ignorance I may be showing.

I have a question concerning the columns in the MATS output. There are a few headers that I don't understand the meaning of. This is an example of the headers in my output of SE detection.

ID  | GeneID | geneSymbol | chr | strand | exonStart_0base | exonEnd | upstreamES | upstreamEE| downstreamES | downstreamEE | ID | IC_SAMPLE_1 | SC_SAMPLE_1 | IC_SAMPLE_2 | SC_SAMPLE_2 | IncFormLen | SkipFormLen | PValue | FDR | IncLevel1 | IncLevel2 | IncLevelDifference |

I'm having trouble figuring out what many of these headers mean. Going left to right, I understand everything up to "upstreamES", but from there on I'm confused. I've tried looking for a description of this table online and have found no useful results (just people with the same question), so even just pointing me in the right direction would be appreciated.

So my question is:

What do the headers in columns 8 through 23 mean?

OR

Where can I look to find descriptions of what the MATS output means?

Thank you in advance!

caggtaagtat wrote (10 months ago):

Part 1/2

Hi,

Although it's probably too late, and I'm also just starting with rMATS, I think I can try to explain those columns:

One row in the SE file describes one exon that was skipped at least once in one of the two samples (either at condition 1 or at condition 2).


upstreamES and upstreamEE

The column upstreamES stands for upstreamExonStart and the column upstreamEE stands for upstreamExonEnd; the same naming applies to the downstream exon (downstreamES and downstreamEE). These columns hold the chromosomal position of the nucleotide at the upstream end (ExonStart) or downstream end (ExonEnd) of each flanking exon.

[Image: skipped exon. Positions of the different exon borders. The exon in the middle is the exon described by a row of the SE file of the rMATS output.]


IC_SAMPLE_1 and SC_SAMPLE_1

The column IC_SAMPLE_1 holds the number of reads (of sample 1) that were assigned to events in which the respective exon was included, meaning that the exon would be present in the final processed mRNA transcript after splicing.

The column SC_SAMPLE_1 holds the number of reads (of sample 1) that were assigned to events in which the respective exon appeared to be skipped.

If you have replicates of your sample 1, the respective read counts will be separated by commas in those columns. It's described in the rMATS paper with the following image.
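To make the replicate format concrete, here is a minimal sketch of how such a comma-separated count field could be split into one integer per replicate. The function name `parse_counts` and the example values are only illustrative; they are not part of rMATS.

```python
def parse_counts(field):
    """Split a comma-separated count field into a list of ints, one per replicate."""
    return [int(x) for x in field.split(",") if x != ""]


# Illustrative values for three replicates of sample 1
ic_sample_1 = parse_counts("2031,1593,1990")
sc_sample_1 = parse_counts("33,34,26")

print(ic_sample_1)
print(sc_sample_1)
```

A field with a single number (no replicates) simply yields a one-element list.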

[Image: Image 1 of the rMATS paper (http://www.pnas.org/content/111/51/E5593.full.pdf?with-ds=yes), where I stands for reads that are counted as inclusion events and S stands for reads that are counted as skipping events.]


IncFormLen and SkipFormLen

According to the "SI materials and methods" part of the rMATS paper (http://www.pnas.org/content/suppl/2014/10/14/1415762111.DCSupplemental/pnas.1415762111.sapp.pdf), these two columns are only used to normalize the isoform-specific read counts, meaning they are used to calculate the columns IncLevel1 and IncLevel2, like this:

ψ = (I/LI) / (I/LI + S/LS)

where: ψ = Inclusion Level (IncLevel), I = number of reads mapped to the exon inclusion isoform (IC_SAMPLE_1), S = number of reads mapped to the exon skipping isoform (SC_SAMPLE_1), LI = effective length of the exon inclusion isoform (IncFormLen), LS = effective length of the exon skipping isoform (SkipFormLen)

The columns IncFormLen and SkipFormLen are calculated like this:

LI = 2( j - r + 1 )

LS = j - r + 1

where: j = junction length, r = read length of your rMATS experiment, LI = effective length of the exon inclusion isoform (IncFormLen), LS = effective length of the exon skipping isoform (SkipFormLen)

What's the junction length j though? According to Shen Shihao, it's "the overall junction region covered by reads across junctions, affected by read length, anchor length (by default 8 bps in both upstream and downstream exons) and exon length."

j is calculated as (read length - anchor)*2

"If the exon is shorter than (read length - anchor), the junction length will be reduced."

An example calculation of the columns IncFormLen and SkipFormLen can be found in the following Google Group conversation: https://groups.google.com/forum/#!topic/rmats-user-group/d7rzUBKXF1U
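As a rough worked example of the formulas above, the following sketch computes j, IncFormLen and SkipFormLen from a read length and an anchor length. It assumes the simple case only: it does not handle the reduction for short exons mentioned in the quote, and the function name `effective_lengths` is made up for illustration.

```python
def effective_lengths(read_length, anchor=8):
    """Return (j, IncFormLen, SkipFormLen) for the simple case described above.

    Assumption: the exon is at least (read_length - anchor) long, so the
    junction length is not reduced.
    """
    junction_length = 2 * (read_length - anchor)        # j = (read length - anchor) * 2
    skip_form_len = junction_length - read_length + 1   # LS = j - r + 1
    inc_form_len = 2 * skip_form_len                    # LI = 2 (j - r + 1)
    return junction_length, inc_form_len, skip_form_len


j, inc_len, skip_len = effective_lengths(read_length=100)
print(j, inc_len, skip_len)  # 184 170 85
```

So with 100 bp reads and the default anchor of 8, this gives j = 184, IncFormLen = 170 and SkipFormLen = 85.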


PValue and FDR

The PValue column is described in the rMATS paper like this:

"rMATS uses a likelihood-ratio test to calculate the P value that the difference in the mean ψ values between two sample groups exceeds a given threshold"

The rMATS documentation states that the statistics module of rMATS calculates the P-value (PValue) and the false discovery rate (FDR) that the difference in the isoform ratio of a gene between two conditions exceeds a given user-defined threshold.

This means that a row with, e.g., a PValue entry of 0.0001 and an FDR entry close to zero shows a statistically highly significant difference between the columns IncLevel1 and IncLevel2 of that row.

Be aware that in some cases Excel or RStudio will show you an entry of zero when the respective value in those columns is lower than 2.2e-16.
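If you want to avoid display issues like that, one option is to read the file yourself and filter on the FDR column. This is only a sketch using Python's standard csv module: the two-row table is made up, it contains just a few of the real columns, and the cutoff of 0.05 is an assumption, not an rMATS default.

```python
import csv
import io

# Made-up, tab-separated excerpt; a real SE file has many more columns.
table = (
    "ID\tPValue\tFDR\tIncLevelDifference\n"
    "1\t1e-300\t0.0\t0.35\n"
    "2\t0.2\t0.6\t-0.02\n"
)


def significant_rows(handle, fdr_cutoff=0.05):
    """Return the rows whose FDR is at or below the cutoff."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if float(row["FDR"]) <= fdr_cutoff]


hits = significant_rows(io.StringIO(table))
for row in hits:
    # Very small P values are parsed as small floats, not as zero
    print(row["ID"], float(row["PValue"]))
```

For a real file you would pass an open file handle instead of the `io.StringIO` object.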

In a recent post I found in their Google group, it seems they made some changes to the calculation of the effective length since 3.2.X, in order to cope with reads spanning multiple exons. The new definition seems to be read_length-1. https://groups.google.com/forum/#!searchin/rmats-user-group/read$20length$203.2.X%7Csort:date/rmats-user-group/DeTfsq3Llbw/J7vUBe2DAAAJ

Reply written 3 months ago by alexyfyf10
caggtaagtat wrote (10 months ago):

Part 2/2

IncLevel1 and IncLevel2

As already mentioned above, IncLevel1 and IncLevel2 are calculated like this:

ψ = (I/LI) / (I/LI + S/LS)

where: ψ = Inclusion Level (IncLevel1), I = number of reads mapped to the exon inclusion isoform (IC_SAMPLE_1), S = number of reads mapped to the exon skipping isoform (SC_SAMPLE_1), LI = effective length of the exon inclusion isoform (IncFormLen), LS = effective length of the exon skipping isoform (SkipFormLen)

Looking at the equation above, we see that the column IncLevel1 tells us how often, on average, the respective exon was included in the final mRNA transcripts in sample 1.

The entry in IncLevel can be read as the fraction (between 0 and 1) of normalized read counts in the neighborhood of the respective exon that indicate a splicing event in which the exon was included in the final processed mRNA transcript.

If the entry is 1, the exon was never skipped. In principle, you should not be able to find any exon in your rMATS SE output file that has an entry of 1 in both columns IncLevel1 and IncLevel2. However, I think that due to rounding errors you can still find a few such exons. Nevertheless, by definition of the SE output file, there are no exons that have zero reads in both columns SC_SAMPLE_1 and SC_SAMPLE_2.
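The inclusion-level formula above can be written out as a small function. This is just my own sketch of the equation for a single replicate, not code taken from rMATS; returning None when there are no reads at all is my choice for the illustration.

```python
def inclusion_level(i, s, inc_form_len, skip_form_len):
    """psi = (I/LI) / (I/LI + S/LS) for one replicate.

    Returns None when both counts are zero, because the ratio is undefined.
    """
    inc = i / inc_form_len      # length-normalized inclusion reads
    skip = s / skip_form_len    # length-normalized skipping reads
    if inc + skip == 0:
        return None
    return inc / (inc + skip)


# 170 inclusion reads over LI = 170 and 85 skipping reads over LS = 85
# give equal normalized counts, so psi is 0.5.
print(inclusion_level(170, 85, 170, 85))  # 0.5
# No skipping reads at all gives psi = 1.
print(inclusion_level(10, 0, 170, 85))    # 1.0
```

Note how the raw counts 170 and 85 differ by a factor of two, but after normalizing by the effective lengths both isoforms are equally supported.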

[Image: sashimi plot. Example of a low (purple) and high (red) entry in IncLevel, from the documentation of the rmats2sashimiplot tool.]


IncLevelDifference

The column IncLevelDifference just holds the difference between the IncLevel columns. It's calculated by:

IncLevelDifference = IncLevel1 - IncLevel2

Therefore, if the result is greater than zero, the entry in IncLevel2 is lower than in IncLevel1, and if it is lower than zero, the entry in IncLevel1 is lower than in IncLevel2.
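With replicates, IncLevel1 and IncLevel2 each hold several comma-separated values, and as far as I understand the difference is then taken between the averages of the two columns. The sketch below assumes that, and also assumes that missing replicates are written as NA and can simply be skipped; the function names are made up.

```python
def mean_inc_level(field):
    """Average the comma-separated inclusion levels, skipping NA entries."""
    values = [float(x) for x in field.split(",") if x not in ("", "NA")]
    return sum(values) / len(values) if values else None


def inc_level_difference(inc_level_1, inc_level_2):
    """mean(IncLevel1) - mean(IncLevel2), or None if one side has no values."""
    m1 = mean_inc_level(inc_level_1)
    m2 = mean_inc_level(inc_level_2)
    if m1 is None or m2 is None:
        return None
    return m1 - m2


# Illustrative values: mean 0.75 in sample 1, mean 0.25 in sample 2
diff = inc_level_difference("0.75,0.5,1.0", "0.25,0.25,NA")
print(diff)  # 0.5
```

A positive result here means the exon is included more often in sample 1 than in sample 2.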


OK but what's with the other files?

The columns of the other rMATS output files are not very different, and the meaning of the columns is self-explanatory most of the time.

However, in the file of alternatively used 5' splice sites, for example, you can find the columns IC_SAMPLE_1 and SC_SAMPLE_1 as well. In this case, those columns do not refer to the inclusion or skipping of an exon, of course, but to the inclusion or skipping of a splice donor (5'ss). Therefore, the column IC_SAMPLE_1 states the read counts where the respective 5'ss was used, and the column SC_SAMPLE_1 states the read counts where the respective 5'ss was not used.


List of sources

User guide to rMATS: http://rnaseq-mats.sourceforge.net/user_guide.htm

rMATS paper + SI materials and methods:

http://www.pnas.org/content/111/51/E5593.full.pdf?with-ds=yes

http://www.pnas.org/content/suppl/2014/10/14/1415762111.DCSupplemental/pnas.1415762111.sapp.pdf

Different google groups, mostly the rMATS User Group: https://groups.google.com/forum/#!forum/rmats-user-group

Documentation of rmats2sashimiplot: https://github.com/Xinglab/rmats2sashimiplot/blob/master/README.md


All information is subject to change!

Since I just started using rMATS, I may have misunderstood something. Any correction or addition you can contribute will be gratefully appreciated!


Perfect explanation! It cannot get more accurate. It could get more confusing (but helpful) if you started explaining the other events and how they are calculated.

Reply written 10 months ago by badribio230

Wow! That is extremely helpful and detailed! I did end up doing more research on this myself, and everything here seems to match my understanding. If I find any more useful information I will post it here.

Thank you 1000 times over!

Reply written 10 months ago by atcggcta0
Rose wrote (9 months ago):

The explanation was really helpful, but I have another question. rMATS was run on two samples with triplicates, and I got the outputs. In one of the output files, SE.MATS.JunctionCountOnly.txt, some of the values in IJC_SAMPLE_1, SJC_SAMPLE_1, IJC_SAMPLE_2 and SJC_SAMPLE_2 were in the form given below.

IJC_SAMPLE_1 | SJC_SAMPLE_1 | IJC_SAMPLE_2 | SJC_SAMPLE_2
205165220 | 2031,1593,1990 | 33,34,26 | 1510,1453,1357
1919,2572,2478 | 211264279 | 2879,2485,2185 | 113,91,83
105128143 | 679791704 | 30,35,35 | 752854767
289335392 | 135162162 | 275275280 | 36,23,26

Some of them are separated by commas, which I take to mean that they are the inclusion and skipping junction counts for the triplicates. But what about the single values like 211264279?


This could be an issue with the program you opened the file with. Is it still there if you open the file with a plain text editor instead of, e.g., Excel?
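To illustrate what I suspect is happening: if the raw file actually contains 211,264,279 (which is only a guess on my side), a spreadsheet program may treat the commas as thousands separators and show 211264279. Reading the file as plain tab-separated text keeps the replicate values apart. A small sketch with a made-up one-row excerpt:

```python
import csv
import io

# Made-up excerpt; the assumption is that the raw file holds "211,264,279".
raw = "IJC_SAMPLE_1\tSJC_SAMPLE_1\n211,264,279\t2879,2485,2185\n"

reader = csv.DictReader(io.StringIO(raw), delimiter="\t")
row = next(reader)

# Splitting on commas recovers one count per replicate
counts = [int(x) for x in row["IJC_SAMPLE_1"].split(",")]
print(counts)  # [211, 264, 279]

# Dropping the commas reproduces the single large number
collapsed = row["IJC_SAMPLE_1"].replace(",", "")
print(collapsed)  # 211264279
```

If a text editor shows the commas in your file, the data is fine and only the spreadsheet view was misleading.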

Reply written 7 months ago by caggtaagtat240