Question

Differential Expression of Splice Junctions using voom-limma/voom-diffSplice

2

Entering edit mode

7.2 years ago

stefanos.bamopoulos ▴ 40

Hello guys,

I am interested in the differential expression of splice junctions. For this purpose I have a count matrix containing counts (per sample) from reads spanning each splice junction. I want to perform a DE analysis on the splice junction level and I have not found any suitable tools for this task.

I am aware of one method:

http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0136653

but I am reluctant to use it, because DEXSeq is too computationally expensive for the number of samples I have.

I have already used a voom-limma pipeline for DE analysis of genes & transcripts, as well as a voom-diffSplice pipeline for DE analysis of exons.

Now I am curious if a voom pipeline would be suitable for DE of splice junctions (I remember having read somewhere, that limma is suitable for any genomic feature, but since a splice junction consists of 2 borders and does not have actual length, I am wondering if it would be a valid analysis.)
If it is, then I'm unsure if voom-limma or voom-diffSplice is more appropriate for this purpose

I would appreciate any help or caveats regarding such an analysis with limma, or any tip regarding the use of other tools to perform such an analysis.

Thanks in advance!

Stefan

RNA-Seq DE limma splice junctions • 3.4k views

ADD COMMENT • link updated 6.3 years ago by Biostar 20 • written 7.2 years ago by stefanos.bamopoulos ▴ 40

score 0 · Answer 1 · 2017-05-10

0

Entering edit mode

7.2 years ago

Malcolm.Cook ★ 1.5k

Please explain more carefully your experiment and the questions you want to put to your data.

I ask because I expect you are really interested in the "expression of splice junctions" as compared to something else (say, roughly, "expression of exons". )

I guess this because you are considering DEXSeq as being able to answer your questions, and DEXSeq can not use the data you have since it quantifies "differential exon usage" and nowhere in its analysis does it use data such as you have (what might be called "junction read counts").

Its manual does state that " a change in relative exon usage is typically due to a change in the rate with which this exon is spliced into transcripts (alternative splicing)" but the relationship is assumed and neither tested nor modelled.

Here are some questions to answer that might allow folk in this forum to help you better. Don't worry about the algorithm or the test or how "computationally expensive" anything is now. Focus on the science and the data:

Why do you say "DEXSeq is too computationally expensive for the number of samples I have."?
How many conditions do you have, and what are they?
How many biological replicates of each condition?
Any technical replicates?
Do you really only have junction read counts to go by? What happened to the upstream results (alignments, fastqs, etc)?
And most importantly, what are you really hoping to ask of you data?

ADD COMMENT • link 7.2 years ago by Malcolm.Cook ★ 1.5k

0

Entering edit mode

Hello Malcolm,

thank you for your reply. I apologize for not being more specific. I will try to answer your questions as best as I can.

0) I am aware that DEXSeq is used for measuring differential exon usage. The link I attached to my original question links to a paper, in which the authors modified their input data, so that DEXSeq can measure differential splice junction usage. So you are right, it is not stated anywhere in the manual and it is not its original purpose, but nevertheless it has been used to gather the information I am looking for.

1) I say that DEXSeq is too computationally expensive because I ran it for DE of exons and it took more than a week (!) to complete running on 12 cores. I'm pretty sure, that I ran it correctly, since I tested it on both the galaxy platform and on the command line and it did not produce any errors or warnings.

2-4) I have a total of 250 cancer samples, each a biological replicate. I do not have any technical replicates. My conditions are specific point mutations that were identified using a different method (amplicon sequencing). I am mainly interested in 3 of those mutations and I am looking at them independently, because they do not associate with each other. So basically I have a binary condition (patient has the mutation or not) and its corresponding confounders, which I have assessed in a prior analysis.

5) I have RNAsequencing data (fastq files) of all patients. I have aligned them using STAR and quantified them using salmon, but I am free to run any analysis on them, if necessary.

6) I am interested in the DE of splice junctions, because I have reason to believe that the aforementioned mutations lead to errors in splice site recognition and therefore affect specific splice junctions all over the genome and not just single exons or genes. What I am hoping to achieve is to gather all splice junctions that were DE and find similarities among them in a post-hoc analysis.

Thank you for your time!

Stefan

ADD REPLY • link 7.2 years ago by stefanos.bamopoulos ▴ 40

0

Entering edit mode

Hi Stefan,

re 0: thanks for your clarification on the adoption of DEXSeq
re 2-4: do I understand from this that you have 250 individual samples without any biological replicates? If so, this will challenge any differential analysis technique that relies upon variance analysis. However, you might profitably model this with your binary condition, hasTheMutation, in effect giving you 247 of your 250 samples as "replicates" of one class and 3 of the other. Is that how you modelled it?

You might consider to use as your response variable a measure of intron retention, often reported as "PSI" (percent spliced in). PSI might be computed at each intron as I / (I + J), where J is the number of Junction reads (which is handily provided for you as one of STAR's output files) and I is mean coverage of the intronic region. Then your model becomes something like (in R)

glm(PSI ~ hasTheMutation + correspondingConfounders, family = quasibinomial, data=allIntrons)

to be evaluated separately at each intron, possibly limiting your analysis data to just annotated / known introns.

Then, run motif discover (MEME?) and contrast the acceptor/donor site loci of those introns whose PSI is "significantly" determine by hasTheMutation with those that don't.

You might choose to initially "focus" such analysis on introns that don't overlap any other introns.

FWIW, if you pursue this approach, to assist in allowing any visualization/tracks you generate to correspond to your statistics, it helps to put I and J on the same scale by normalizing to RPM (reads / million). STAR emits uniquely mapped read count in its final log files which you can divide by 10**6 and divide that into the junction counts producing J on RPM scale. Similarly, star has option to produce RPM scaled bigwig from which you can compute mean TPM at each intron, producing I on RPM scale. Plotted Result: in IGV, when the coverage bigwig goes to zero at (known?) introns, your J track kicks in at the same value, unless there is intron retention, in which case you I track proportionally kicks in.

ADD REPLY • link 7.2 years ago by Malcolm.Cook ★ 1.5k

0

Entering edit mode

Hi Malcolm,

thank you for your response. You are right, I am using 250 individual samples, without any replicates. As you mentioned yourself I am using the binary variable "hasTheMutation" + confounders to make my model.

What you are proposing sounds interesting, but I am unsure about using PSI as a response variable. Since I am mainly interested in SJs (splice junctions) and not introns I would prefer to directly use the junction reads as an input to a DE-program. This way I would get a direct measure of differential SJ usage and I would have the additional benefit of being able to test unannotated splice junctions for DE. In the paper I linked to in my original question DEXSeq was used for this purpose.

So coming back to my original question I am wondering if I could replicate such an analysis with voom+limma. From my understanding it should be possible since STAR provides in its output file the information about how many reads are supporting a specific splice site, which if I am not mistaken translates to the number of times a particular splice site was "used". This seems to me very similar to a measure of "expression" of e.g. genes or transcripts. What I am unsure about is if using the junction counts as I would use gene counts or transcript counts violates any assumption made by most DE-tools.

Again, thank you very much for your time!

Stefan

ADD REPLY • link 7.2 years ago by stefanos.bamopoulos ▴ 40