Identical splicing events reported twice with KisSplice ?
18 months ago

Hi,

While going through the results of KissDE, I noticed a strange repetition of events that I didn't see before the update of KisSplice2RefGenome (v2.0.0).

Two different examples here:

1. Two strictly identical IR events for NUBP2. For both, the genomic positions of the splice sites (on the lower path) are 1836656 and 1836758. The variable part length is 101 for both.

The only difference comes from the genomic block size of the upper path: 177 for one, 178 for the other. So unless I am mistaken, I am looking here at the exact same intron retention. But the event has been reported twice by KisSplice, with only a 1-base difference in the sequence, not even in the event itself.

EDIT: both sets of sequences (bcc_7866|Cycle_2 and bcc_7866|Cycle_13) have a substitution (C>A) at the very last base, so it is present in both the upper and the lower path, i.e. outside of the intron.

Another example, even stranger:

2. Two IR events for TSPAN32. This time there is absolutely no difference. Same block size, same splice sites, same variable part.

In the end, the only difference I can see is a small variation in the read coverage: only one read for only one sample in the first example, and one or two reads for several samples in the second example. But it is still the same event...

EDIT: both sets of sequences (bcc_167629|Cycle_2352655 and bcc_167629|Cycle_2352656) have exactly the same lower path, while there is a T>A substitution in the upper path, so directly in the intron. That explains it, I guess.

I might not be clear, so here are the 2 examples with all the data from KissDE : https://docs.google.com/spreadsheets/d/1K9FSZAqcEcu8QLos6yqXAG3BU8LYxX5eI1HWuiDvJBw/edit?usp=sharing

There are several other examples like that, not limited to intron retention, and in several analyses (on completely different samples).

I don't know if the aligner might have something to do with it, but as far as I remember, I have used the same version of STAR, before and after the Kiss2refgenome update.

EDIT: so the culprit was a one-base variation. In the first example it is outside of the intron, in the second example inside the intron. In the end, those events really are duplicates. It is still strange that this type of variation didn't appear before the update. In a list of 2000 differentially expressed events, there are something like 150, maybe 200 of those "duplicated" events. (At a quick glance, it is the same for my other analyses.)

EDIT: Maybe I should have mentioned that this analysis was only done with the type_1 file of KisSplice, so it only concerns splicing events.

kissplice • 1.0k views
17 months ago

Hello David,

Sorry for the very late answer!

You are absolutely right, this redundancy removal step has not been implemented in the latest version of kissplice2refgenome; that is a mistake. We will add it as soon as possible! Thank you very much for digging into that bug and reporting it to us!

Regards, Audric Cologne


Hi,

thanks for the answer! So to be clear, there is no reality behind those events? Because I have another case in mind where 2 events seem to be exactly the same, except for the junction. There was a one-base difference exactly at the junction site, shifting it from a canonical to a non-canonical site.

Thanks !


Hi David,

Short answer is, these events are real as they are supported by reads, but most of the time we should merge them together.

The redundancy problem comes from a particular and key structure of the de Bruijn graph: the bubble. KisSplice is optimised to find such structures because each splicing event creates a bubble in the de Bruijn graph. BUT not all bubbles describe a splicing event. SNVs, indels and inexact repeats, among others, also create bubbles in a de Bruijn graph. Now, let's say that we have an intron retention event (1 bubble), but the retained intron exists in two forms: with or without a deletion. This will create a bubble inside the previous bubble. As a result, KisSplice will output ALL POSSIBLE BUBBLES: spliced intron + retained intron without deletion AND spliced intron + retained intron with deletion. And we have a "duplicated" event. The point is, if one is interested in splicing events, this deletion does not carry useful information.
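To make the nested-bubble case concrete, here is a minimal Python sketch. It is not KisSplice code and it does not build a de Bruijn graph: the sequences and path names are made up, and it simply enumerates every pair of alternative paths, which is roughly what a pairwise bubble report amounts to.

```python
from itertools import combinations

# Toy sequences (hypothetical, not real KisSplice output).
EXON_A = "ACGTAC"
EXON_B = "TTGCAA"
INTRON = "GTAAGTCCCTAG"
INTRON_WITH_DELETION = "GTAAGTCCTAG"  # same intron with one base deleted

# Three alternative paths between the same two exons.
paths = {
    "spliced": EXON_A + EXON_B,
    "retained": EXON_A + INTRON + EXON_B,
    "retained_del": EXON_A + INTRON_WITH_DELETION + EXON_B,
}


def pairwise_bubbles(paths):
    """Return every pair of alternative paths as (shorter, longer)."""
    pairs = []
    for a, b in combinations(paths, 2):
        lower, upper = sorted((a, b), key=lambda name: len(paths[name]))
        pairs.append((lower, upper))
    return pairs


bubbles = pairwise_bubbles(paths)
# Pairs involving the spliced path both look like the same intron retention.
ir_like = [pair for pair in bubbles if "spliced" in pair]
print(bubbles)
print(ir_like)
```

The two pairs in `ir_like` describe the same intron retention; they differ only by the deletion, which is the "duplicated" event described above.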

The main issue is that redundant bubbles create problems during the quantification step, as reads will be multimapped between redundant bubbles (except for the reads covering the indel site in our example, which are the only ones that can discriminate between the two bubbles), and we will end up losing statistical power by splitting our reads among redundant bubbles.

We are currently working on KisSplice to integrate the redundancy removal, among other major performance and accuracy improvements to the quantification step. So, in the near future, KisSplice (and not KisSplice2RefGenome) will merge the redundant bubbles.

I hope this was clear enough... Do not hesitate to ask us any questions, we'll be glad to answer :)

Have a nice day!


Hi,

It was very clear. It is good to see those new developments for KisSplice !

Thanks for your help and have a nice day !


Hi to anyone that could be interested!

Due to technical problems, we did not add this feature in KisSplice... yet! But we added it to the latest version of KisSplice2RefGenome, that you can find here: click me

The way this works is not very satisfying: we only keep the first event of any number of duplicated events. We will improve on this duplication problem in new versions of KisSplice!
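As an illustration of the "keep the first event" idea only (this is not the actual KisSplice2RefGenome code), a minimal Python sketch could look like the following. The dictionary fields and the choice of key (gene, event type, splice sites) are assumptions made for the example; the IDs and positions are the NUBP2 case from the original question.

```python
def keep_first_event(events, key_fields=("gene", "event_type", "splice_sites")):
    """Keep only the first event seen for each key, preserving input order.

    The key fields are hypothetical; any hashable description of
    "the same event" would work.
    """
    seen = set()
    kept = []
    for event in events:
        key = tuple(event[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            kept.append(event)
    return kept


events = [
    {"id": "bcc_7866|Cycle_2", "gene": "NUBP2", "event_type": "IR",
     "splice_sites": (1836656, 1836758)},
    {"id": "bcc_7866|Cycle_13", "gene": "NUBP2", "event_type": "IR",
     "splice_sites": (1836656, 1836758)},
]

kept = keep_first_event(events)
print([event["id"] for event in kept])
```

Note that the read counts of the discarded duplicates are simply dropped in this sketch, which is one reason the approach is not very satisfying.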

Audric


Hi,

thank you for the update. However I can't install it.

For KisSplice2RefGenome 2.0.0, I installed it simply with python3 and it worked perfectly :

sudo python3 setup.py install


However, for 2.0.1, I get :

File "setup.py", line 9
except ImportError, e:
^
SyntaxError: invalid syntax


Something about Python 3 vs Python 2, I guess? (I am more familiar with Perl than Python, to be honest!) As this was not in the setup.py of version 2.0.0, I removed the ", e". Of course this is not a real fix, but it lets the install go further, until it stalls at:

running install_scripts
running build_scripts
error: file '...../kissplice2refgenome-p3/kissplice2refgenome' does not exist


Trying to install with python2 doesn't work either.
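For reference, the SyntaxError above is what Python 3 raises for the Python 2-only form `except ImportError, e:`. The form with `as` is accepted by Python 3 (and by Python 2.6+). A minimal sketch, where the module name is deliberately made up so that the import fails:

```python
# "except ImportError, e:" is Python 2-only syntax.
# "except ImportError as e:" works in Python 3 and in Python 2.6+.
try:
    import a_module_that_does_not_exist  # hypothetical name, chosen to fail
except ImportError as e:
    message = "import failed: {}".format(e)
else:
    message = "import succeeded"

print(message)
```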


Hello,

Oh yeah, the files were mixed up... This should work now! Thanks for the report :)


Hello,

thanks, it does work now. I also noticed that in the frameshift column, everything has been inverted: the true values are now false and vice versa. It is far more logical like that!

Thanks again !

7 weeks ago

Hello dear user,

The redundancy removal is a post-KisSplice step that was added to kissplice2refgenome in order to remove redundant bubbles corresponding to alternative splicing events only, as kissplice2refgenome is not usable with Type_0 (SNP) bubbles.

Currently, KisSplice does not do any redundancy removal. Triallelic SNPs will result in multiple bubbles (pairs of sequences), each sharing one of its paths with another bubble, as KisSplice outputs all pairwise bubbles found in the de Bruijn graph.

So I do not think that there is a straightforward way to remove redundancy induced by tri- (or more) allelic SNPs from KisSplice results.

Also, I think this post may help you, as it seems related to what you are observing.

Concerning your second question, multiple SNPs will appear in the same bubble if they are separated by k-1 or fewer nucleotides, k being the k-mer size (41 by default). Note that if the SNPs are perfectly linked and within that distance, only one bubble will be output; otherwise each haplotype will correspond to a path, and KisSplice will output all pairwise paths.
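As a rough illustration of this distance rule, here is a small Python sketch that groups SNP positions expected to fall in the same bubble. It is not part of KisSplice; the function name is made up, and it assumes that "separated by k-1 or fewer nucleotides" means the number of bases strictly between two consecutive SNPs.

```python
def group_snps(positions, k=41):
    """Group SNP positions that would be expected to share a bubble.

    Assumption: two consecutive SNPs share a bubble when the number of
    bases strictly between them is at most k - 1.
    """
    groups = []
    for pos in sorted(positions):
        if groups and pos - groups[-1][-1] - 1 <= k - 1:
            groups[-1].append(pos)   # close enough: same bubble
        else:
            groups.append([pos])     # too far: start a new bubble
    return groups


print(group_snps([100, 120, 141, 300]))
```

With the default k of 41, the first three positions end up in one group and the last one on its own.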

I am not aware of a way to group the variations as a post-treatment of KisSplice results.

I hope this helps! Audric

8 weeks ago
Bro • 0

Hi,

I have a similar question. I noticed some redundant sequences in the KisSplice output files. Several sequences are identical, with the same identifier and the same count. This is the case, for example, when there are triallelic SNPs. For example:

>bcc_8592|Cycle_6|Type_0a|upper_path_length_63|C1_0|C2_5|Q1_0|Q2_65|rank_1.00000
ACAGGTTGGGATGGAGGGAGTTTACAGGAAGGAGACAGGGCCAACGTCGAAGCCGAATTCCTC
>bcc_8592|Cycle_6|Type_0a|lower_path_length_63|C1_4|C2_0|Q1_34|Q2_0|rank_1.00000
ACAGGTTGGGATGGAGGGAGTTTACAGGAAGTAGACAGGGCCAACGTCGAAGCCGAATTCCTC
>bcc_8592|Cycle_7|Type_0a|upper_path_length_63|C1_0|C2_5|Q1_0|Q2_65|rank_0.23570
ACAGGTTGGGATGGAGGGAGTTTACAGGAAGGAGACAGGGCCAACGTCGAAGCCGAATTCCTC
>bcc_8592|Cycle_7|Type_0a|lower_path_length_63|C1_2|C2_10|Q1_63|Q2_50|rank_0.23570
ACAGGTTGGGATGGAGGGAGTTTACAGGAAGAAGACAGGGCCAACGTCGAAGCCGAATTCCTC

But I found these redundancies in the 6 FASTA files. Is there a way to remove these redundancies? Moreover, some variations are very close but are not grouped in a large bubble. Is there a way to group all these variations under the same identifier?

Thanks, have a nice day !

7 weeks ago

Dear user,

Indeed, KisSplice is pairwise, and in the case of tri-allelic SNPs, it will report every pair of variants.

>bcc_8592|Cycle_6|Type_0a|upper_path_length_63|C1_0|C2_5|Q1_0|Q2_65|rank_1.00000
ACAGGTTGGGATGGAGGGAGTTTACAGGAAGGAGACAGGGCCAACGTCGAAGCCGAATTCCTC
>bcc_8592|Cycle_6|Type_0a|lower_path_length_63|C1_4|C2_0|Q1_34|Q2_0|rank_1.00000
ACAGGTTGGGATGGAGGGAGTTTACAGGAAGTAGACAGGGCCAACGTCGAAGCCGAATTCCTC

Corresponds to G vs T. G is supported by 0 reads in sample 1 and 5 reads in sample 2. T is supported by 4 reads in sample 1 and 0 reads in sample 2.

>bcc_8592|Cycle_7|Type_0a|upper_path_length_63|C1_0|C2_5|Q1_0|Q2_65|rank_0.23570
ACAGGTTGGGATGGAGGGAGTTTACAGGAAGGAGACAGGGCCAACGTCGAAGCCGAATTCCTC
>bcc_8592|Cycle_7|Type_0a|lower_path_length_63|C1_2|C2_10|Q1_63|Q2_50|rank_0.23570
ACAGGTTGGGATGGAGGGAGTTTACAGGAAGAAGACAGGGCCAACGTCGAAGCCGAATTCCTC

Corresponds to G vs A. G is supported by 0 reads in sample 1 and 5 reads in sample 2. A is supported by 2 reads in sample 1 and 10 reads in sample 2.

There is probably also a third bubble corresponding to T vs A (possibly bcc_8592|Cycle_8). Overall, in sample 1, T is supported by 4 reads and A by 2 reads. In sample 2, G is supported by 5 reads and A by 10 reads. You can derive allele frequencies from those counts, but these estimates would not be very robust because the coverage is low (the gene is probably poorly expressed). More importantly, if you want to assess whether the allele frequency changed between your conditions, you need replicates. Then you can use KissDE.
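As a rough sketch of deriving allele frequencies from those counts (this is not a KisSplice or KissDE feature, and the helper names are made up), one could parse the C1_, C2_ fields from the headers and count each allele once, even though G appears in two bubbles:

```python
import re


def read_counts(header):
    """Extract per-sample read counts (the C1_x, C2_y fields) from a header."""
    return [int(n) for n in re.findall(r"\|C\d+_(\d+)", header)]


# One header per allele, taken from the two bubbles above.
# G is listed once on purpose so that its reads are not counted twice.
allele_headers = {
    "G": "bcc_8592|Cycle_6|Type_0a|upper_path_length_63|C1_0|C2_5|Q1_0|Q2_65|rank_1.00000",
    "T": "bcc_8592|Cycle_6|Type_0a|lower_path_length_63|C1_4|C2_0|Q1_34|Q2_0|rank_1.00000",
    "A": "bcc_8592|Cycle_7|Type_0a|lower_path_length_63|C1_2|C2_10|Q1_63|Q2_50|rank_0.23570",
}
counts = {allele: read_counts(h) for allele, h in allele_headers.items()}


def allele_frequencies(counts, sample_index):
    """Return allele -> frequency for one sample (None if no reads at all)."""
    total = sum(c[sample_index] for c in counts.values())
    if total == 0:
        return {allele: None for allele in counts}
    return {allele: c[sample_index] / total for allele, c in counts.items()}


for i in range(2):
    print("sample", i + 1, allele_frequencies(counts, i))
```

With 6 reads in sample 1 and 15 in sample 2, these frequencies are only indicative, for the coverage reasons given above.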

Notice however that KissDE requires pairs of variants. We currently do not provide an easy-to-use solution for this special case of tri-allelic SNPs, which is, in principle, quite rare.

If you are interested in alternative splicing events (type1.fa output of KisSplice), then you can use kissplice2refgenome as a post-treatment. It will indeed remove some of the redundancy, in particular when there are SNPs located inside a skipped exon: the skipped exon will be reported twice by KisSplice, but only once in the output of KisSplice2RefGenome. AS events located in the same gene will also be assigned the same gene name.

If you do not have a reference genome and you are interested in SNPs, you may be interested in kissplice2reftranscriptome and the following post might be relevant to read: KisSplice/KissDE/kissplice2reftranscriptome filtering advice. Interpretation of the results.

I hope this helps,

Vincent