kissplice SNP output and kissDE runtime
2
1
Entering edit mode
4.1 years ago
erica ▴ 10

Hi All, I'm working through the kissplice TWAS pipeline to identify condition-specific SNPs among 3 phenotypes in a wild fish species. Its quite a large dataset, with 4 or 5 replicates per phenotype and 10M paired reads per replicate. Using default parameters and the --experimental algorithm (commands below), kissplice finds over 1 million Type0a SNPs. Increasing the relative filtering cutoff to 10% (-C 0.1) still finds over 700,000 SNPs. This seems huge given that the phenotypes represent alternative outcomes of phenotypic plasticity, so do not result from fixed genetic differences. Also, when I use the results files in kissDE (v1.5.0), the kissplice2counts function runs for hours (so far overnight and still going). I can run the example file with this command in a second. While my result files contain a huge number of SNPs, this function should simply be creating a data frame for further analysis? What is the expected runtime for this?

I am interested in SNPs that are fixed or close to fixed among my conditions (e.g. that there may be underlying genetic predisposition to transition among the 3 phenotypes). I had planned to filter the data frame (e.g. remove SNPS with <10 reads mapped in at least one sample) in R, but am concerned the files are too large for the kissplice2counts function.

I'm surprised at the number of SNPs. We have a rough draft genome and estimate genome-wide heterozygosity at ~1.15%, so neither low or excessive. I'm running kissplice again with a higher kmer value of 50, to hopefully reduce the number of SNPs further.

Any advice on an expected kissDE runtime or ways to refine the number of SNPs returned given my dataset, would be very much appreciated. Thanks in advance.

kissDE command: snpCounts <- kissplice2counts("results_0.1_type0a.fa", pairedEnd=TRUE) Running in R version 3.4.4., on a Macbook Pro 3.1 GHz Processor, 16 GB memory.

Kissplice commands: kissplice -t 6 -s 1 -k 41 --experimental -r filenames.... kissplice -t 6 -s 1 -k 41 --experimental -C 0.1 -r filenames....

rna-seq R kissDE kissplice SNP • 1.4k views
1
Entering edit mode
4.1 years ago

Dear Erica,

We are investigating the kissDE performance to answer you. What I might help you right now is on removing SNPS with <10 reads mapped in at least one sample, just after running KisSplice.

You could process the fields

...|C1_X|C2_Y|C3_Z|...


in the header of the SNPs and remove the SNPs such that there exists one CX_Y, with Y < 10, in either the upper or the lower path.

Since such post-process would be done in a script after running KisSplice, and before running kissDE, it will run fast and could potentially filter out many bubbles.

Please tell us if you need help on this script, and if it suits your needs.

Kind regards.

0
Entering edit mode

Thanks for the quick response, and the helpful suggestion. Some help with the script would be fantastic - as I have 3 conditions, I would be concerned that removing all SNPs with <10 counts in at least one individual might remove instances of differential expression between two conditions? Best regards, Erica.

0
Entering edit mode

Dear Erica,

Did the Bioconductor version of kissDE work for you?

Here you can find a very simple script to filter out some lowly-expressed bubbles: https://www.dropbox.com/s/6zfx946kqo7yxfb/remove_lowly_covered_bubbles.py?dl=1

You call it as:

python remove_lowly_covered_bubbles.py <file> <count_threshold> <min_nb_files>


where:

<file> is the bubbles file,
<count_threshold> is an integer specifying what would be low expression (e.g. 10),
<min_nb_files> is the number of conditions CX_Y, with Y < <count_threshold>, needed to filter out the bubble.
The output will be written to <file>._filtered_<count_threshold>_<minNbFile>.fa


This script is far from ideal (i.e. it does not take into account biological replicates and conditions), but it can filter out many of uninteresting bubbles, with fairly low counts.

For example, to filter out all bubbles where the upper or the lower path contains more than 4 read files with less than 10 reads mapping to it, we would use:

python remove_lowly_covered_bubbles.py <file> 10 4


If it does not help you, maybe we should take into conditions and biological replicates.

1
Entering edit mode
4.1 years ago

Dear Erica,

Firt of all, a new version of kissDE is now available on Bioconductor (development version): https://bioconductor.org/packages/devel/bioc/html/kissDE.html.

You can install it by installing the devel version of Bioconductor or directly installing the source package. Be aware, that we had to modify the version numbering to submit to Bioconductor. So the current version 0.99.3 is actually a newer version than the one you are using.

In this new release, we improved the running time of the kissplice2counts function. We are still working on improving the global processing time of the package. You can find in the package vignette some running time estimation on smaller datasets. In this vignette, you will see that the limiting step is the differential analysis.

Even with this new version of kissDE, because of the very high number of SNP in your dataset, the loading of the data will still be quite long and the running time of the differential analysis could be problematic. So I definitely recommend to filter your data before kissDE as suggested by Leandro.

Best,

Clara

0
Entering edit mode

Thank you Clara. I will try and install the new Bioconductor version of kissDE, and filter the kissplice output to drop the number of SNPs to a more manageable size. Will let you know how things run.

Best regards, Erica.