Hi All, I'm working through the kissplice TWAS pipeline to identify condition-specific SNPs among 3 phenotypes in a wild fish species. Its quite a large dataset, with 4 or 5 replicates per phenotype and 10M paired reads per replicate. Using default parameters and the --experimental algorithm (commands below), kissplice finds over 1 million Type0a SNPs. Increasing the relative filtering cutoff to 10% (-C 0.1) still finds over 700,000 SNPs. This seems huge given that the phenotypes represent alternative outcomes of phenotypic plasticity, so do not result from fixed genetic differences. Also, when I use the results files in kissDE (v1.5.0), the kissplice2counts function runs for hours (so far overnight and still going). I can run the example file with this command in a second. While my result files contain a huge number of SNPs, this function should simply be creating a data frame for further analysis? What is the expected runtime for this?
I am interested in SNPs that are fixed or close to fixed among my conditions (e.g. that there may be underlying genetic predisposition to transition among the 3 phenotypes). I had planned to filter the data frame (e.g. remove SNPS with <10 reads mapped in at least one sample) in R, but am concerned the files are too large for the kissplice2counts function.
I'm surprised at the number of SNPs. We have a rough draft genome and estimate genome-wide heterozygosity at ~1.15%, so neither low or excessive. I'm running kissplice again with a higher kmer value of 50, to hopefully reduce the number of SNPs further.
Any advice on an expected kissDE runtime or ways to refine the number of SNPs returned given my dataset, would be very much appreciated. Thanks in advance.
kissDE command: snpCounts <- kissplice2counts("results_0.1_type0a.fa", pairedEnd=TRUE) Running in R version 3.4.4., on a Macbook Pro 3.1 GHz Processor, 16 GB memory.
Kissplice commands: kissplice -t 6 -s 1 -k 41 --experimental -r filenames.... kissplice -t 6 -s 1 -k 41 --experimental -C 0.1 -r filenames....