Calling de novo SNPs between two species with KisSplice
9
0
Entering edit mode
3.9 years ago
MGlenn • 0

Hello,

I am thinking about using the KisSplice workflow to call SNPs between 2 closely related species. I have a Trinity assembly, and a transdecoder ORF file.

I have 4 paired-end samples total for 2 species - 2 samples per species. I would like to call SNPs between species 1 and species 2, to highlight the functional differences between these species.

According to this: kissplice -s 1 -k 41 --experimental -r cond1_replicat1_R1.fastq -r cond1_replicat1_R2.fastq -r cond1_replicat2_R1.fastq -r cond1_replicat2_R2.fastq -r cond2_replicat1_R1.fastq -r cond2_replicat1_R2.fastq -r cond2_replicat2_R1.fastq -r cond2_replicat2_R2.fastq

Would the SNPs that KisSplice calls be between cond1 and cond2 only? Or would it call SNPs between all combinations of cond1, replicate 1, replicate 2, cond2, replicate 1, replicate 2 etc?

My goal is to observe the number of SNPs that makes species 1 different from species 2, and which of those SNPs would cause an amino acid change.

Please let me know if this workflow is appropriate for my objective. Thank you.

SNP snp kissplice kisssplice2reftranscriptome • 2.0k views
0
Entering edit mode

Everything is solved. KissDE suited my needs well. Sorry for the late response. Thank you very much.

0
Entering edit mode

Since you were the author of the original question please go ahead and accept the actual answer below that helped solve your question. As things stand it is very difficult to figure out which is the real answer and which is clarifying correspondance.

For future reference: please use ADD COMMENT/ADD REPLY when responding to existing posts to keep threads logically organized

2
Entering edit mode
3.9 years ago

Dear MGlenn,

It seems to me that you might be wishing to find condition-specific (or species-specific) SNPs. You could use perhaps kissDE http://kissplice.prabi.fr/tools/kissDE/ kissDE is a R package, similar to DEseq, but which works on pairs of variants, and tests if a variant is enriched in one condition. It has been developped to work easily with KisSplice output. It can also work with a simple table of counts obtained by any other means. It requires at least two replicates per condition and at least two conditions.

In http://kissplice.prabi.fr/TWAS/ we show how you can intersect KisSplice2RefTranscriptome results with condition-specific SNPs by giving to K2rt the output of kissDE (look for the section "Running KissDE (optional)").

It might happen also that I misunderstood your message and my answer does not help you. In this case, please point this out to me, and I am going to further redirect your message to people that are more used to work with SNPs and the KisSplice pipeline...

Kind regards.

2
Entering edit mode
3.9 years ago

Dear user, Finding divergent sites between two species is what we did for D. mojavensis and D. arizonae in our paper. Like you, we had two paired-end samples for each species. I think that KissDE fits your needs.

Below is an example of the KisSplice output we obtain: >bcc_9|Cycle_0|Type_0a|upper_path_Length_83|C1_0|C2_0|C3_0|C4_0|C5_86|C6_187|C7_102|C8_161|Q1_0|Q2_0|Q3_0|Q4_0|Q5_60|Q6_59|Q7_58|Q8_59|rank_1.00000 CCTCAAGCAATGTCCATGCGACTTATTGAAAAATTTACACTGGAATTAGTCCAAGCGTCCATTGAGAACGACAAAAGACTGCA >bcc_9|Cycle_0|Type_0a|lower_path_Length_83|C1_80|C2_165|C3_100|C4_168|C5_0|C6_0|C7_0|C8_0|Q1_59|Q2_59|Q3_61|Q4_58|Q5_0|Q6_0|Q7_0|Q8_0|rank_1.00000 CCTCAAGCAATGTCCATGCGACTTATTGAAAAATTTACACTAGAATTAGTCCAAGCGTCCATTGAGAACGACAAAAGACTGCA

The "SNP" is G/A. The upper path corresponds to G. The lower path to A. C1 to C4 correspond to read counts in mojavensis. C5 to C8 correspond to read counts in arizonae. G is supported by 0 reads in mojavensis, and by (86+187, 102+161) reads in arizonae. A is supported by ~(80+165, 100+168) reads in mojavensis and 0 reads in arizonae. This "SNP" is clearly species specific. One allele is present in one species. The other allele is present in the other species. If you test it in KissDE, you will find a p-value close to 0, and a Deltaf (difference of allele frequency) of 1.

I would recommend you to run kissDE and then re-run the latest version of kissplice2reftranscriptome (v1.3.2) on your dataset. We added two columns in the output: the p-value and the DeltaF. Selecting the SNPs with a DeltaF of 1 or -1 seems to be what you need.

Note that if your two species diverged very recently, then letting the default value k=41 may be a good starting point. This will however lead you to miss the SNPs located less than 41nt apart. I would then recommend trying to decrease k to 25. This will give you more SNPs, in particular in regions which evolved faster since the speciation. Calibrating k using one gene for which you have some expectation may be a good pragmatic approach.

Best,

Vincent

1
Entering edit mode
3.9 years ago

Dear user,

The running time depends on the number of SNPs you have. KissDE should take ~2h for 50000 SNPs. It seems you have many more SNPs. If most of them are cases where one allele got fixed in one species and the other allele got fixed in the other species, then running KissDE might be an overkill. You can simply check the column Allele Frequency in the output of KisSplice2RefTranscriptome. If the allele frequency is 1 for one species and 0 for the other AND read counts are larger than 10, then you probably do not need to test it in KissDE. You readily found a divergent site. You can then filter those out of your KisSplice_type0a.fa file and run KissDE on the remaining SNPs (those where there is still some polymorphism in at least one species). Hopefully, this will be much faster.

Best,

Vincent

0
Entering edit mode

Ah yes, that makes sense to me now. A very good explanation, Thank you very much, Vincent.

0
Entering edit mode
3.9 years ago

Dear MGlenn,

The command you described would call SNPs between all combinations of cond1, replicate 1, replicate 2, cond2, replicate 1, replicate 2 etc. In KisSplice, firstly the bubbles corresponding to SNPs are found, and only after they are quantified according to conditions, replicates, etc. Thus, you will have to filter out the SNPs that you are interested on.

http://nar.oxfordjournals.org/content/early/2016/07/25/nar.gkw655.abstract

http://kissplice.prabi.fr/TWAS/

Kind regards.

0
Entering edit mode
3.9 years ago
MGlenn • 0

Thanks for getting back to me Leandro, I appreciate it. I should have mentioned that I also ran the kissplice2reftranscriptome step as well and obtained the mainOutput.tsv file of SNPs that were called between all combinations of cond1, rep1, rep2, cond2, rep1, rep2 etc.

So, it is possible to filter the variants in such a way that I can see how two samples from one species differ from the samples of the other species? For example: calling SNPs between condition 1 (or species 1) (rep1,rep2) and condition 2 (or species 2) (rep1,rep2), which would highlight SNPs between my species.

I'm just trying to figure out some post-processing steps that I can take to highlight SNPs between species 1 and species 2. Is there a column that describes the samples that the SNP was called between?

M

0
Entering edit mode
3.9 years ago

Thanks again for getting back to me so quickly.

I'm concerned with finding SNPs between conditions, rather than within a condition. For example, if I consider each of my two species a separate condition (species 1 = cond 1, species 2 = cond 2) , then I would like to find SNPs between those conditions, rather than within the condition like kissDE does.

kissDE would tell me which SNPs are enriched in one condition (one species), but instead I would like to find SNPs that occur between my two separate species (SNPs between species 1 and species 2), and see which of those SNPs would cause an amino acid change. Basically, I want the SNPs to highlight functional differences between my two species. I would like to see which SNPs would result in an amino acid sequence if a SNP allele from species 1 was introgressed into species 2.

Does that make sense? Sorry if I'm being confusing. Again, thanks for your help.

0
Entering edit mode
3.9 years ago

Hi Vincent,

Thank you for the detailed explanation, it was extremely useful. Sorry for my misunderstanding, kissDE does suit my needs. I think I am on the right track now. What is the run time of kissDE in your experience? I used kissDE on a cluster previously, and it ran for 10+ days. I was wondering if that run time is to be expected or not.

0
Entering edit mode
3.9 years ago

Is there any sources of script to process the mainOutput.tsv file? Eg. summarizing SNPs based on allele frequencies, read counts, and other important descriptors? I only have a year of bioinformatics experience under my belt, and the learning curve has been steep for me in terms of processing more complex datasets with text processing programming language. Any info or sample scripts that are used to process kissplice/kissplice2refranscriptome outputs would be greatly appreciated.

Cheers, and thanks again for all of the insight and prompt replies, I really appreciate it.

M

0
Entering edit mode
3.6 years ago

Dear TrentGenomics,

Does kissDE finished running? I guess you were on the right track, once you get kissDE output you might have your answer.

You can try to process .tsv files using spreadsheet processors like Excel and LibreOffice Calc.

Cheers.

0
Entering edit mode

No problem!

I used a perl script to filter out SNPs of interest.

Thanks for getting back to me.

0
Entering edit mode

Dear TrentGenomics,

Do you have any more doubts or are all of them solved? Did kissDE suit your needs or your question needs another tool?

Thanks.

Traffic: 2473 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.