Tool: DiscoSnp++ 2.1.2 release: now genotypes and creates VCFs
7.2 years ago

DiscoSNP++ is a reference-free SNP/indel discovery tool.

From version 2.1.2:

1/ discoSnp++ generates a VCF as output:

• Without mapping positions if no reference genome is available
• With mapping positions otherwise. In that case, discoSnp++ uses bwa for mapping.

2/ discoSnp++ computes genotypes from predicted coverages of variants. These predictions are reported in both the FASTA output and the VCF file.

As usual, any comments or feedback (negative or positive :)) are welcome.

Pierre

discosnp SNP indel genotyping Tool • 2.2k views

Hello, compilation of the latest version fails for some reason (first time I try to compile). Any ideas? http://pastebin.ca/2962771 I am running CentOS 64-bit.


Hi,

Are you using clang?

If so, could you try gcc instead?
This requires bypassing the automatic compilation script (./compile_discoSnp++.sh). For instance, if you have gcc 4.9 installed:

rm -rf build
mkdir build
cd build
cmake -DCMAKE_C_COMPILER=gcc-4.9 -DCMAKE_CXX_COMPILER=g++-4.9 ..
make
cd ..


Best, Pierre


Hello, that worked! After running cmake as follows:

cmake -DCMAKE_C_COMPILER=/opt/centos/devtoolset-1.0/root/usr/bin/gcc -DCMAKE_CXX_COMPILER=/opt/centos/devtoolset-1.0/root/usr/bin/g++ ..

I ran make. Where can I find the binaries? Thanks!


Nice to know, thanks.

Binaries are in the ROOT/build/tools/ directory.

However, note that the main script is ROOT/run_discoSnp++.sh.

Pierre


Quick question (is it OK to ask questions here, or should I make a new post?). I am a bit confused about the difference between the variants found in the coherent and uncoherent files. I thought it was based purely on coverage. Here are 2 sets of sequences that are considered coherent:

>SNP_higher_path_20|P_1:30_T/G|high|nb_pol_1|C1_114|C2_97|C3_96|C4_92|C5_88|G1_0/0:9,347,2284|G2_0/0:9,296,1944|G3_0/0:8,293,1924|G4_0/0:12,251,1808|G5_0/1:144,133,1581|Q1_71|Q2_70|Q3_70|Q4_68|Q5_66|rank_0.32756
>SNP_lower_path_20|P_1:30_T/G|high|nb_pol_1|C1_0|C2_0|C3_0|C4_2|C5_16|G1_0/0:9,347,2284|G2_0/0:9,296,1944|G3_0/0:8,293,1924|G4_0/0:12,251,1808|G5_0/1:144,133,1581|Q1_0|Q2_0|Q3_0|Q4_47|Q5_61|rank_0.32756

>SNP_higher_path_178|P_1:30_A/C|high|nb_pol_1|C1_7|C2_10|C3_21|C4_3|C5_0|G1_1/1:1644,187,48|G2_1/1:1653,170,76|G3_0/1:1907,152,191|G4_1/1:2071,279,16|G5_1/1:2044,311,9|Q1_58|Q2_57|Q3_51|Q4_49|Q5_0|rank_0.22405
>SNP_lower_path_178|P_1:30_A/C|high|nb_pol_1|C1_87|C2_89|C3_107|C4_106|C5_102|G1_1/1:1644,187,48|G2_1/1:1653,170,76|G3_0/1:1907,152,191|G4_1/1:2071,279,16|G5_1/1:2044,311,9|Q1_65|Q2_61|Q3_63|Q4_67|Q5_68|rank_0.22405

and then these 2 sets are considered uncoherent:

>SNP_higher_path_69|P_1:30_T/G|high|nb_pol_1|C1_13|C2_14|C3_9|C4_3|C5_0|G1_1/1:1393,123,115|G2_1/1:1520,134,123|G3_1/1:1761,190,64|G4_1/1:1614,213,18|G5_1/1:2164,329,9|Q1_58|Q2_62|Q3_57|Q4_59|Q5_0|rank_0.21337
>SNP_lower_path_69|P_1:30_T/G|high|nb_pol_1|C1_77|C2_84|C3_94|C4_83|C5_108|G1_1/1:1393,123,115|G2_1/1:1520,134,123|G3_1/1:1761,190,64|G4_1/1:1614,213,18|G5_1/1:2164,329,9|Q1_65|Q2_64|Q3_66|Q4_68|Q5_68|rank_0.21337

>SNP_higher_path_39|P_1:30_T/G|high|nb_pol_1|C1_12|C2_13|C3_13|C4_2|C5_0|G1_1/1:1615,155,98|G2_1/1:1761,169,105|G3_1/1:1645,154,108|G4_1/1:1748,242,12|G5_1/1:1724,263,8|Q1_51|Q2_61|Q3_60|Q4_43|Q5_0|rank_0.19546
>SNP_lower_path_39|P_1:30_T/G|high|nb_pol_1|C1_88|C2_96|C3_90|C4_89|C5_86|G1_1/1:1615,155,98|G2_1/1:1761,169,105|G3_1/1:1645,154,108|G4_1/1:1748,242,12|G5_1/1:1724,263,8|Q1_66|Q2_66|Q3_65|Q4_67|Q5_68|rank_0.19546

Do these provide any clue as to why some are coherent and why some are not? Because when I look at the coverage of the minimal variant it is not very obvious to me.

Thank you!
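
For anyone who wants to compare these headers side by side, here is a minimal Python sketch that splits one header into per-read-set fields. The function name parse_disco_header is hypothetical, and the reading of the fields (C = coverage, Q = quality, G = genotype followed by three likelihood-style values) is an assumption based on the headers quoted above, not on the DiscoSnp++ documentation.

```python
import re

def parse_disco_header(header):
    """Split a DiscoSnp++-style FASTA header into per-read-set fields.

    Field meanings (C = coverage, Q = quality, G = genotype:values)
    are assumed from the example headers, not taken from the tool docs.
    """
    fields = header.lstrip(">").strip().split("|")
    record = {"name": fields[0], "coverage": {}, "genotype": {}, "quality": {}}
    for field in fields[1:]:
        m = re.fullmatch(r"([CGQ])(\d+)_(.+)", field)
        if m is None:
            # Other fields (P_..., high, nb_pol_...) are skipped, except rank.
            if field.startswith("rank_"):
                record["rank"] = float(field[len("rank_"):])
            continue
        kind, idx, value = m.group(1), int(m.group(2)), m.group(3)
        if kind == "C":
            record["coverage"][idx] = int(value)
        elif kind == "Q":
            record["quality"][idx] = int(value)
        else:
            # e.g. "0/1:144,133,1581" -> ("0/1", [144, 133, 1581])
            call, _, values = value.partition(":")
            record["genotype"][idx] = (call, [int(x) for x in values.split(",")])
    return record

# First header from the coherent examples above, copied verbatim.
example = (
    ">SNP_higher_path_20|P_1:30_T/G|high|nb_pol_1|C1_114|C2_97|C3_96|C4_92|C5_88"
    "|G1_0/0:9,347,2284|G2_0/0:9,296,1944|G3_0/0:8,293,1924|G4_0/0:12,251,1808"
    "|G5_0/1:144,133,1581|Q1_71|Q2_70|Q3_70|Q4_68|Q5_66|rank_0.32756"
)
record = parse_disco_header(example)
print(record["coverage"])
print(record["genotype"][5])
```

Running it on both the higher and lower path of a variant makes it easier to see, per read set, which allele carries the reads.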


(A new post would be best for this question.. but it's alright!)

I'll let Pierre confirm the following guess: coherent/incoherent is indeed based on coverage, but one should distinguish k-mer coverage from k-read coverage (as defined in the DiscoSnp paper). A SNP can be k-read-incoherent, meaning it might be an assembly artefact that the reads cannot explain, even though the path as a whole has a sufficient mean coverage of individual k-mers.

7.2 years ago

The read coherency is computed as follows:

The reads of each read set are mapped back onto each prediction (allowing, by default, one mismatch anywhere except at the variant position(s)).

For each read set, a predicted sequence is said to be "k-read-coherent" if all k-mers of this sequence are covered by at least c reads (c being a main parameter, 4 by default).

Variants for which the two sequences are not k-read-coherent for all read sets are declared "uncoherent".

The read coverage reported in the outputs (.fa and .vcf) is the total number of mapped reads. It is thus possible to have a high read coverage for an uncoherent variant: this means that many reads mapped to some parts of the sequence, while other regions of the sequence are not covered.
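
To make the last point concrete, here is a small Python sketch of the check described above. It is a simplified illustration and not the DiscoSnp++ implementation: it assumes a k-mer counts as covered by a read only when the read fully contains it, it ignores the mismatch rule, and the function names, the sequence length of 61 and k = 31 are illustrative choices. Only the threshold c = 4 comes from the description above.

```python
def kmer_read_counts(seq_len, k, read_spans):
    """Count, for each k-mer start position, the reads that fully contain it.

    read_spans holds (start, end) half-open intervals of mapped reads
    on the predicted sequence.
    """
    counts = [0] * (seq_len - k + 1)
    for start, end in read_spans:
        first = max(start, 0)
        last = min(end - k, seq_len - k)  # last k-mer start inside the read
        for pos in range(first, last + 1):
            counts[pos] += 1
    return counts

def is_k_read_coherent(seq_len, k, read_spans, c=4):
    """True if every k-mer of the sequence is covered by at least c reads."""
    return all(n >= c for n in kmer_read_counts(seq_len, k, read_spans))

# Ten reads that all map to the left part of a 61 bp prediction:
# the total read count is 10, yet the right-hand k-mers are uncovered.
left_only = [(0, 40)] * 10
# Ten reads split between both ends, overlapping in the middle.
both_sides = [(0, 46)] * 5 + [(15, 61)] * 5

print(len(left_only), is_k_read_coherent(61, 31, left_only))
print(len(both_sides), is_k_read_coherent(61, 31, both_sides))
```

Both toy read sets contain the same number of mapped reads, but only the second one covers every k-mer, which is the situation where a high reported coverage and an "uncoherent" label can coexist.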

Best, Pierre.


Thank you for your answer, it clears some things up. I have started a new thread with follow-up questions: "DiscoSnp++, question about how SNPs called".