I am following on from a previous email thread, copied in below. Regarding the point
"However, with our Giardia data, we are finding that many bubbles have zero coverage in at least one sample."
We are finding a surprisingly high number of bubbles with zero coverage in at least one sample. For example, in a group of 39 samples, where according to MUMmer the average similarity between samples is 93.8%, we are finding that 35% of bubbles contain samples with zero depth. And even after removing the three most divergent samples, 20% of bubbles still contain at least one sample with zero depth.
This is surprising for us, and we were wondering whether there might be an explanation, or perhaps a setting in discoSNP we can tweak to reduce this.
Our ultimate aim is to produce phylogenetic trees of the data, and with this many bubbles containing samples with 0 depth (which are converted to Ns), we are losing a lot of information for the tree.
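To make the numbers above concrete, here is a minimal Python sketch of how the "zero depth in at least one sample" fraction and the conversion to N could be computed. It is an illustration only, not part of discoSNP: the function names, the `depths` dictionary layout and the toy values are all assumptions, and real depths would first have to be parsed from the discoSNP output.

```python
from typing import Dict, List


def zero_depth_fraction(depths: Dict[str, List[int]]) -> float:
    """Fraction of bubbles in which at least one sample has depth 0."""
    if not depths:
        return 0.0
    flagged = sum(
        1 for per_sample in depths.values() if any(d == 0 for d in per_sample)
    )
    return flagged / len(depths)


def to_alignment_column(alleles: List[str], per_sample_depth: List[int]) -> str:
    """One alignment column: the called allele per sample, or N where depth is 0."""
    return "".join(a if d > 0 else "N" for a, d in zip(alleles, per_sample_depth))


# Toy data (hypothetical): bubble id -> read depth in each of three samples.
depths = {
    "bubble_1": [12, 9, 15],
    "bubble_2": [7, 0, 11],
    "bubble_3": [0, 0, 4],
    "bubble_4": [5, 6, 8],
}

print(zero_depth_fraction(depths))                        # 0.5
print(to_alignment_column(["A", "C", "A"], [7, 0, 11]))   # ANA
```

Here two of the four toy bubbles contain a zero-depth sample, so the fraction is 0.5, and the second sample of the example column becomes an N.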
Thanks in advance,
The data we are analysing are Giardia genomes. This adds an extra level of complexity because Giardia has two nuclei, thus is polyploid. There are therefore some positions where a bubble might have more than 2 options, e.g. there might be A, C or G in the same position. We chose to use discoSNP because it allows complex bubbles, which we thought might overcome this.
In this situation, disco detects the A/C, the A/G and the C/G pairs of "alleles". It is not aware that they belong to the same locus.
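As a small illustration of why one tri-allelic position shows up as three separate bubbles, the pairs can be enumerated in Python. This is only a sketch of the combinatorics described above, not of how discoSNP works internally; the variable names are made up for the example.

```python
from itertools import combinations

# Three alleles observed at one (hypothetical) tri-allelic position.
alleles = ["A", "C", "G"]

# Each unordered pair of alleles is reported as its own bi-allelic bubble.
couples = ["/".join(pair) for pair in combinations(alleles, 2)]

print(couples)  # ['A/C', 'A/G', 'C/G']
```

A sample carrying only A and C would then have no reads supporting either path of the C/G pair only if it lacks C as well, but it can easily have zero or very low depth in one of the three bubbles.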
However, with our Giardia data, we are finding that many bubbles have zero coverage in at least one sample.
Is that unexpected?
This is making post-processing (e.g. phylogenetic tree drawing) rather difficult, because once we remove bubbles with 0 coverage in a sample (which we convert to Ns) we lose a lot of information, and therefore cannot resolve many of the samples from one another.
I'm not sure I understand correctly: do you expect every SNP variant to occur in all samples, with varying coverage but always greater than zero?
We were wondering if there is any way we could reduce the number of very low coverage bubbles? Or perhaps group bubbles that represent the same locus but different polyploid variants? We thought this could be one explanation for the low-coverage bubbles, if the same polyploid variant is represented in more than one bubble.
We are working on exactly this topic to improve discoSnp. Our current solution is to assemble the discoSnp outputs and map the predictions back to the resulting assemblies, which makes it possible to 1/ reduce redundancies and 2/ detect tri-allelic loci.
Secondly, we were wondering whether you have had any success drawing phylogenetic trees from discoSNP output. And if so, what is your pipeline for doing this?
Honestly, no. This is one of our future goals, and it is why we are working on the detection of close SNPs and indels. But we do not yet have any feedback or experience of our own in this direction.
By the way, for future remarks or requests, could you please use the biostar forum with the "discoSnp" tag? This keeps all bugs and requests in a single place and informs the whole discoSnp team at once.
Have a nice day,
Thanks in advance,
From: Pierre Peterlongo <firstname.lastname@example.org>
Date: Monday, May 4, 2015 at 1:18 PM
Cc: "Hsiao, William" <William.Hsiao@bccdc.ca>
Subject: Re: discoSNP cpu usage
A new version of discoSnp++ should meet your needs regarding limiting the number of threads. https://colibread.inria.fr/software/discosnp/
Let me know in case of problems.
On 27/04/15 18:44, Dooley, Damion wrote:
Thanks a lot! We have a few DiscoSnp fans now so this will help give them all elbow room on our multi-core server. I'll check to see if any need the fast-lane or if they can wait - probably the latter.
I'd sent the same inquiry via the online discussion forum but you can disregard that now. Glad the vay-cay was refreshing.
Hsiao lab, BC Public Health Microbiology & Reference Laboratory, BC Centre for Disease Control
655 West 12th Avenue, Vancouver, British Columbia, V5Z 4R4 Canada
From: Pierre Peterlongo [email@example.com]
Sent: Monday, April 27, 2015 9:15 AM
To: Miller, Ruth
Cc: Dooley, Damion
Subject: Re: discoSNP cpu usage
Sorry for this long delay, I was on (great) holidays :)
I prepared a quick and dirty solution for you. You may replace the run_discoSnp++.sh and run_VCF_creator.sh files with the ones attached.
One of the first lines of the run_discoSnp++.sh file is
You may change this value in the file (no option yet). The three main tools (dbgh5, kissnp and kissreads) will limit the number of cores to this value.
These changes, with a proper option, will appear in an upcoming release.
If you are not in a rush, maybe you can wait a few more days. We detected a small bug in the VCF construction that should be fixed soon.
On 24/04/15 02:05, Miller, Ruth wrote:
I have been using discoSNP2.1.4 to call SNPs in my data, and am very pleased with the results.
I have been trying multiple settings, and hence running multiple instances of discoSNP at once. We are finding that discoSNP is running many of the tools in parallel, which is significantly slowing down our server.
I was wondering whether there is any way to limit the number of CPUs discoSNP uses?