Question

discoSNP - Many bubbles with 0 coverage

8.9 years ago

ruthmiller84 • 0

Hi,

I am following up on a previous email thread, copied below, regarding this point:

However, with our Giardia data, we are finding that many bubbles have zero coverage in at least one sample.

We are finding a surprisingly high number of bubbles with zero coverage in at least one sample. For example, in a group of 39 samples, where according to MUMmer the average similarity between samples is 93.8%, we are finding that 35% of bubbles contain samples with zero depth. And even after removing the three most divergent samples, 20% of bubbles still contain at least one sample with zero depth.
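For anyone wanting to reproduce this kind of figure, here is a minimal sketch of how the fraction could be computed. It assumes the bubbles are in a standard tab-separated VCF whose FORMAT column contains a `DP` key, and it treats a missing depth (`.`) as zero. The function names are made up for illustration and are not part of discoSNP.

```python
def has_zero_depth_sample(record_line):
    """Return True if any sample column of a VCF record has DP of 0 or missing.

    Assumes FORMAT (column 9) lists a DP key; raises ValueError otherwise.
    """
    fields = record_line.rstrip("\n").split("\t")
    dp_index = fields[8].split(":").index("DP")
    for sample in fields[9:]:
        values = sample.split(":")
        depth = values[dp_index] if dp_index < len(values) else "."
        if depth in (".", "0"):
            return True
    return False


def zero_coverage_fraction(vcf_lines):
    """Fraction of records with at least one zero-depth sample."""
    total = 0
    with_zero = 0
    for line in vcf_lines:
        # Skip meta lines, the column header, and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        total += 1
        if has_zero_depth_sample(line):
            with_zero += 1
    return with_zero / total if total else 0.0
```

It can be called on an open file handle, e.g. `zero_coverage_fraction(open("my_run.vcf"))`, where the file name is a placeholder.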

This is surprising for us, and we were wondering whether there might be an explanation, or perhaps a setting in discoSNP we can tweak to reduce this.

Our ultimate aim is to produce phylogenetic trees of the data, and with this many bubbles containing samples with 0 depth (which is converted to an N) we are losing a lot of information for the tree.

Thanks in advance,
Ruth

Hi,

The data we are analysing are Giardia genomes. This adds an extra level of complexity because Giardia has two nuclei, thus is polyploid. There are therefore some positions where a bubble might have more than 2 options, e.g. there might be A, C or G in the same position. We chose to use discoSNP because it allows complex bubbles, which we thought might overcome this.

In this situation disco detects the A/C, the A/G and the C/G couple of "alleles". It is not aware that this is the same locus.

However, with our Giardia data, we are finding that many bubbles have zero coverage in at least one sample.

Is that unexpected?

This is making post-processing (e.g. phylogenetic tree drawing) rather difficult, because once you remove bubbles with 0 coverage in a sample (which we convert to Ns) we lose a lot of information, and therefore cannot resolve between many of the samples.

I'm not sure I understand correctly: do you expect every SNP variant to occur in all samples, with varying coverage but always greater than zero?

We were wondering if there is any way we could reduce the number of very low coverage bubbles? Or perhaps group bubbles that are the same, but polyploid variants, as we thought this could be one explanation for the low coverage bubble, if the same polyploid variant is represented in more than one bubble.

We are working on exactly this topic to improve discoSnp. Our current solution is to assemble the discoSnp outputs and then map the predictions back to the resulting assemblies, which makes it possible to 1/ reduce redundancies and 2/ detect tri-allelic loci.

Secondly, we were wondering whether you have had any success drawing phylogenetic trees from discoSNP output. And if so, what is your pipeline for doing this?

Honestly, no. This is one of our future goals, and it is why we are working on the detection of close SNPs and indels. But we do not yet have any feedback or experiments of our own in this direction.

By the way, for future remarks and requests, could you please use the biostar forum with the "discoSnp" tag? This keeps all bugs and requests in a single place and informs the whole discoSnp team at once.

Have a nice day,
Pierre.

Thanks in advance,
Ruth

From: Pierre Peterlongo pierre.peterlongo@inria.fr

Date: Monday, May 4, 2015 at 1:18 PM

To: "Dooley, Damion" Damion.Dooley@bccdc.ca, Ruth Miller ruth.miller@bccdc.ca

Cc: "Hsiao, William" William.Hsiao@bccdc.ca

Subject: Re: discoSNP cpu usage

Hi,

A new version of discoSnp++ should meet your needs about the thread number limitation. https://colibread.inria.fr/software/discosnp/

Let me know in case of problems.

Best,
Pierre

On 27/04/15 18:44, Dooley, Damion wrote:

Thanks a lot! We have a few DiscoSnp fans now so this will help give them all elbow room on our multi-core server. I'll check to see if any need the fast-lane or if they can wait - probably the latter.

I'd sent the same inquiry via the online discussion forum but you can disregard that now. Glad the vay-cay was refreshing.

Regards,
Damion

Hsiao lab, BC Public Health Microbiology & Reference Laboratory, BC Centre for Disease Control
655 West 12th Avenue, Vancouver, British Columbia, V5Z 4R4 Canada

From: Pierre Peterlongo [pierre.peterlongo@inria.fr]

Sent: Monday, April 27, 2015 9:15 AM

To: Miller, Ruth

Cc: Dooley, Damion

Subject: Re: discoSNP cpu usage

Hi,

Sorry for this long delay, I was on (great) holidays :)

I prepared a quick and dirty solution for you. You may replace the run_discoSnp++.sh and run_VCF_creator.sh files with the ones attached.

One of the first lines of the run_discoSnp++.sh file is

nb_cores=1

You may change this value in the file (no option yet). The three main tools (dbgh5, kissnp and kissreads) will limit the number of cores to this value.

These changes will appear as a proper option in an upcoming release.

If you are not in a rush, maybe you can wait a few more days. We detected a small bug in the VCF construction that should be fixed soon.

Best,
Pierre

On 24/04/15 02:05, Miller, Ruth wrote:

Hi,

I have been using discoSNP2.1.4 to call SNPs in my data, and am very pleased with the results.

I have been trying multiple settings, and hence running multiple instances of discoSNP at once. We are finding that discoSNP is running many of the tools in parallel, which is significantly slowing down our server.

I was wondering whether there is any way to limit the number of CPUs discoSNP uses?

Thanks,
Ruth


Ram · Answer 1 · 2015-06-12

Hi Ruth,

Thanks for the message. It is difficult to give an answer without knowing the datasets well.

However, since discoSnp focuses on differences between read sets (these are the best-ranked predictions), it is natural that some, or even many, of those predictions show large or extreme coverage differences between read sets.

The only technical bias I can see is a lack of sensitivity when remapping the reads (the kissreads phase), which could miss some alignments if the predictions and the reads differ too much. Maybe you could try increasing the authorized distance (-d option of the run_discoSnp++.sh script) and check whether the number of predictions not covered in a read set decreases.
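To make that check concrete, here is a small sketch that counts, for each read set, how many predictions have no coverage. It assumes the output is a standard VCF with a `#CHROM` header line naming the samples and a `DP` key in the FORMAT column, and it counts a missing depth as zero. The function name is hypothetical, not something shipped with discoSnp.

```python
def uncovered_per_sample(vcf_lines):
    """Count, per sample, the records whose DP is 0 or missing.

    Assumes a '#CHROM' header line precedes the records and that FORMAT has DP.
    """
    names = []
    counts = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##") or not line.strip():
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names start at the 10th column of the header line.
            names = fields[9:]
            counts = {name: 0 for name in names}
            continue
        dp_index = fields[8].split(":").index("DP")
        for name, sample in zip(names, fields[9:]):
            values = sample.split(":")
            raw = values[dp_index] if dp_index < len(values) else "."
            if not raw.isdigit() or int(raw) == 0:
                counts[name] += 1
    return counts
```

Running it on the VCFs from two runs that differ only in their -d value and comparing the two dictionaries would show whether the larger distance recovers coverage, and in which read sets.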

Otherwise, apart from filtering out yourself the predictions that do not span all datasets, I do not see how you could get rid of this "problem".
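Such a filter could look roughly like the sketch below. It assumes a standard VCF with a `DP` key in the FORMAT column and treats a missing depth as zero. The function names and the `min_depth` parameter are illustrative choices, not discoSnp options.

```python
def sample_depths(record_line):
    """Return the DP value of every sample in a VCF record (missing counts as 0)."""
    fields = record_line.rstrip("\n").split("\t")
    dp_index = fields[8].split(":").index("DP")
    depths = []
    for sample in fields[9:]:
        values = sample.split(":")
        raw = values[dp_index] if dp_index < len(values) else "."
        depths.append(int(raw) if raw.isdigit() else 0)
    return depths


def keep_fully_covered(vcf_lines, min_depth=1):
    """Yield header lines unchanged, plus records covered in every sample."""
    for line in vcf_lines:
        if line.startswith("#"):
            yield line
        elif line.strip():
            depths = sample_depths(line)
            if depths and min(depths) >= min_depth:
                yield line
```

Raising `min_depth` trades more missing sites for more reliable calls, so it is worth checking how many predictions survive at each threshold before building a tree.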

I hope this helps.

Pierre