Question

Determine ploidy using SNP-Calling

0

Entering edit mode

7.2 years ago

dilution • 0

Hi everyone,

I am currently trying to determine the ploidy of our organism (protist). First, I introduce the approach I used in order to provide elements as much as possible.

1 * From a previous genes annotation, I have selected genes that are considered as unique, in other word I aimed to discard paralogs. To do it, I did a blastp of the proteic sequences against itself. Then, I filtered the output selecting sequences which only have a hit with themselves as significant hit. I confess that I'd miss some TRUE negativ but we do not want to do an extensive analysis of ploidy. I mean, I assume that with few unique gene and by discarding paralogs which could add bias, I would be able to easily determine ploidy.

2 * So, I have finally selected 700 genes which have known annotation.

3 * The next step was to map the reads to the reference genome. I used Bowtie2 with end2end strategy.

4 * Then, I called variant with Freebayes and GATK(UnifiedGenotyper) with the ploidy option set at 10.

5 * From both output, I have only checked the variant which occur in the exons.

6 * Finally, I observed the distribution of the allele frequency for the common SNP called by both tools.

In case of diploidy, I expect to observe a distribution with a maximum at 0.5 and a bell distribution.

In case of triploidy, I expect to observe a distribution with a peak at 0.3 and another one at 0.6. And so on.

But, here I have a distribution with a bell curve, around 0.3/0.4 . Do you have any suggestion about it?

Moreover, I also notice that on all the exons of the unique genes 85/90% of them do not carry any SNP.

thank you for your time. Ask me if I should precise more some information.

SNP mapping ploidy NGS • 3.2k views

ADD COMMENT • link 7.2 years ago by dilution • 0

score 2 · Answer 1 · 2017-02-08

2

Entering edit mode

7.2 years ago

Brian Bushnell 20k

Normally I try to determine ploidy based on a kmer-frequency histogram. With BBMap, you can do that like this:

khist.sh in=reads.fq khist=khist.txt peaks=peaks.txt

Then you'll usually see 2 prominent peaks for a diploid or one prominent peak for a haploid, when plotted on a log-log scale. Peaks.txt will try to guess the ploidy from the kmer histogram but you will be able to do a better job yourself visually if the peaks are not very obvious. In this case, if you have an extremely low heterozygosity rate, the first peak might be very small.

ADD COMMENT • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

Indeed, I can use the k-mer approach, too. THank, I am on the way to do it.

ADD REPLY • link 7.2 years ago by dilution • 0

0

Entering edit mode

So,

in the peaks.txt file, I got this :

k 31 unique_kmers 312426596

main_peak 16

genome_size 231467812

haploid_genome_size 231467812

fold_coverage 16

haploid_fold_coverage 16

ploidy 1

percent_repeat 25.689

start center stop max volume

6 | 16 | 40 | 14044386 | 172006896

166 | 168 | 296 | 12145 | 1033976

296 | 299 | 333 | 5161 | 174263

333 | 338 | 366 | 4396 | 137310

366 | 375 | 555 | 4123 | 534877

555 | 566 | 589 | 2034 | 63591

589 | 591 | 618 | 1852 | 48548

618 | 626 | 653 | 1652 | 51675

653 | 655 | 672 | 1386 | 24828

672 | 682 | 1705 | 1250 | 524076

Do you have any documentation about this tabular output? I wonder how can I assess these numbers.

I also plotted x = log(depth) against Y = Raw count from the khist.txt file.

I observed only one prominent peak but I am not confident that I did the right plot. Is all the documentation is there http://seqanswers.com/forums/showthread.php?t=41057 ?

ADD REPLY • link 7.2 years ago by dilution • 0

0

Entering edit mode

Actually this is a better resource:

http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbnorm-guide/

...but it only mentions the peaks file briefly. The best thing to do is plot the 3 columns of khist.txt in a log-log scatter plot and post an image of it; normally the ploidy is obvious from visual inspection. Peaks.txt gives the program's best guess at the ploidy, and notes the most obvious peaks, but if your primary peak is actually at 16x as it indicates, the coverage is pretty low and the error kmers start interfering with the genomic kmers to make peak-calling more difficult.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

http://imageshack.com/a/img923/5199/bPNYrJ.png

This is the plot x = log(depth) against Y = Raw count

http://imageshack.com/a/img924/8510/0oKuqE.png

This one is log-log plot

ADD REPLY • link 7.2 years ago by dilution • 0

0

Entering edit mode

It looks to me like the peaks.txt output is correct, and it's a haploid with primary peak at ~16x coverage. There's a tiny hump at what looks like twice that which is presumably the two-copy peak, but there's no evidence of a heterozygous peak at 8x.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

Oh ok thank for the answer. I am still confused about the output. I mean it looks like a black box for me. ~16x, does it correspond to the maximum of the peak?

Is there any example of output for diploid organism? How would be the shape of the plot then? In the case of diploidy, would there be two peaks, one at half of the coverage of the first one? Or In the case of tetraploidy, would there be four peaks?

In order to evaluate the result, I was thinking to use Jellyfish to compare the results. Do you think it meaningful?

ADD REPLY • link 7.2 years ago by dilution • 0

0

Entering edit mode

Jellyfish will generate the same plot, but feel free to use it (be sure to use the -C flag). For a diploid you get two peaks, one at half the depth of the other. For a tetraploid I expect to see 4 peaks, though in practice I've only seen 3 peaks (one at 0.25x coverage, one at 0.5x coverage, and one at 1x coverage). Sorry, I don't have any pictures of them right now.

And yes, 16x corresponds to the top of the main peak, which is roughly your average coverage of the genome (technically it's the mode of the coverage).

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k

0

Entering edit mode

You were right. I also observed one peak around 16x . I wonder now, as the coverage is a bit low if I can add reads from another sequencing. Previous one is Mi-seq and the potential additional one is Hi-seq. Is there a way to do it?

ADD REPLY • link 7.2 years ago by dilution • 0

0

Entering edit mode

For kmer counting, you can just concatenate all the fastqs together with cat and then run on the combined file.

ADD REPLY • link 7.2 years ago by Brian Bushnell 20k