Hi everyone,
I am currently trying to determine the ploidy of our organism (protist). First, I introduce the approach I used in order to provide elements as much as possible.
1 * From a previous genes annotation, I have selected genes that are considered as unique, in other word I aimed to discard paralogs. To do it, I did a blastp of the proteic sequences against itself. Then, I filtered the output selecting sequences which only have a hit with themselves as significant hit. I confess that I'd miss some TRUE negativ but we do not want to do an extensive analysis of ploidy. I mean, I assume that with few unique gene and by discarding paralogs which could add bias, I would be able to easily determine ploidy.
2 * So, I have finally selected 700 genes which have known annotation.
3 * The next step was to map the reads to the reference genome. I used Bowtie2 with end2end strategy.
4 * Then, I called variant with Freebayes and GATK(UnifiedGenotyper) with the ploidy option set at 10.
5 * From both output, I have only checked the variant which occur in the exons.
6 * Finally, I observed the distribution of the allele frequency for the common SNP called by both tools.
In case of diploidy, I expect to observe a distribution with a maximum at 0.5 and a bell distribution.
In case of triploidy, I expect to observe a distribution with a peak at 0.3 and another one at 0.6. And so on.
But, here I have a distribution with a bell curve, around 0.3/0.4 . Do you have any suggestion about it?
Moreover, I also notice that on all the exons of the unique genes 85/90% of them do not carry any SNP.
thank you for your time. Ask me if I should precise more some information.
Indeed, I can use the k-mer approach, too. THank, I am on the way to do it.
So,
in the peaks.txt file, I got this :
k 31 unique_kmers 312426596
main_peak 16
genome_size 231467812
haploid_genome_size 231467812
fold_coverage 16
haploid_fold_coverage 16
ploidy 1
percent_repeat 25.689
start center stop max volume
6 | 16 | 40 | 14044386 | 172006896
166 | 168 | 296 | 12145 | 1033976
296 | 299 | 333 | 5161 | 174263
333 | 338 | 366 | 4396 | 137310
366 | 375 | 555 | 4123 | 534877
555 | 566 | 589 | 2034 | 63591
589 | 591 | 618 | 1852 | 48548
618 | 626 | 653 | 1652 | 51675
653 | 655 | 672 | 1386 | 24828
672 | 682 | 1705 | 1250 | 524076
Do you have any documentation about this tabular output? I wonder how can I assess these numbers.
I also plotted x = log(depth) against Y = Raw count from the khist.txt file.
I observed only one prominent peak but I am not confident that I did the right plot. Is all the documentation is there http://seqanswers.com/forums/showthread.php?t=41057 ?
Actually this is a better resource:
http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbnorm-guide/
...but it only mentions the peaks file briefly. The best thing to do is plot the 3 columns of khist.txt in a log-log scatter plot and post an image of it; normally the ploidy is obvious from visual inspection. Peaks.txt gives the program's best guess at the ploidy, and notes the most obvious peaks, but if your primary peak is actually at 16x as it indicates, the coverage is pretty low and the error kmers start interfering with the genomic kmers to make peak-calling more difficult.
http://imageshack.com/a/img923/5199/bPNYrJ.png
This is the plot x = log(depth) against Y = Raw count
http://imageshack.com/a/img924/8510/0oKuqE.png
This one is log-log plot
It looks to me like the peaks.txt output is correct, and it's a haploid with primary peak at ~16x coverage. There's a tiny hump at what looks like twice that which is presumably the two-copy peak, but there's no evidence of a heterozygous peak at 8x.
Oh ok thank for the answer. I am still confused about the output. I mean it looks like a black box for me. ~16x, does it correspond to the maximum of the peak?
Is there any example of output for diploid organism? How would be the shape of the plot then? In the case of diploidy, would there be two peaks, one at half of the coverage of the first one? Or In the case of tetraploidy, would there be four peaks?
In order to evaluate the result, I was thinking to use Jellyfish to compare the results. Do you think it meaningful?
Jellyfish will generate the same plot, but feel free to use it (be sure to use the -C flag). For a diploid you get two peaks, one at half the depth of the other. For a tetraploid I expect to see 4 peaks, though in practice I've only seen 3 peaks (one at 0.25x coverage, one at 0.5x coverage, and one at 1x coverage). Sorry, I don't have any pictures of them right now.
And yes, 16x corresponds to the top of the main peak, which is roughly your average coverage of the genome (technically it's the mode of the coverage).
You were right. I also observed one peak around 16x . I wonder now, as the coverage is a bit low if I can add reads from another sequencing. Previous one is Mi-seq and the potential additional one is Hi-seq. Is there a way to do it?
For kmer counting, you can just concatenate all the fastqs together with cat and then run on the combined file.