Question: How To Distinguish Between Heterozygous Or Duplicated Alleles
Leszek (IIMCB, Poland) wrote 9.6 years ago:

I am assembling a diploid fungal genome (Illumina PE 100 bp reads, coverage in the range 200-300x). I believe the size of this genome is ~13 Mb, but the assemblies I get are always between 22 and 24 Mb. I have used Velvet and SOAPdenovo with multiple parameter sets. Interestingly, when the scaffolds are aligned against one another within the same assembly, around 1/3 of the genome aligns with 80-90% identity. We even sequenced a library with an additional insert size, but the results are similar.

How can I decide whether these scaffolds are duplicated regions or heterozygous alleles?

Philippe (Barcelona, Spain) wrote 9.6 years ago:


I'm no expert in this field, but it might be worth looking at what has been done to detect CNV (Copy Number Variation) regions, since the problems are somewhat similar. One method, for example, is to look at the coverage across your genome. If you have a median coverage of 200x and some regions have a coverage of 300x, that might indicate that the segment is duplicated. I am sorry that I have no references to share; I just remember hearing this in some talks. There were some statistical methods to discriminate those regions.
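As a rough illustration of the coverage idea, here is a minimal Python sketch. It assumes you have already mapped the reads back to the assembly and computed a mean depth per scaffold. The function name `classify_by_coverage`, the `tolerance` cutoff of 0.25, and the labels are all hypothetical choices for illustration, not an established method. It also assumes the median depth reflects the single-copy part of the genome, which may not hold if most of the assembly consists of separated haplotypes.

```python
from statistics import median


def classify_by_coverage(depths, tolerance=0.25):
    """Label scaffolds by mean read depth relative to the median depth.

    depths: dict mapping scaffold name -> mean read depth (assumed precomputed).
    tolerance: hypothetical cutoff; ratios within 1 +/- tolerance are "typical".
    """
    baseline = median(depths.values())
    labels = {}
    for name, depth in depths.items():
        ratio = depth / baseline
        if ratio >= 1 + tolerance:
            # More reads than expected: possibly a duplication collapsed
            # into a single scaffold.
            labels[name] = "elevated"
        elif ratio <= 1 - tolerance:
            # Fewer reads than expected: possibly one of two alleles that
            # were assembled separately.
            labels[name] = "reduced"
        else:
            labels[name] = "typical"
    return labels


# Toy numbers only, echoing the 200x median / 300x region example.
example = {"s1": 200, "s2": 210, "s3": 300, "s4": 100, "s5": 190}
print(classify_by_coverage(example))
```

With these toy numbers the median is 200x, so `s3` (300x) is flagged as elevated and `s4` (100x) as reduced. On real data the depth distribution is noisy, so a fixed cutoff like this is only a starting point for a closer look.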

I hope this has been at least a bit helpful.

lh3 (United States) wrote 9.6 years ago:

Are you sure the genome size is 13 Mb? I would be more inclined to believe the true size is around 20 Mb. I do not work with fungal genomes, but it seems quite unlikely for two haplotypes from the same strain to have 10-20% divergence. If this is really true, there is almost no way to tell segmental duplications from different alleles.

Your best hope is to sequence an inbred strain. Ploidy has caused quite a lot of problems for assemblies of higher eukaryotic genomes (e.g. Ciona and zebrafish) and should be worse for fungi. This is a long-standing problem in de novo assembly. If there were a simple solution, those smart people would have found it.


That makes sense. On the other hand, I have not seen an assembler do that badly at estimating genome size. I once had Sanger data for a diploid fungal genome, and the size estimate was quite good. Of course, different clades may have completely different stories.

Reply written 9.6 years ago by lh3

We have some close relatives sequenced, but there is very weak similarity at the nucleotide level. All closely related species from that clade have genomes in the range 12-15 Mb, which is why I suspect duplication.

Reply written 9.6 years ago by Leszek