Question: Diploid genome gene annotation
18 months ago by
mss30 wrote:

Hi guys,

I am working with several diploid fungal genomes, but I am confused about how to deal with the duplicated genes. I started by assembling the diploid genomes using dipSPAdes, then did gene finding with MAKER. The reported number of genes is ~20,000, which is about twice as many as are reported for haploid genomes of the same genus. So my question is: is it okay to go forward with functional gene annotation, or do I need to somehow get rid of the duplicate genes in the genome? I have been unsure whether it is appropriate to publish the diploid version of the genome, or whether it is necessary to report the haploid version. I hope this makes sense!

Thanks in advance, Morgan

annotation diploid gene genome • 385 views
written 18 months ago by mss30

Is the species you're working on (highly) heterozygous? If not, then dipSPAdes is unfortunately not the most appropriate choice of assembler.

written 18 months ago by lieven.sterck8.6k

I do not believe so (though I am not 100% sure), because they exhibit both haploid and diploid cells; I observed this myself under the microscope after staining with DAPI. Also, the diploid assembly was of much higher quality than the regular SPAdes assembly in terms of number of contigs, size, and N50. BUSCO confirmed that approximately 70% of the single-copy orthologs were duplicated. Can you recommend an assembler so that I could compare them?
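
For anyone wanting to check the duplication figure themselves, here is a minimal Python sketch that tallies BUSCO statuses from a full-table output. It assumes a tab-separated table with the BUSCO id in the first column and the status (`Complete`, `Duplicated`, `Fragmented`, `Missing`) in the second, with `#` comment lines; the function name `busco_status_fractions` is made up for illustration, and the layout should be checked against your BUSCO version.

```python
from collections import Counter


def busco_status_fractions(table_lines):
    """Return the fraction of BUSCOs in each status category.

    Assumes tab-separated rows: BUSCO id, status, then optional columns.
    A duplicated BUSCO appears on several rows, so count each id once.
    """
    status_by_id = {}
    for line in table_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # malformed row, ignore
        status_by_id[fields[0]] = fields[1]
    total = len(status_by_id)
    if total == 0:
        return {}
    counts = Counter(status_by_id.values())
    return {status: n / total for status, n in counts.items()}


# Toy input (invented ids and coordinates, not real BUSCO results).
toy_table = [
    "# Busco id\tStatus\tSequence\n",
    "A\tDuplicated\tcontig_1\n",
    "A\tDuplicated\tcontig_7\n",
    "B\tComplete\tcontig_2\n",
    "C\tMissing\n",
]
fractions = busco_status_fractions(toy_table)
```

With real data you would pass the open file handle of the table instead of `toy_table`; a `Duplicated` fraction near 0.7 would match what is described above.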

modified 18 months ago • written 18 months ago by mss30

It's true that it is not common to find diploid genome annotations in databases. I don't think the EBI or NCBI submission pipelines distinguish between a haploid and a diploid annotation, but you should contact them to find out the best way to submit your data. I'm looking forward to hearing more about it. One problem I can see is that the two alleles of a locus will have two different gene identifiers in your MAKER annotation. That means you will have two locus identifiers for a single locus, which would be a bit weird.

modified 18 months ago • written 18 months ago by Juke344.7k

Thanks for the advice, I'll contact the databases and can post an update here. I should have thought about this sooner before proceeding with assembly and annotation :/ I just wonder if there is a way to "fix" this with the gene predictions instead of having to start from the beginning with the assemblies.

written 18 months ago by mss30

If you know which contigs are part of which assembly (primary or secondary), then it's not a problem to filter your annotation.
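
As a rough illustration of that filtering step, here is a minimal Python sketch that keeps only the GFF3 features located on a chosen set of contigs. It assumes you already have the list of primary contig names; the function name `filter_gff3_by_contigs` and the contig and gene ids in the example are hypothetical, and this is not a feature of MAKER or dipSPAdes.

```python
def filter_gff3_by_contigs(gff_lines, keep_contigs):
    """Yield GFF3 lines whose seqid (column 1) is in keep_contigs.

    Directive/comment lines are kept, except ##sequence-region lines
    for discarded contigs. Output stops at an embedded ##FASTA section,
    so any sequences appended to the GFF3 are not carried over.
    """
    for line in gff_lines:
        if line.startswith("##FASTA"):
            break
        if line.startswith("#"):
            parts = line.split()
            if (len(parts) > 1 and parts[0] == "##sequence-region"
                    and parts[1] not in keep_contigs):
                continue
            yield line
            continue
        if not line.strip():
            continue  # drop blank lines
        seqid = line.split("\t", 1)[0]
        if seqid in keep_contigs:
            yield line


# Toy annotation (invented contigs and genes).
example_gff = [
    "##gff-version 3\n",
    "contig_1\tmaker\tgene\t100\t900\t.\t+\t.\tID=gene1\n",
    "contig_2\tmaker\tgene\t50\t700\t.\t-\t.\tID=gene2\n",
]
primary_contigs = {"contig_1"}  # assumed to be known from the assembler
kept = list(filter_gff3_by_contigs(example_gff, primary_contigs))
```

Because child features (mRNA, exon, CDS) sit on the same contig as their parent gene, filtering on the first column keeps or drops each gene model as a whole.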

written 18 months ago by Juke344.7k

That is good to hear. Do you recommend any program that can do this? Would it basically be some sort of alignment program that can detect the duplicated genes?

written 18 months ago by mss30

Usually it is your assembler that gives you the phased genome, but I don't know what the dipSPAdes outputs look like.

written 18 months ago by Juke344.7k


I decided to go forward with the haploid genomes instead, so I used the purge_haplotigs pipeline. The genomes were greatly reduced in both size and number of annotations.

written 6 months ago by mss30