Question

Fungal Annotation Comparison

1

Entering edit mode

4 months ago

SomeOne ▴ 250

Hello,

I kind of stuck in a situation. Where i need to compare two annotations of a single Fungal genome.

Genome 1

Downloaded .fa and annotation from NCBI (Initially published in 2014)
Genome sequenced using Illumina and assembled using ALLPATHS-LG keeping min contig size > 1000bp resulting in total 168 supercontigs
for the annotation authors used RNASeq data which contains
- 15 strand-specific and paired read datasets
- 18 non-stranded and paired RNA-Seq datasets
- 4 EST datasets
In short authors used RNASeq data sets of multiple fungal genus.
for annotation authors Implied following process
- Transcriptome assembly using Trinity -> Transcriptome alignment to genome using PASA -> gene models using predictions from GeneMarkES, GeneId, Augustus, GlimmerHMM, and SNAP
final annotation on NCBI contains ~17642 genes of which ~17280 are protein coding

Now I got the exact same strain which we resquenced

Genome-2

Sequenced using Nanopore R10 and Illumina ShortRead
Assembled using Flye and polished with Shortread data. keeping min contig size > 5000bp resulting in 68 scaffolds (11 chromosomes and remaing unplaced scaffolds)
Repeats annotated using RepeatModeler/Masker
For annotation we used RNASeq data specifically from this strain with inhouse experiments
- Strain was used to do infection assays and RNA was extracted/sequenced (4-replicates)
- Similarly, Non infection strain control (4-replicates)
- FLask grown strain RNA (4-replicates)
I used all this data and followed the Funannotate pipeline fo annotate the genome. Which used similar tools in the annotation process.
My final annotation has ~14170 genes and ~14968 proteins predicted

As you can see there is a difference of around 3000 genes/proteins in my annotation and NCBI annotation.

My Question

What can be sourse of this difference ? My guess is "its because in NCBI genome, authors used a wide verity of RNAseq data but i only used a small strain specific data"
How can I compare the two annotations, to see what is similar and what is defferent, Any tool/pipiline or process that can help in this regrad.
Can this be because of the Contigs cutoff size ? as far as i know the average gene size is around ~1838.69 bp in my annotation and my supervisor also mentioed that there wont be any genes under 5000 bp.

I am stuck on this process and cant seem to find any answer. Your input will be a great help.

Regards

Annotation RNAseq Fungus genome_assembly funannotate • 2.0k views

ADD COMMENT • link updated 3 months ago by SanjidaR_NCBI ▴ 70 • written 4 months ago by SomeOne ▴ 250

score 3 · Answer 1 · 2025-06-03

I can think of three possible explanations:
a) The first annotation has some false positive predictions (which means some non-coding regions were incorrectly predicted as genes).
b) The first annotation has haplotypic duplications (which means some alleles were assembled separately instead of being collapsed into a single sequence).
c) The second annotation has some false negative gene predictions (which means there were unnoticed genes)

The main thing I would do is analyzing proteins from both annotations with BUSCO. The annotation that has higher BUSCO completeness ("C") and less orthogroups with duplicate genes ("D") is better.

I don't think that the contig length cutoff in the second genome assembly affected the annotation significantly, because based on the low number of scaffolds I suppose that the assembly is quite good, and, thus, all or almost all genes are in long contigs.

score 2 · Answer 2 · 2025-06-03

2

Entering edit mode

4 months ago

GenoMax 154k

This question is difficult to answer satisfactorily. But I will take a stab.

While it may be the same strain, technology keeps moving ahead and it is likely that the assembly you ended up with is better (mainly because you had long reads in mix). (We will keep the "strain" discussion aside since it is possible that the strain used for NCBI assembly may be significantly different than what you have in your lab. Whether that accounts for a difference for 3000 genes ... that seems a little excessive so more than likely that is due to gene prediction/annotation).

User submitted assemblies undergo the NCBI annotation process at some point and after manual curation they will end up getting a RefSeq version of the genome. You don't say if the annotation you used from NCBI for comparison was RefSeq version. If not, you should definitely download RefSeq version for comparison of annotation.

While NCBI's Eukaryotic annotation pipeline is available for download, it currently says that fungi are not supported. You could instead use the Request Annotation tab at the top of https://www.ncbi.nlm.nih.gov/refseq/annotation_euk/process/ page and see if NCBI would be willing to run your assembly through their pipeline. That may be one way of getting an annotation that can be more directly compared. Other would be for you to annotate the NCBI assembly using funannotate.

ADD COMMENT • link 4 months ago by GenoMax 154k

0

Entering edit mode

Hi GenoMax

Thank you for detailed reply.

I understand all the points you have mentioned, those can be a potential source of this difference.

Right now my main concers is how can i compare them as 1-vs-1 of annotations so i can identify which are the extra genes in NCBI-annotation (Ref-Seq).

As both annotations have different Gene-IDs/Protein-IDs

ADD REPLY • link 4 months ago by SomeOne ▴ 250

0

Entering edit mode

You could consider to run some BLAST analyses between the two annotation results. All vs all on protein and CDS level for instance. That way you'll relatively easily identify similarities and differences between the two sets.

ADD REPLY • link 4 months ago by lieven.sterck 16k

0

Entering edit mode

BLAST is a standard option but mmseq2 may be more efficient for this task: https://github.com/soedinglab/MMseqs2

In any case there will be some manual inspection needed to check on the results.

ADD REPLY • link 4 months ago by GenoMax 154k

score 1 · Answer 3 · 2025-06-03

There are many reasons for this discrepancy as others have said

consider that different technologies are behind the assemblies. Long vs short reads. Long reads lead to less fragmented assemblies and annotations and are more accepted these days (standard).
try reannotating both assemblies with helixer (they have a web-service too) to enable apples vs apples comparisons
BUSCO is a good idea as @shelkmike said
Use a tool like proteinortho to do the blasts (mentioned by lieven.sterck and get a comparative summary table out to compare the common or singleton genes

score 1 · Answer 4 · 2025-07-30

I think it would be helpful to have a sequence alignment of the two assemblies, so you can see if the differences in annotation are reflected in differences in sequence. Conventional BLAST is probably not going to be able to handle an entire chromosome or assembly, so you'll need something like minimap2 or lastz.

If you submit the second assembly with your annotation to GenBank, NCBI may agree to create an assembly alignment for you using their inhouse pipeline. They have a visualization tool, CGV, where you would be able to compare gene alignment and annotation between two assemblies.

See here: https://www.ncbi.nlm.nih.gov/cgv/browse/GCF_000149385.1/GCF_000091045.1/44095/283643#NC_009181.1:623768-645538/NC_006687.1:642109-663879/size=10000

They can also provide you with the alignment file for your own analysis or visualization.