Hello,
I kind of stuck in a situation. Where i need to compare two annotations of a single Fungal genome.
Genome 1
- Downloaded .fa and annotation from NCBI (Initially published in 2014)
- Genome sequenced using Illumina and assembled using
ALLPATHS-LG
keeping min contig size > 1000bp resulting in total 168 supercontigs - for the annotation authors used RNASeq data which contains
- 15 strand-specific and paired read datasets
- 18 non-stranded and paired RNA-Seq datasets
- 4 EST datasets
- In short authors used RNASeq data sets of multiple fungal genus.
- for annotation authors Implied following process
- Transcriptome assembly using
Trinity
-> Transcriptome alignment to genome usingPASA
-> gene models using predictions fromGeneMarkES, GeneId, Augustus, GlimmerHMM, and SNAP
- Transcriptome assembly using
- final annotation on NCBI contains ~17642 genes of which ~17280 are protein coding
Now I got the exact same strain which we resquenced
Genome-2
- Sequenced using Nanopore R10 and Illumina ShortRead
- Assembled using
Flye
and polished with Shortread data. keeping min contig size > 5000bp resulting in 68 scaffolds (11 chromosomes and remaing unplaced scaffolds) - Repeats annotated using
RepeatModeler/Masker
- For annotation we used RNASeq data specifically from this strain with inhouse experiments
- Strain was used to do infection assays and RNA was extracted/sequenced (4-replicates)
- Similarly, Non infection strain control (4-replicates)
- FLask grown strain RNA (4-replicates)
- I used all this data and followed the
Funannotate
pipeline fo annotate the genome. Which used similar tools in the annotation process. - My final annotation has ~14170 genes and ~14968 proteins predicted
As you can see there is a difference of around 3000 genes/proteins in my annotation and NCBI annotation.
My Question
- What can be sourse of this difference ? My guess is "its because in NCBI genome, authors used a wide verity of RNAseq data but i only used a small strain specific data"
- How can I compare the two annotations, to see what is similar and what is defferent, Any tool/pipiline or process that can help in this regrad.
- Can this be because of the Contigs cutoff size ? as far as i know the average gene size is around ~1838.69 bp in my annotation and my supervisor also mentioed that there wont be any genes under 5000 bp.
I am stuck on this process and cant seem to find any answer. Your input will be a great help.
Regards
Hi GenoMax
Thank you for detailed reply.
I understand all the points you have mentioned, those can be a potential source of this difference.
Right now my main concers is how can i compare them as 1-vs-1 of annotations so i can identify which are the extra genes in NCBI-annotation (Ref-Seq).
As both annotations have different Gene-IDs/Protein-IDs
You could consider to run some BLAST analyses between the two annotation results. All vs all on protein and CDS level for instance. That way you'll relatively easily identify similarities and differences between the two sets.
BLAST is a standard option but
mmseq2
may be more efficient for this task: https://github.com/soedinglab/MMseqs2In any case there will be some manual inspection needed to check on the results.