Question

Annotation comparison files on NCBI of gene assembly

7 weeks ago

zirui • 0

Hello everyone. I recently started a project as a beginner bioinformatician. In the current phase, I need to evaluate the quality of the gene annotation of a reference assembly that was newly added to NCBI, and possibly enrich the annotation myself.

I found that Annotation_comparison files exist for some assemblies on RefSeq, in which the annotation of a specific assembly is compared against a previous one. However, there isn't much documentation about the values in certain columns. I can parse out the meaning of some values but not enough to use the data to the extent of my liking.

https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/Annotation_comparison/
https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/048/771/995/GCF_048771995.1_bTaeGut7.mat/Annotation_comparison/

For example, column one (gene category) contains the following possible values, listed with their number of occurrences:

Changed completeness - 73
Changed locus ID - 226
Changed locus type - 1032
Changed substantially - 7807
Current-novel - 3080
Current-other - 750
Current-unmapped
Identical - 1433
Merged - 247 
Other - 292
Previous-novel - 3621
Previous-other - 807
Previous-unmapped - 242
Similar - 50019
Split - 104
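
A tally like the one above can be reproduced with a short script. The following is a minimal sketch, not an official NCBI tool: it assumes the comparison file is tab-delimited, that the category sits in the first column, and that header or comment lines start with `#`. The function names are made up for illustration.

```python
import gzip
from collections import Counter
from typing import Iterable


def count_categories(lines: Iterable[str], column: int = 0) -> Counter:
    """Tally the values found in one tab-separated column."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and lines starting with '#' carry no data.
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if column < len(fields):
            counts[fields[column].strip()] += 1
    return counts


def count_categories_in_file(path: str, column: int = 0) -> Counter:
    """Same tally, reading directly from a gzip-compressed text file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_categories(handle, column)
```

Calling `count_categories_in_file(path)` on a downloaded `*_compare_prev.txt.gz` file and printing `most_common()` on the result gives one count per category, which makes it easy to spot a category whose count is missing or unexpectedly small.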

While I vaguely understand what each of the values means, I have trouble understanding them precisely, as well as the process by which each gene was assigned to its category. May I ask if there exists documentation that explains this further?

Cheers

annotation • 1.0k views

ADD COMMENT • link updated 6 weeks ago by GenoMax 153k • written 7 weeks ago by zirui • 0

score 2 · Accepted Answer · 2025-07-18

7 weeks ago

GenoMax 153k

May I ask if there exists documentation that explains this further?

There is a file that describes the specific categories: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/635/GCF_000001635.27_GRCm39/README.txt

I won't copy the section here since it is long, but look for *_compare_prev.txt.gz under "description of files" in the file above.

I can parse out the meaning of some values but not enough to use the data to the extent of my liking.

What is your specific interest?

ADD COMMENT • link 7 weeks ago by GenoMax 153k

Thank you so much for pointing me to this file. It definitely helped a lot.

My specific interest is tissue-specific RNA-seq. Currently I am trying to evaluate the annotation of a newly assembled reference genome of a less common model organism. Eventually I would like to decide whether the annotation is of good enough quality for RNA-seq, and/or identify the weaknesses and gaps in the annotation and then annotate them myself.

ADD REPLY • link 6 weeks ago by zirui • 0

Annotation is probably going to be a secondary issue for you. You may need to find/build a comprehensive transcriptome as a first step.

ADD REPLY • link 6 weeks ago by GenoMax 153k

I see. Our lab will be outsourcing the spatial RNA-seq to 10x Genomics. However, the preprocessing pipeline of 10x Genomics requires a GFF file of the reference genome of the animal, which is why I am looking to evaluate the annotation first. May I ask why the transcriptome would come first? Would I not need an annotation first to extract meaningful information from the RNA transcripts?

ADD REPLY • link 6 weeks ago by zirui • 0

Would I not need an annotation first to extract meaningful information from the RNA transcripts?

Yes, but where is that transcriptome coming from? While you could use gene prediction tools on the genome sequence (which NCBI must have already done), the "proof is in the pudding", so to speak. You need experimental expression data, e.g. RNA-seq, to know which transcripts are real and are being expressed. So prediction and experimental confirmation go hand in hand to produce good annotation.

Annotation is the more difficult bit of the two. While some things can be done automatically, others need careful manual examination/curation. A $1K genome needs $100K in annotation effort, which people forget.

If you are working with a model genome then this has been addressed already. It is more of an issue if you are working with something unusual.

ADD REPLY • link 6 weeks ago by GenoMax 153k

Thank you for your explanation. The transcriptome data is coming from 10X: we send the tissue to them and they give us the spatial transcriptomic data of the tissue. The reason I am so concerned about the annotation is that the 10x Genomics website says:

…Visium HD 3' assay's poly-A capture method offers broad compatibility across a wide range of vertebrates, Space Ranger's downstream bioinformatics processing relies heavily on these reference files.

A high-quality GTF is crucial for correctly assigning captured reads to their respective genes, avoiding misattribution of reads to incorrect or nonexistent features. Poorly annotated or incomplete GTF files can lead to ambiguous read assignments, reducing the usable data and potentially misrepresenting gene expression profiles.

10x Genomics therefore strongly recommends using the most up-to-date and thoroughly curated GTF and FASTA files available for your organism of interest.

Currently, I am trying to see if the newest RefSeq zebra finch genome annotation is up to snuff for spatial transcriptomics.

ADD REPLY • link 6 weeks ago by zirui • 0

The transcriptome data is coming from 10X

Yes, but in the form of short stretches (50 bp) of RNA reads. It is not going to be a transcriptome in the sense of full-length transcripts. Because of the limitations of the technology, only about 15-25% of the mRNA is likely going to be captured/represented in your spatial data. Since these reads are short they can easily multi-map, so having a good transcriptome is noted as essential.

The RefSeq assembly of zebra finch should be fine for analysis, since it is not a v.1.x release and should be mature enough.

ADD REPLY • link 6 weeks ago by GenoMax 153k

Thank you so much for your response and your patience. Do you mind if I ask more questions? My specific interest is to use the Visium HD 3' Gene Expression assay to gather spatial transcriptomic data from the tissue. The data contain spatial barcodes, which can be integrated with imaging data to segment individual cell bodies. The goal is to determine molecular markers of the cells and their spatial organization within the tissue.

My understanding is that RefSeq annotations provide associated RNA transcript information that is either predicted or curated. Would a separate whole transcriptome data independent of the Visium-HD data be necessary? My guess is that whether an assembled transcriptome is needed is dependent on the quality of the annotation of Reference Genome, is that correct?

Currently, in the zebra finch reference genome, the majority of the transcripts are only predicted (XR_ prefix): 46,667 out of 47,631. How can one circumvent this?
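
A breakdown of predicted versus curated transcripts can be checked directly against the annotation file. In RefSeq, `NM_` and `NR_` accessions are curated mRNAs and non-coding RNAs, while `XM_` and `XR_` are the corresponding model (predicted) records. The sketch below is one way to tally unique transcripts by prefix. It assumes a GTF file whose ninth column carries `transcript_id "..."` attributes; the function name is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Matches a non-empty transcript_id attribute in GTF column 9.
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def count_transcript_prefixes(gtf_lines: Iterable[str]) -> Counter:
    """Count unique transcript accessions by RefSeq-style prefix."""
    seen = set()
    counts: Counter = Counter()
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        match = TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue  # e.g. gene rows with an empty transcript_id
        accession = match.group(1)
        if accession in seen:
            continue  # each transcript appears on many exon/CDS rows
        seen.add(accession)
        # Two letters plus underscore (NM_, XM_, ...); anything else is "other".
        prefix = accession[:3] if accession[2:3] == "_" else "other"
        counts[prefix] += 1
    return counts
```

Passing an open file handle for the GTF to `count_transcript_prefixes` returns one count per prefix, so the share of `XM_`/`XR_` records can be compared against `NM_`/`NR_`.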

I was under the impression that most NGS RNA-seq technologies use fragments of transcripts. If I have a bunch of fragmented RNA transcripts from VISIUM, could I technically assemble complete transcripts from the fragments?

Could I ask for literature recommendations on the pitfalls and best practices of spatial transcriptomics? If you don't mind, could I ask more follow-up questions later on?

ADD REPLY • link 6 weeks ago by zirui • 0

Would a separate whole transcriptome data independent of the Visium-HD data be necessary? My guess is that whether an assembled transcriptome is needed is dependent on the quality of the annotation of Reference Genome, is that correct?

For your analysis the RefSeq transcriptome should be adequate. Ensembl also has some annotations available (https://ftp.ensembl.org/pub/release-114/fasta/taeniopygia_guttata/cdna/Taeniopygia_guttata.bTaeGut1_v1.p.cdna.all.fa.gz); you can check to see if they are better. If you decide to use them, get the genome sequence from there as well so everything stays matched.
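
One quick way to confirm that a genome FASTA and an annotation file are "matched" is to check that every sequence name used in the annotation also exists in the FASTA. The sketch below is only a sanity check under simple assumptions: sequence names are the first whitespace-delimited token of each FASTA header, and the first tab-separated column of each GTF/GFF record. All function names are illustrative.

```python
from typing import Iterable, List, Set


def fasta_sequence_names(fasta_lines: Iterable[str]) -> Set[str]:
    """Collect sequence names from FASTA header lines ('>name description')."""
    names: Set[str] = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                names.add(header.split()[0])
    return names


def annotation_sequence_names(annotation_lines: Iterable[str]) -> Set[str]:
    """Collect the first-column sequence names from GTF/GFF records."""
    names: Set[str] = set()
    for line in annotation_lines:
        if line.startswith("#") or not line.strip():
            continue
        names.add(line.split("\t", 1)[0].strip())
    return names


def unmatched_sequences(
    fasta_lines: Iterable[str], annotation_lines: Iterable[str]
) -> List[str]:
    """Return annotation sequence names that are absent from the FASTA."""
    missing = annotation_sequence_names(annotation_lines) - fasta_sequence_names(
        fasta_lines
    )
    return sorted(missing)
```

An empty list from `unmatched_sequences` suggests the two files use the same naming scheme. A non-empty list usually means the genome and annotation come from different sources (for example RefSeq accessions versus Ensembl chromosome names) and should not be combined as they are.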

Currently, in the zebra finch reference genome, the majority of the transcripts are only predicted (XR_ prefix): 46,667 out of 47,631. How can one circumvent this?

There is nothing to circumvent. If you find genes of interest from this set, you would use those accessions. There are public bulk RNA-seq datasets for the finch, e.g. https://www.ncbi.nlm.nih.gov/bioproject/PRJNA1124911 You could look in those data to see whether your genes of interest are expressed, as an independent confirmation.

If I have a bunch of fragmented RNA transcripts from VISIUM, could I technically assemble complete transcripts from the fragments?

Not with the 3' Visium assay, which is going to capture only that end of the transcripts. And likely not with spatial data, which tends to be sparse.

ADD REPLY • link 6 weeks ago by GenoMax 153k