Question: NCBI genome version vs published genome version - what's 'better'?
Biogeek330 wrote:

For my organism of choice, a sea anemone (Exaiptasia), there are two versions of the genome available: the original published genome and the NCBI-based genome.

The NCBI version has fewer mRNAs and predicted peptides (~2000 fewer) than the original published genome files. I'm aware that when raw files are uploaded, NCBI run their Eukaryotic Genome Annotation Pipeline (Splign, ProSplign, Gnomon) and provide an 'updated' version, i.e. 'their version', of the genome. I've also noted there are 5 'new' genes which the NCBI version has and the old genome doesn't, when I use grep to compare presence and absence of gene IDs.
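The grep-based presence/absence check could also be sketched in Python. This is only a rough illustration: it assumes both annotations are GFF3 files, that gene features carry an `ID=` attribute, and that the two annotations use the same identifiers (which may not hold if NCBI assigned its own). The function names and the example records are made up for illustration.

```python
import re


def gene_ids(gff_lines):
    """Collect the ID attribute of every 'gene' feature in GFF3 lines."""
    ids = set()
    for line in gff_lines:
        # Skip comments, directives and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # GFF3 has 9 tab-separated columns; column 3 is the feature type.
        if len(cols) < 9 or cols[2] != "gene":
            continue
        match = re.search(r"(?:^|;)ID=([^;]+)", cols[8])
        if match:
            ids.add(match.group(1))
    return ids


def compare_gene_ids(ids_a, ids_b):
    """Report IDs unique to each set and IDs present in both."""
    return {
        "only_a": sorted(ids_a - ids_b),
        "only_b": sorted(ids_b - ids_a),
        "shared": sorted(ids_a & ids_b),
    }


# Hypothetical records standing in for the two annotation files.
published = [
    "##gff-version 3",
    "scaf1\tpub\tgene\t1\t900\t.\t+\t.\tID=gene1;Name=a",
    "scaf1\tpub\tmRNA\t1\t900\t.\t+\t.\tID=mrna1;Parent=gene1",
    "scaf2\tpub\tgene\t50\t700\t.\t-\t.\tID=gene2",
]
ncbi = [
    "##gff-version 3",
    "scaf1\tncbi\tgene\t1\t900\t.\t+\t.\tID=gene1",
    "scaf3\tncbi\tgene\t10\t400\t.\t+\t.\tID=gene3",
]

result = compare_gene_ids(gene_ids(published), gene_ids(ncbi))
print(result)
```

With real files, the lists would be replaced by open file handles. If the identifiers differ between the two annotations, comparing by coordinates or by sequence would be needed instead.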

When it comes to RNA-seq alignment, I would assume it's best to use the most up-to-date NCBI version rather than the original published genome?

Sorry if this is a 'noob question' to ask.

Thanks.

written 9 days ago by Biogeek330

Depends what your objective is I think. I imagine the original sequence may have included some hand-curated annotations. The NCBI reanalysis is likely on the conservative side. If there are particular features of interest that are in the original but not the NCBI one, I'd say you're well within reason to use the original.

Most people, however, are going to use what's in NCBI for analyses, especially at large scale, so it's likely that results will be more consistent with other papers and future publications if you use the NCBI one.

In short, it depends on your priorities/questions I'd say.

written 9 days ago by jrj.healey9.1k

Thanks for your time jrj.healey.

I imagined that the NCBI version had been more conservatively reviewed and curated, with the potential for discarding useful info. There are now GenBank and RefSeq versions. I think I will use V1.1 GenBank (GCA_), as those are the official files which the authors of the genome paper submitted. I suspect the authors have reviewed and done further redundancy removal between V1.0 and V1.1. The RefSeq version (GCF_), I'll pass on. From what I've visualised, the results do not change with respect to our experiment. It's been a worthwhile exercise delving into the differences and how NCBI review/curate.

written 9 days ago by Biogeek330

If there is a RefSeq version then you should use that. RefSeq entries are manually curated and should represent the best possible information available.

modified 9 days ago • written 9 days ago by genomax59k

genomax, I agree, BUT... what if this species has many clade-specific genes? Surely RefSeq will become a limiting database/step?

written 9 days ago by Biogeek330

If that is the case, do you think even the GenBank record is going to be sufficient? You may then have to do de novo assembly on anything from your sample that does not map to the GenBank/RefSeq assembly, to see if there are real/additional genes in your genome.
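As a rough sketch of the first step, the reads that did not map could be pulled out of the alignment before assembly. The example below assumes the alignments are available as plain-text SAM lines and relies only on the SAM convention that bit 0x4 of the FLAG field marks an unmapped read. The function name and the example reads are invented for illustration, and the assembly step itself is not shown.

```python
UNMAPPED_FLAG = 0x4  # SAM FLAG bit: segment unmapped


def unmapped_reads(sam_lines):
    """Return (name, sequence, quality) for every unmapped SAM record."""
    reads = []
    for line in sam_lines:
        # Header lines start with '@' and carry no alignments.
        if line.startswith("@") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 11:
            continue  # not a complete alignment record
        flag = int(cols[1])
        if flag & UNMAPPED_FLAG:
            # Columns 1, 10 and 11 are QNAME, SEQ and QUAL.
            reads.append((cols[0], cols[9], cols[10]))
    return reads


# Hypothetical alignment records.
sam = [
    "@HD\tVN:1.6",
    "read1\t0\tscaf1\t100\t60\t5M\t*\t0\t0\tACGTA\tIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGCA\tIIIII",
    "read3\t77\t*\t0\t0\t*\t*\t0\t0\tGGGCC\tIIIII",
    "read4\t16\tscaf2\t20\t60\t5M\t*\t0\t0\tCCCAA\tIIIII",
]

leftover = unmapped_reads(sam)
for name, seq, qual in leftover:
    # Emit FASTQ-style records that an assembler could take as input.
    print(f"@{name}\n{seq}\n+\n{qual}")
```

In practice a dedicated tool would normally do this filtering on BAM files. The sketch only shows which records count as unmapped.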

modified 9 days ago • written 9 days ago by genomax59k

It is not a 'noob question' but a real problem. It's good that you realise that. Most people do not pay attention to it. They don't realise that different annotations exist and just take the default one, which is the NCBI one.

That being said, it's really hard to compare annotations and say which one is best. You need to be an expert in annotation to understand in detail what was done by the group that published the annotation (if they describe it in sufficient detail), how the NCBI pipeline works, and what data it used. Even knowing the pipelines/approaches in detail, in some cases it can still be very hard to guess which annotation is best. In some cases NCBI's is better, in others not. A colleague of mine was part of a project (a consortium/jamboree) where they did a huge amount of work to manually check all the genes of a yeast annotation one by one, but once it was submitted, NCBI ran an automatic annotation. In the end people use the NCBI annotation, even though the manually curated one is much more trustworthy.

I would say that usually the published version is slightly better, because experts on the species or on specific gene families have often looked at the annotation in detail and provided feedback, while NCBI does this in a more automated way that may not take some peculiarities into account.

On the other hand, the first genome version could also have been done by a freshly recruited Ph.D. student who launched a pipeline without knowing what he was really doing, where the result was good enough to answer their scientific question and to publish a paper. In that case the annotation could be worse than one produced by a good pipeline such as the one used at NCBI.

modified 9 days ago • written 9 days ago by Juke-341.7k

"I would say that usually the published version is slightly better, because experts on the species or on specific gene families have often looked at the annotation in detail and provided feedback, while NCBI does this in a more automated way that may not take some peculiarities into account."

If the genome was sequenced by a consortium of labs (or at least more than one lab that works on the organism) then that may be true. But as you rightly point out, if it was done by "a freshly recruited Ph.D. student who launched a pipeline without knowing what he was really doing", then NCBI's version may be better (since at some level someone must have looked at the results before releasing them into the database).

modified 9 days ago • written 9 days ago by genomax59k