Question

Do fasta files at ftp.ncbi ever change once uploaded?

7.8 years ago

andrewsanchez ▴ 10

The NCBI FAQ states:

Only FTP files for the "latest" version of an assembly are updated when annotation is updated, new file formats are added or improvements to existing formats are released.

It also states:

Any changes to the sequences included in a particular assembly accession result in an increment of the assembly version, which means that an assembly accession.version (e.g. GCF_000001405.28) represents a fixed set of sequences.

This is the point I am confused about: when a file is "updated" (as in "updated when annotation is updated"), is it treated as a "changed" file whose version number is then incremented? I find NCBI's wording ambiguous. Is there a difference between a changed file and an updated file?

To clarify: when files are updated, is the version number always incremented, or are files sometimes updated without a version increment? That is, does every change result in a new filename with an incremented version number?

I'm wondering whether I can detect updated files based solely on the filename, since the version number in it would be incremented.

The reason I ask: if files are updated without the filename changing through a version increment, then rsync is the way to go.

If not, the task is much simpler and quicker, since I would only need to fetch newly uploaded files with wget based on their filenames.
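The filename-based approach described above could be sketched roughly as follows. This is only an illustration, assuming file names begin with an assembly accession.version such as GCF_000001405.28; the helper names `parse_accession` and `find_new_versions` are hypothetical, and the file names in the usage note are illustrative. It compares a remote listing against local file names and reports entries whose version is higher than anything held locally.

```python
import re

# Assumption: file names start with an assembly accession.version,
# e.g. "GCF_000001405.28_..." (GCA_ for GenBank, GCF_ for RefSeq).
ACCESSION_RE = re.compile(r"^(GC[AF]_\d+)\.(\d+)")


def parse_accession(filename):
    """Return (accession, version) or None if the name does not match."""
    match = ACCESSION_RE.match(filename)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def find_new_versions(local_names, remote_names):
    """Return remote names whose version exceeds the newest local copy."""
    latest_local = {}
    for name in local_names:
        parsed = parse_accession(name)
        if parsed:
            accession, version = parsed
            latest_local[accession] = max(version, latest_local.get(accession, 0))

    new_names = []
    for name in remote_names:
        parsed = parse_accession(name)
        if parsed:
            accession, version = parsed
            # Unknown accessions default to version 0, so they count as new.
            if version > latest_local.get(accession, 0):
                new_names.append(name)
    return new_names
```

For example, with a local copy of `GCF_000001405.27_old_genomic.fna.gz`, a remote listing containing both that file and `GCF_000001405.28_GRCh38.p2_genomic.fna.gz` would report only the `.28` file. Note that this catches sequence changes only; it cannot see updates that leave the version number unchanged.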

NCBI FASTA • 1.4k views

You may want to stick with genomes in the RefSeq section, which you can find here (see the bacteria directory): ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/

Here is an explanation of how accession number and version numbers are handled by NCBI.

ADD REPLY • link 7.8 years ago by GenoMax 141k

7.8 years ago

natasha.sernova ★ 4.0k

See my answer inside this post:

where can I get environmental bacteria genome in fasta format (as many as possible)?

See the README files for more information on what has changed.


score 1 · Accepted Answer · 2016-07-12

The definitive answer seems to be in the link on sequence IDs provided by @genomax2:

Note that the gi number doesn't change every time the record is modified. Only changes to the sequence data trigger assignment of a new gi; minor updates are tracked, but don't change the gi or version number. But note that every time the gi changes, the version number is incremented. That is, when any change is made to a sequence, it both receives a new GI number, and the version part of its accession number is incremented by 1.

This suggests that rsync shouldn't be necessary unless one wishes to track every minor update.
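If tracking minor updates does matter, one alternative to rsync is to compare checksums. The sketch below assumes a checksum manifest in md5sum format (`<hash>  <name>` per line) is available next to the files; NCBI assembly directories commonly ship an `md5checksums.txt`, but verify that for the directory you mirror. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def parse_manifest(text):
    """Map file name -> expected MD5 from md5sum-style lines."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            digest, name = parts
            name = name.strip()
            if name.startswith("./"):  # drop a leading "./" if present
                name = name[2:]
            expected[name] = digest.lower()
    return expected


def md5_of(path):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    hasher = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def changed_files(manifest_text, local_dir):
    """Return names that are missing locally or whose MD5 differs."""
    local_dir = Path(local_dir)
    changed = []
    for name, digest in parse_manifest(manifest_text).items():
        path = local_dir / name
        if not path.is_file() or md5_of(path) != digest:
            changed.append(name)
    return changed
```

Any name returned by `changed_files` would need to be downloaded again, even though its accession.version in the filename is unchanged.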