Question: How to link the assembly accession with the chromosome accession for prokaryotic representative genomes
Asked 9 months ago by wangdp123 (Oxford):

Dear colleague,

I am working on the analysis of prokaryotic genomes from NCBI genome database.

  1. I downloaded the file prok_representative_genomes.txt from ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prok_representative_genomes.txt

The file contains a column named "Chromosome RefSeq" (e.g., NZ_AQXM00000000).

  2. I downloaded the protein sequences for all bacterial genomes from https://www.ncbi.nlm.nih.gov/assembly.

Each file has a name like "GCF_000834735.1_ASM83473v1_protein.faa.gz".

It is odd that the two datasets use different accession number systems. Given this, how can I tell whether the genome behind a given protein file is listed in "prok_representative_genomes.txt"?

My aim is to retrieve all the protein sequences for the genomes listed in the file "prok_representative_genomes.txt".

Thanks a lot,

Kind regards

Tom

prokaryotic genomes ncbi • 272 views
modified 9 months ago • written 9 months ago by wangdp123

Thanks,

Does anybody know why some entries have a value of 0 in the "Genomes" column of prok_representative_genomes.txt? For example, "Acetobacter aceti".

Reply written 9 months ago by wangdp123

The README says it's manually updated. I suspect you'll get a more complete picture from prokaryotes.txt, which is updated computationally.

Reply written 9 months ago by Carambakaracho
Answered 9 months ago by vkkodali (United States):

In the same FTP path, there is another file, prokaryotes.txt, that can be helpful. Join the two files on the organism name (column 3 of prok_representative_genomes.txt and column 1 of prokaryotes.txt). Then extract the path to the assembly folder on the FTP site (column 21 of prokaryotes.txt) and download the *_protein.faa.gz files.

Accessions with the NZ_ prefix are RefSeq Chromosome accessions whereas the ones with GCF_ prefix are assembly accessions. An assembly includes both chromosome and plasmid(s).
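The join described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the column positions (3, 1 and 21, counted from 1) are the ones given in this answer and may differ in the current files; header or comment lines are assumed to start with `#`; and the protein file is assumed to be named after the last component of the assembly folder path, as in the example file name in the question. The function name `protein_urls` is made up for this sketch.

```python
def data_rows(lines):
    """Yield tab-split fields for each line, skipping blank and '#' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line and not line.startswith("#"):
            yield line.split("\t")


def protein_urls(rep_lines, prok_lines,
                 rep_name_col=3, prok_name_col=1, prok_ftp_col=21):
    """Return protein FASTA URLs for organisms listed in the representative file.

    Column numbers are 1-based and taken from the answer above (assumption:
    they still match the files you downloaded).
    """
    # Organism names from prok_representative_genomes.txt
    rep_names = {
        row[rep_name_col - 1]
        for row in data_rows(rep_lines)
        if len(row) >= rep_name_col
    }

    urls = []
    for row in data_rows(prok_lines):
        if len(row) < prok_ftp_col:
            continue  # malformed or short row
        name = row[prok_name_col - 1]
        ftp_path = row[prok_ftp_col - 1].rstrip("/")
        if name in rep_names and ftp_path and ftp_path != "-":
            # Assumption: files inside the folder are prefixed with the folder name
            folder = ftp_path.rsplit("/", 1)[-1]
            urls.append(f"{ftp_path}/{folder}_protein.faa.gz")
    return urls
```

To use it, open both downloaded files and pass the file handles, e.g. `protein_urls(open("prok_representative_genomes.txt"), open("prokaryotes.txt"))`. If the same organism name appears on several rows of prokaryotes.txt, this sketch returns one URL per matching row, so the result may need de-duplicating or further filtering.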

written 9 months ago by vkkodali

As vkkodali wrote, use prokaryotes.txt. However, there's no need to join; one can simply filter on the Reference column. NCBI distinguishes between manually selected reference genomes (which exist for common lab organisms) and computationally selected representative genomes (which exist for almost every species).
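Filtering on a single column, as suggested here, might look like the sketch below. It assumes the first line of prokaryotes.txt is a tab-separated header (possibly starting with `#`) and looks the column up by name. The default column name `"Reference"` comes from this comment, but the value to keep (`"REPR"` here) is a guess and should be checked against the values actually present in your copy of the file. The function name `filter_by_column` is made up for this example.

```python
def filter_by_column(lines, column_name="Reference", keep_values=("REPR",)):
    """Return rows (as dicts keyed by header name) whose column value is kept.

    Assumptions: the first line is the header, and keep_values lists the
    codes that mark the genomes of interest (check your file for the real ones).
    """
    header = None
    idx = None
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if header is None:
            # Strip a leading '#' from the first header field, if present
            header = [f.lstrip("#").strip() for f in fields]
            idx = header.index(column_name)  # raises ValueError if absent
            continue
        if len(fields) > idx and fields[idx] in keep_values:
            kept.append(dict(zip(header, fields)))
    return kept
```

Each returned dict can then be used to read off the FTP path column for download, without touching prok_representative_genomes.txt at all.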

Reply written 9 months ago by Carambakaracho