I am working on the analysis of prokaryotic genomes from the NCBI Genome database.
- Downloaded a file called prok_representative_genomes.txt from the following URL: ftp://ftp.ncbi.nlm.nih.gov/genomes/GENOME_REPORTS/prok_representative_genomes.txt
This file has a column named "Chromosome RefSeq" (e.g., NZ_AQXM00000000).
- Downloaded the protein sequences for all bacterial genomes from https://www.ncbi.nlm.nih.gov/assembly.
Each file has a name like "GCF_000834735.1_ASM83473v1_protein.faa.gz".
Oddly, the two datasets use different accession number systems. Given that, how can I tell whether the genome that a protein file belongs to is listed in "prok_representative_genomes.txt"?
My aim is to retrieve all the protein sequences for the genomes listed in the file "prok_representative_genomes.txt".
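To make the mismatch concrete, the two identifier styles can be pulled out as in the sketch below. It is only a sketch under assumptions: that the table is tab-delimited with a header line that may start with "#", that a cell may hold several comma-separated accessions, and that the protein file names begin with a GCF_/GCA_ accession. The function names and the toy table are hypothetical, and the sketch does not by itself map one accession system to the other.

```python
import csv
import io
import re

# Assumption: file names start with an assembly accession such as GCF_000834735.1
ASSEMBLY_RE = re.compile(r"^(GC[AF]_\d+\.\d+)_")


def assembly_accession(filename):
    """Return the assembly accession prefix of a file name, or None if absent."""
    match = ASSEMBLY_RE.match(filename)
    return match.group(1) if match else None


def refseq_accessions(handle, column="Chromosome RefSeq"):
    """Collect the values of one named column from a tab-delimited table."""
    reader = csv.reader(handle, delimiter="\t")
    # Assumption: the first line is the header and may begin with "#"
    header = [name.lstrip("#").strip() for name in next(reader)]
    index = header.index(column)
    accessions = set()
    for row in reader:
        if len(row) > index and row[index] not in ("", "-"):
            # Assumption: a cell may list several comma-separated accessions
            for part in row[index].split(","):
                if part.strip():
                    accessions.add(part.strip())
    return accessions


# Toy table for illustration only; the real file has more columns.
sample = "#Species/genus\tChromosome RefSeq\nExample species\tNZ_AQXM00000000\n"

print(assembly_accession("GCF_000834735.1_ASM83473v1_protein.faa.gz"))
print(refseq_accessions(io.StringIO(sample)))
```

The first call yields an assembly accession (GCF_...), while the table yields sequence accessions (NZ_...), so the two sets cannot be compared directly without some table that links them.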
Thanks a lot,