start from a list of protein accessions (e.g. WP_xxx) and get a corresponding list of assembly accessions (GCF_xxx).
If the intent here is to obtain a list of all GCF accessions that a particular WP is annotated on, then the answer above is not correct. It is fetching the entire list of GCFs for an organism irrespective of whether the WP is annotated on those assemblies or not. A couple of strategies to get the data you are looking for:
1. Use the IPG report
The IPG report for a WP accession will have the GCF accession. For example, here is the IPG table for WP_000134546. This particular WP appears to be annotated only on one assembly. For an example that is annotated on a number of assemblies, take a look at WP_003547430.1. You can simply download the entire table in CSV format by using the 'Send To' menu at the top right corner and choose 'File' as an option.
To do the same from the command line using Entrez Direct, you can repurpose the following code:
## returns tab-delimited data
efetch -db ipg -id 'WP_000134546.1' -format ipg
Id Source Nucleotide Accession Start Stop Strand Protein Protein Name Organism Strain Assembly
33217001 RefSeq NZ_AHLT01000014.1 13020 13304 - WP_000134546.1 hypothetical protein Staphylococcus aureus subsp. aureus IS-122 IS-122 GCF_000247415.1
The IPG output is also available as XML: run the same command with -format ipg -mode xml. The IPG report has a row for every protein record in the identical protein group, whether or not that record is a WP accession, so it can be quite long for some proteins. It is easy enough to filter for just the WPs, though.
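As a rough sketch of that filtering step, the helper below reads the tab-delimited IPG table on stdin and keeps only WP/GCF pairs. The function name wp_assemblies is made up for illustration, and it assumes the header row carries the column names "Protein" and "Assembly" exactly as in the sample output above.

```shell
# Hypothetical helper: reduce a tab-delimited IPG table (stdin) to
# unique "WP accession <TAB> GCF accession" pairs.
# Assumption: the first line is a header naming the "Protein" and
# "Assembly" columns, as in the sample output shown earlier.
wp_assemblies() {
  awk -F'\t' '
    NR == 1 {
      # Locate the columns by name rather than by fixed position.
      for (i = 1; i <= NF; i++) {
        if ($i == "Protein")  p = i
        if ($i == "Assembly") a = i
      }
      next
    }
    # Keep rows whose protein is a WP and whose assembly is a GCF.
    p && a && $p ~ /^WP_/ && $a ~ /^GCF_/ { print $p "\t" $a }
  ' | sort -u
}

# Usage (needs Entrez Direct and network access):
#   efetch -db ipg -id 'WP_000134546.1' -format ipg | wp_assemblies
```

Looking the columns up by header name is a defensive choice in case the column order differs from what is shown here.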
2. Use elink from protein to nuccore to assembly
Here, you just hop through a couple of elink steps to go from NCBI Protein to NCBI Nucleotide to NCBI Assembly. This returns information about the assemblies but misses the rich information in the IPG report, such as the nucleotide accession, range, and strand.
esearch -db protein -query 'WP_000134546' \
| elink -target nuccore -name protein_nuccore_wp \
| elink -db nuccore -target assembly -name nuccore_assembly \
| esummary \
| xtract -pattern DocumentSummary -element AssemblyAccession
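Since the original question starts from a list of WPs, here is a minimal sketch that wraps the same pipeline in a loop and prefixes each assembly with the WP it came from (the bare pipeline does not keep that pairing). The function name wp_list_to_assemblies and the input file name are made up for illustration; it assumes one accession per line on stdin.

```shell
# Hypothetical wrapper: read WP accessions (one per line) on stdin and
# print "WP accession <TAB> assembly accession" for each hit.
wp_list_to_assemblies() {
  while IFS= read -r acc; do
    # Skip blank lines in the input list.
    [ -n "$acc" ] || continue
    # stdin is redirected for esearch so that it cannot consume the
    # rest of the accession list feeding the loop.
    esearch -db protein -query "$acc" < /dev/null \
      | elink -target nuccore -name protein_nuccore_wp \
      | elink -db nuccore -target assembly -name nuccore_assembly \
      | esummary \
      | xtract -pattern DocumentSummary -element AssemblyAccession \
      | awk -v acc="$acc" '{ print acc "\t" $0 }'
  done
}

# Usage (needs Entrez Direct and network access; wp_list.txt is a
# placeholder file name):
#   wp_list_to_assemblies < wp_list.txt
```

This makes one set of requests per accession, so for long lists it may be worth pacing the loop to stay within NCBI's request limits.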
You can find more information about the elink descriptions here. NCBI has the following information on how to find related data starting with WPs.