I have a list of protein IDs of the WP_ type.
I need to find the assemblies (GCA_, GCF_) they come from.
Any ideas on how to do this?
Have you checked NCBI's page on non-redundant RefSeq protein accession numbers? As far as I can see, these accession numbers are tied to genomic records (NZ_, NC_) rather than directly to assembly accessions (GCA_, GCF_). See the example of WP_003547430: if you click on the genomic records under Related Information, you can see the constituent genomes.
On the command line you can use Entrez Direct (EDirect) to get this information:
esearch -db protein -query "WP_003547430.1" | elink -db assembly -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -element Caption,Title
This will give you (truncated for space):
NZ_ATYZ01000003 Rhizobium leguminosarum bv. viciae UPM1131 A19QDRAFT_scaffold_2.3_C, whole genome shotgun sequence
NZ_ATTP01000009 Rhizobium leguminosarum bv. viciae GB30 A3A3DRAFT_scaffold_8.9_C, whole genome shotgun sequence
NC_021905 Rhizobium etli bv. mimosae str. Mim1, complete genome
NZ_ARRT01000006 Rhizobium leguminosarum bv. viciae 248 RLEG17DRAFT_Scaffold1.7_C, whole genome shotgun sequence
NZ_MRDL01000032 Rhizobium leguminosarum bv. viciae USDA 2370 scaffold22, whole genome shotgun sequence
NZ_MRDM01000002 Rhizobium laguerreae strain FB206 scaffold16, whole genome shotgun sequence
When I manually checked a few on the website, I found a link through the Identical Protein Groups database.
An example is WP_043107373.1, which, after some digging, turns out to be associated with GCF_000801295.1.
It also has an NZ number (NZ_AP012978.1), but I think this might be the accession number of the gene.
So far I can get the NZ_ ID but not the GCF_ one:
esearch -db protein -query "WP_043107373.1" | elink -db assembly -target nuccore | efetch -format acc
Thanks a lot, it works.
Do you know how to do this query in batch? I have a list of ids in a file and at the second attempt there is an error message.
Post a few examples here. The idea would be to do something like this:
epost -db protein -input your_file_w_id | elink -target nuccore -db protein | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession
Please use ADD REPLY/ADD COMMENT when responding to existing answers to keep threads logically organized.
Thanks! This works.
I'm also trying to have it return the original query protein ID.
Could you also help with that?
What do you mean by that? Just the ID that is in your own file/you are using for search?
You can accept @Sej's answer below to provide closure to this thread at some point.
I'm using the following:
epost -input file-with-gi-numbers -db protein | elink -target nuccore -db protein | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession
I get a list of GCA numbers that is longer than the list of accessions. I would like to have the final result in the format of:
I don't think there is a way to do this within EDirect. Since we are cross-linking between different databases, the information about the original query is not carried forward.
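A single piped command loses the query ID, but a per-ID loop can keep it. The sketch below is not from the thread. It assumes EDirect (elink, esummary, xtract) is on your PATH, reads one accession per line from standard input, and prints the query ID next to each assembly accession it finds. The function name wp_to_assembly is made up for illustration, and xtract is pointed at DocumentSummary on the assumption that assembly summaries expose AssemblyAccession there.

```shell
# Sketch: print "query_id<TAB>assembly_accession", one elink chain per ID.
# Assumption: EDirect's elink, esummary and xtract are installed and on PATH.
# wp_to_assembly is a hypothetical helper name, not part of EDirect.
wp_to_assembly() {
  while IFS= read -r wp_id; do
    # Strip spaces, tabs and carriage returns; skip blank lines.
    wp_id=$(printf '%s' "$wp_id" | tr -d '[:space:]')
    [ -z "$wp_id" ] && continue
    # Same protein -> nuccore -> assembly chain as in the thread, for one ID.
    # </dev/null stops the first command from swallowing the loop's input.
    elink -db protein -target nuccore -id "$wp_id" < /dev/null |
      elink -target assembly |
      esummary |
      xtract -pattern DocumentSummary -element AssemblyAccession |
      sort -u |
      while IFS= read -r asm_acc; do
        # Prefix every assembly accession with the ID that produced it.
        printf '%s\t%s\n' "$wp_id" "$asm_acc"
      done
  done
}
```

Usage would look like `wp_to_assembly < ids.txt > id_to_assembly.tsv`, where ids.txt is a hypothetical file name. Running one chain per ID is slow for long lists, and a single WP_ protein can map to many assemblies, so expect more output lines than input IDs.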
The following command would return the assembly accession number:
elink -target nuccore -db protein -id "WP_043107373.1" | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession
Truncated for space.
elink -target nuccore -db protein -id "WP_003547430.1" | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession
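The Identical Protein Groups route mentioned earlier in the thread can also be scripted. This is a hedged sketch, not something the participants posted. It assumes efetch accepts -format ipg for the protein database and returns a tab-delimited table whose header row includes an Assembly column. ipg_assemblies is a hypothetical helper name.

```shell
# Sketch: list the assemblies named in the Identical Protein Groups report.
# Assumptions: EDirect's efetch is on PATH, "-format ipg" yields a
# tab-delimited table, and its header row has a column called "Assembly".
# ipg_assemblies is a hypothetical helper name, not part of EDirect.
ipg_assemblies() {
  efetch -db protein -id "$1" -format ipg < /dev/null |
    awk -F '\t' -v query="$1" '
      NR == 1 {
        # Find the Assembly column by name instead of hard-coding its position.
        for (i = 1; i <= NF; i++) if ($i == "Assembly") a = i
        next
      }
      # Print the query ID next to every non-empty assembly accession.
      a > 0 && $a != "" { print query "\t" $a }
    ' | sort -u
}
```

Calling `ipg_assemblies WP_043107373.1` should then print one line per distinct assembly, with the query ID in the first column. If the header assumption does not hold, the function prints nothing rather than guessing a column.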