Question: Finding the genome of origin of a protein on GenBank
Lille My wrote, 16 months ago:

Hi all, I have a list of protein ids, of the WP_ type. I need to find the assembly (GCA_, GCF_ ) they come from. Any ideas on how to do it? Thanks :-)

Tags: assembly, ncbi

Have you checked NCBI's page on non-redundant RefSeq protein accession numbers? As far as I can see, these accession numbers are tied to genomic records that are not G* (GCA_/GCF_) accessions. See the example of WP_003547430. If you click on the genomic records under Related information, you can see the constituent genomes.

On the command line you can use Entrez Direct (EDirect) to get this information:

 esearch -db protein -query "WP_003547430.1" | elink -db protein -target nuccore | efetch -format docsum | xtract -pattern DocumentSummary -element Caption,Title

This will give you (truncated for space):

NZ_ATYZ01000003 Rhizobium leguminosarum bv. viciae UPM1131 A19QDRAFT_scaffold_2.3_C, whole genome shotgun sequence
NZ_ATTP01000009 Rhizobium leguminosarum bv. viciae GB30 A3A3DRAFT_scaffold_8.9_C, whole genome shotgun sequence
NC_021905       Rhizobium etli bv. mimosae str. Mim1, complete genome
NZ_ARRT01000006 Rhizobium leguminosarum bv. viciae 248 RLEG17DRAFT_Scaffold1.7_C, whole genome shotgun sequence
NZ_MRDL01000032 Rhizobium leguminosarum bv. viciae USDA 2370 scaffold22, whole genome shotgun sequence
NZ_MRDM01000002 Rhizobium laguerreae strain FB206 scaffold16, whole genome shotgun sequence
written 16 months ago by genomax

Interesting. When I manually checked a few on the website, I found a link through the Identical Protein Groups database. An example is WP_043107373.1, which, after some digging, turns out to be associated with GCF_000801295.1. It also has an NZ number (NZ_AP012978.1), but I think this might be the accession number of the gene.
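
The Identical Protein Groups route described above can probably be scripted as well. The sketch below assumes that efetch accepts -format ipg for the protein database and that the resulting tab-separated report has header columns named "Protein" and "Assembly"; check both against the output of your EDirect version. The function name ipg_to_assembly is made up for illustration.

```shell
# ipg_to_assembly (hypothetical helper): read a tab-separated IPG report
# with a header line on stdin and print "protein<TAB>assembly" for every
# row that has a non-empty Assembly field.
# Assumption: the header contains columns named "Protein" and "Assembly".
ipg_to_assembly() {
  awk -F '\t' '
    NR == 1 {
      # Locate the two columns by header name instead of fixed position.
      for (i = 1; i <= NF; i++) {
        if ($i == "Protein")  p = i
        if ($i == "Assembly") a = i
      }
      next
    }
    p && a && $a != "" { print $p "\t" $a }
  '
}

# Intended use (needs EDirect and network access):
#   efetch -db protein -id WP_043107373.1 -format ipg | ipg_to_assembly
```

Looking the columns up by name keeps the helper working if the column order differs from what is assumed here.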

written 16 months ago by Lille My

I can get the NZ* ID but not GCF* yet.

esearch -db protein -query "WP_043107373.1" | elink -db protein -target nuccore | efetch -format acc
NZ_AP012978.1
written 16 months ago by genomax

Thanks a lot, it works. Do you know how to run this query in batch? I have a list of IDs in a file, and on the second attempt I get an error message.

written 16 months ago by Lille My

Post a few examples here. The idea would be to do something like this:

epost -db protein -input your_file_w_id | elink -target nuccore -db protein | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession

Please use ADD REPLY/ADD COMMENT when responding to existing answers to keep threads logically organized.

written 16 months ago by genomax

Thanks! This works. I'm also trying to have it return the original query protein ID. Could you also help with that? Thanks.

written 16 months ago by Lille My

What do you mean by that? Just the ID that is in your own file/you are using for search?

You can accept @Sej's answer below to provide closure to this thread at some point.

written 16 months ago by genomax

I'm using the following:

epost -input file-with-gi-numbers -db protein | elink -target nuccore -db protein | elink -target assembly | esummary  | xtract -pattern AssemblyAccession -element AssemblyAccession

I get a list of GCA numbers that is longer than the list of accessions. I would like to have the final result in the format of:

<query> GCA_number or: 12345 GCA_12345

written 16 months ago by Lille My

I don't think there is a way to do this within Entrez Direct. Since we are cross-linking between different databases, the information about the original query is not carried forward.
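
A possible workaround, outside a single pipeline, is to run the lookup once per ID from a shell loop, so the query is still known when the assembly accessions come back. This is only a sketch: the function names lookup_assemblies and map_ids and the file name ids.txt are made up for illustration, and the pipeline inside lookup_assemblies is the one used elsewhere in this thread. Redirecting stdin from /dev/null keeps the EDirect commands from consuming the rest of the ID list inside the loop.

```shell
# lookup_assemblies (hypothetical helper): print the assembly accessions
# linked to one protein accession, using the pipeline from this thread.
lookup_assemblies() {
  elink -db protein -target nuccore -id "$1" |
    elink -target assembly |
    esummary |
    xtract -pattern AssemblyAccession -element AssemblyAccession
}

# map_ids (hypothetical helper): read one protein accession per line from
# the file given as $1 and print "query<TAB>assembly" pairs.
map_ids() {
  while IFS= read -r id || [ -n "$id" ]; do
    [ -z "$id" ] && continue          # skip blank lines
    # stdin comes from /dev/null so the lookup cannot read the ID file
    lookup_assemblies "$id" < /dev/null | sort -u |
      while IFS= read -r acc; do
        printf '%s\t%s\n' "$id" "$acc"
      done
  done < "$1"
}

# Intended use (needs EDirect and network access):
#   map_ids ids.txt > protein_to_assembly.tsv
```

Because one WP_ protein can occur in many assemblies, the output can have more lines than the input list. One request chain per ID is slow, and NCBI rate-limits requests, so a long list may need a pause between lookups.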

written 16 months ago by genomax
Sej Modha (Glasgow, UK) wrote, 16 months ago:

The following command would return the assembly accession number:

elink -target nuccore -db protein -id "WP_043107373.1" | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession

GCF_000801295.1

Truncated for space.

$ elink -target nuccore -db protein -id "WP_003547430.1" | elink -target assembly | esummary | xtract -pattern AssemblyAccession -element AssemblyAccession
GCF_004307185.1
GCF_004307195.1
GCF_004307135.1
GCF_004307165.1
GCF_004307125.1
GCF_004303745.1
GCF_004307045.1
GCF_004307035.1
GCF_004307025.1
GCF_004306835.1
GCF_004306925.1
GCF_004306885.1
written 16 months ago by genomax