Question: from GI or protein ID to genome coordinates
0
gravatar for andrey.v.shubin
2.5 years ago by
andrey.v.shubin50 wrote:

Hi!

I have a list of protein blast hits in RefSeq ID format (e.g., WP_012146310.1) , and now I want to retrieve genome coordinates of genes encoding these proteins.

I am using Entrez Direct to cope with it. Having the RefSeq ID, I got gi (501096268) and genome accession number (NC_009832.1), but I cannot get the gene coordinates, although I know that the coordinates are presented in the nuccore database...

I will appreciate your suggestions on this issue. Thanks!

ADD COMMENTlink modified 23 months ago by cmdcolin1.2k • written 2.5 years ago by andrey.v.shubin50
4
gravatar for andrey.v.shubin
2.5 years ago by
andrey.v.shubin50 wrote:

I have inquired NCBI support service about this question. This is their answer:

The information is stored in the identical protein report. https://www.ncbi.nlm.nih.gov/protein/WP_012146310.1?report=ipg

The following is an example of multiple wp accession input, combined with grep to retrieve only those refseq genome lines:

$ efetch -db protein -id WP_012146310.1,WP_011083771.1 -format ipg |grep -P "Start|WP_" 
Source Nucleotide Accession Start Stop Strand Protein Protein Name Organism Strain Superkingdom
RefSeq NC_009832.1 3981847 3982188 + WP_012146310.1 hypothetical protein [Serratia proteamaculans] Serratia proteamaculans 568 568 Bacteria
RefSeq NC_004463.1 1037016 1037285 - WP_011083771.1 hypothetical protein [Bradyrhizobium diazoefficiens] Bradyrhizobium diazoefficiens USDA 110 USDA 110 Bacteria
ADD COMMENTlink modified 2.5 years ago by genomax65k • written 2.5 years ago by andrey.v.shubin50

Makes the old code I wrote look like crap. Thanks for posting an update!

ADD REPLYlink written 2.5 years ago by Jenez520

Wondering what would be a good idea to do this on a mass scale, like all proteins in a genome?

ADD REPLYlink written 23 months ago by cmdcolin1.2k
1
gravatar for Jenez
2.5 years ago by
Jenez520
Sweden
Jenez520 wrote:

I can't think of a straight up 'single line of code' answer for you, hopefully someone else can.

What I would do is to download the genome genbank file, parse it using Biopython, and look for features (with info such as db_xref="GI:501096268") and grab the genomic coordinates from there. It might take a bit of reading about biopython (I'm not sure if you know python so there's that...). If you need to do this for several genomes, just set up a script to download the genbank file for the genome first and then run the above.

I tried to grab genomic sequences from protein ID's many months ago which entailed querying NCBI for identical protein reports and grabbing the coordinates from there but it was very cumbersome work.

Sorry, that's the best I can think of. Best of luck!

ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by Jenez520
1
gravatar for cmdcolin
23 months ago by
cmdcolin1.2k
United States
cmdcolin1.2k wrote:

To add onto the answer by @andrey.v.shubin I made a small script to try to query a large text file programmatically, if you have a file with a large list of protein IDs you can try and speed up NCBI for 50 protein IDs at a time

#!/usr/bin/env perl
use strict;
use warnings;

sub readNLines {
    my $lines_to_read = shift or die '....';
    my @lines_read;
    while(<>) {
        chomp;
        push( @lines_read, $_ );
        last if @lines_read == $lines_to_read;
    }
    return @lines_read;
}

sub readRecord {
    return join(',',readNLines(50))
}

while (my $lines = readRecord()) {
    my $output = `efetch -db protein -id $lines -format ipg|grep -v Start`;
    print $output;
}

With a file protein_ids.txt, one protein ID per line, then run

perl process.pl < protein_ids.txt > coords.txt
ADD COMMENTlink modified 23 months ago • written 23 months ago by cmdcolin1.2k
0
gravatar for harold.smith.tarheel
2.5 years ago by
United States
harold.smith.tarheel4.3k wrote:

I have limited experience using Entrez Direct, but I believe you want to include 'gene location [PROP]' in your query to obtain gene coordinates.

ADD COMMENTlink written 2.5 years ago by harold.smith.tarheel4.3k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 967 users visited in the last hour