Converting RefSeq protein accession IDs into Entrez IDs
11 months ago
Pegasus ▴ 100

Hi, I have a list of genes with RefSeq accession IDs that I want to convert to Entrez IDs, so that they can be used for Gene Ontology enrichment and pathway analysis in tools such as DAVID and g:Profiler. These IDs belong to a bacterial species that is supported by neither Ensembl nor g:Profiler.

I followed this post:

Bioinformatics: Converting Protein Refseq ID to Entrez Gene Accession

but I am still unable to convert these IDs, because they come from a different organism/species. The RefSeq IDs were extracted from the reference.genome.gtf file (downloaded from NCBI).

Examples of these RefSeq protein accessions:

WP_007431075.1 WP_010344636.1 WP_017427837.1 WP_014278738.1 WP_010344656.1 WP_019688556.1 WP_016819793.1 WP_007724645.1 WP_016821111.1 NA WP_010347944.1 WP_016819622.1 NA

Could you please suggest a website, tool, or R package?

Thank you.


WP_ accession numbers (non-redundant RefSeq proteins) can each refer to multiple genomes. See: https://www.ncbi.nlm.nih.gov/refseq/about/nonredundantproteins/

The best you could do is to get the IPG IDs.

$ efetch -db protein -id WP_017427837 -format ipg
Id      Source  Nucleotide Accession    Start   Stop    Strand  Protein Protein Name    Organism        Strain  Assembly
38029250        RefSeq  NZ_AMQU01000019.1       79641   80531   -       WP_017427837.1  ABC transporter permease subunit        Paenibacillus sp. ICGEB2008     ICGEB2008 GCF_000307675.1
38029250        RefSeq  NZ_CP023711.1   1197135 1198025 -       WP_017427837.1  ABC transporter permease subunit        Paenibacillus polymyxa  C12     GCF_022649565.1

Thank you, GenoMax. Since efetch is not available on the HPC I am working on, I replaced it with the command below:

 curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=WP_017427837&rettype=ipg&retmode=text"

It worked well, as shown below, so:

  1. Which number represents the IPG ID?

  2. Can we modify it to run automatically over a list of 4000 IDs in a CSV file and write their corresponding IPG IDs to output.csv?

  3. Should I then convert the IPG IDs to Entrez IDs so that I can proceed to Gene Ontology/pathway analysis? If yes, what tool do you recommend?

 Id      Source  Nucleotide Accession    Start   Stop    Strand  Protein Protein Name    Organism        Strain  Assembly
    38029250        RefSeq  NZ_AMQU01000019.1       79641   80531   -       WP_017427837.1  ABC transporter permease subunit        Paenibacillus sp. ICGEB2008     ICGEB2008  GCF_000307675.1
    38029250        RefSeq  NZ_CP023711.1   1197135 1198025 -       WP_017427837.1  ABC transporter permease subunit        Paenibacillus polymyxa  C12     GCF_022649565.1
    38029250        RefSeq  NZ_JWJJ01000001.1       862877  863767  -       WP_017427837.1  ABC transporter permease subunit        Paenibacillus polymyxa A18      A18     GCF_000809185.2
    38029250        INSDC   AMQU01000019.1  79641   80531   -       KKD53569.1      protein lplB    Paenibacillus sp. ICGEB2008     ICGEB2008       GCA_000307675.1
    38029250        INSDC   CP023711.1      1197135 1198025 -       UNL92992.1      sugar ABC transporter permease  Paenibacillus polymyxa  C12     GCA_022649565.1
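
For question 2, a minimal Python sketch of how the same E-utilities URL could be looped over a CSV file. It rests on a few assumptions: the input file holds one accession per row in its first column, the first column of the IPG report (headed `Id`) is the IPG identifier, and only the standard library is available on the HPC. The function names and the file names `input.csv` and `output.csv` are hypothetical. The pause between requests is there because NCBI limits clients without an API key to about three requests per second.

```python
import csv
import time
import urllib.parse
import urllib.request

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_ipg_ids(report_text):
    """Return the distinct values of the first ('Id') column of an IPG report."""
    ids = []
    for line in report_text.splitlines():
        fields = line.split()
        # Skip blank lines and the header row, whose first field is not numeric.
        if not fields or not fields[0].isdigit():
            continue
        if fields[0] not in ids:
            ids.append(fields[0])
    return ids


def fetch_ipg_report(accession):
    """Download the IPG report for one protein accession as plain text."""
    query = urllib.parse.urlencode(
        {"db": "protein", "id": accession, "rettype": "ipg", "retmode": "text"}
    )
    with urllib.request.urlopen(f"{EFETCH_URL}?{query}", timeout=60) as response:
        return response.read().decode("utf-8")


def convert(in_csv, out_csv, delay=0.4):
    """Write one 'accession, IPG ID(s)' row per WP_ accession found in in_csv."""
    with open(in_csv, newline="") as src, open(out_csv, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["refseq_protein", "ipg_id"])
        for row in csv.reader(src):
            accession = row[0].strip() if row else ""
            if not accession.startswith("WP_"):
                continue  # skips blank rows, "NA" entries, and any header row
            ipg_ids = parse_ipg_ids(fetch_ipg_report(accession))
            writer.writerow([accession, ";".join(ipg_ids)])
            time.sleep(delay)  # stay under the unauthenticated rate limit


# Example call (hypothetical file names):
# convert("input.csv", "output.csv")
```

With 4000 accessions and a 0.4 s pause this would take roughly half an hour. The sketch has no retry logic, so a failed request stops the run.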