Question: What's the easiest way to download all the nucleotide sequences for proteins in COG?
Tianyang Li (Beijing, China) wrote, 7.7 years ago:

Hi,

What's the easiest way to download all the nucleotide sequences for the proteins in COG?

I hope there's a better way than using a web API to download the nucleotide sequences one by one.

Thanks!

Neilfws (Sydney, Australia) wrote, 7.7 years ago:

In theory you could download this file from the COG FTP site:

wget -O cog.txt "ftp://ftp.ncbi.nih.gov/pub/COG/COG/myva=gb"
head -5 cog.txt
# APE0180 14600509
# APE0225 14600543
# APE0277 14600591
# APE0307 14600619
# APE0324 14600631

The second column contains protein GIs. You could then write a script that uses NCBI EUtils to link each protein GI to, for example, its Gene ID:

curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=19076072"


<!DOCTYPE eLinkResult PUBLIC "-//NLM//DTD eLinkResult, 23 November 2010//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eLink_101123.dtd">
<eLinkResult>

    <LinkSet>
        <DbFrom>protein</DbFrom>
        <IdList>
            <Id>19076072</Id>
        </IdList>
        <LinkSetDb>
            <DbTo>gene</DbTo>
            <LinkName>protein_gene</LinkName>
            <Link>
                <Id>2539562</Id>
            </Link>
        </LinkSetDb>
    </LinkSet>
</eLinkResult>

Parse the output, collect the Gene IDs, and submit the list to Batch Entrez to retrieve the nucleotide sequences.
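
To make the linking step a bit more concrete, here is a minimal Python sketch of it, not a tested pipeline. It assumes cog.txt has the locus name and protein GI as whitespace-separated columns (the leading "# " in the listing above is treated as optional). It also relies on elink handling each repeated id= parameter as its own lookup, so one request can cover many GIs instead of going one by one. The function names and the batch size of 200 are my own choices, not anything NCBI prescribes.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

# Same elink endpoint as the curl example, requested over HTTPS.
ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def read_protein_gis(path):
    """Return the protein GI (second column) from each line of the COG file."""
    gis = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Tolerate a leading "#" token in case it is part of the file.
            if fields and fields[0] == "#":
                fields = fields[1:]
            if len(fields) >= 2:
                gis.append(fields[1])
    return gis


def elink_url(gis):
    """Build one elink request with a separate id= parameter per protein GI."""
    params = [("dbfrom", "protein"), ("db", "gene")]
    params += [("id", gi) for gi in gis]
    return ELINK + "?" + urlencode(params)


def parse_elink(xml_text):
    """Map each protein GI in an eLinkResult document to its linked Gene IDs."""
    links = {}
    for linkset in ET.fromstring(xml_text).iter("LinkSet"):
        gene_ids = [node.text for node in linkset.findall("LinkSetDb/Link/Id")]
        for gi in linkset.findall("IdList/Id"):
            links[gi.text] = gene_ids
    return links


def link_all(gis, batch_size=200):
    """Query elink in batches and merge the per-GI results into one dict."""
    links = {}
    for start in range(0, len(gis), batch_size):
        batch = gis[start:start + batch_size]
        with urlopen(elink_url(batch)) as response:
            links.update(parse_elink(response.read()))
    return links
```

Something like `link_all(read_protein_gis("cog.txt"))` would then give a dict from protein GI to a list of Gene IDs, which you can flatten into the list for Batch Entrez. For a file this size you would also want to pause between requests to stay within NCBI's usage limits.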

The problem: COG has not been updated in years, so many of the protein GIs are now retired. You'll have to either work around that or, perhaps, not use COG at all; it is very outdated and not maintained.
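
One simple way to work around the retired GIs is to set aside every protein GI that comes back without a Gene link and deal with those separately. The sketch below assumes you have already built a dict mapping each protein GI to its list of linked Gene IDs; the helper name and the second sample entry are made up for illustration. Note that an empty list only means elink returned no gene link, which suggests but does not prove that the GI was retired.

```python
def split_unlinked(links):
    """Separate protein GIs with at least one Gene ID from those with none."""
    linked = {gi: ids for gi, ids in links.items() if ids}
    unlinked = sorted(gi for gi, ids in links.items() if not ids)
    return linked, unlinked


# Illustrative input: the first entry comes from the elink output above,
# the second is a made-up GI standing in for a record with no Gene link.
links = {"19076072": ["2539562"], "00000000": []}
linked, unlinked = split_unlinked(links)
print(len(linked), "linked;", len(unlinked), "without a Gene link:", unlinked)
```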
