Question

What's the easiest way to download all the nucleotide sequences for proteins in COG?

12.2 years ago

Tianyang Li ▴ 500

Hi,

What's the easiest way to download all the nucleotide sequences for the proteins in COG?

I hope there's a better way than using a web API to download the nucleotide sequences one by one.

Thanks!

protein nucleotide sequence • 3.0k views

updated 13 months ago by Ram 43k • written 12.2 years ago by Tianyang Li ▴ 500

score 4 · Answer 1 · 2012-02-21

In theory you could download this file from the COG FTP site:

wget -O cog.txt "ftp://ftp.ncbi.nih.gov/pub/COG/COG/myva=gb"
head -5 cog.txt
# APE0180 14600509
# APE0225 14600543
# APE0277 14600591
# APE0307 14600619
# APE0324 14600631
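
For scripting against this file, here is a minimal Python sketch that collects the numeric IDs from it. It assumes each line ends with the ID as in the sample above (the leading "#" may or may not be present in the real file), and the function name read_protein_gis is hypothetical:

```python
def read_protein_gis(path):
    """Return the unique protein GIs from a COG mapping file, in file order."""
    seen = set()
    gis = []
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            # Assumption: the GI is the last whitespace-separated field and is
            # purely numeric; blank lines and anything else are skipped.
            if len(fields) < 2 or not fields[-1].isdigit():
                continue
            if fields[-1] not in seen:
                seen.add(fields[-1])
                gis.append(fields[-1])
    return gis
```

Calling read_protein_gis("cog.txt") on the downloaded file should give a de-duplicated list of GI strings ready to pass on to the next step.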

The second column contains protein GIs. You could then write a script that uses NCBI EUtils (ELink) to link each protein GI to, for example, its Gene ID:

curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=19076072"


<eLinkResult>

    <LinkSet>
        <DbFrom>protein</DbFrom>
        <IdList>
            <Id>19076072</Id>
        </IdList>
        <LinkSetDb>
            <DbTo>gene</DbTo>
            <LinkName>protein_gene</LinkName>
            <Link>
                <Id>2539562</Id>
            </Link>
        </LinkSetDb>
    </LinkSet>
</eLinkResult>

Parse the output to get the list of Gene IDs, then submit that list to Batch Entrez to retrieve the nucleotide sequences.
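
The parsing step could look roughly like this in Python. This is only a sketch using the standard library's xml.etree.ElementTree on a response shaped like the one above. The function name linked_gene_ids is hypothetical, and fetching the response itself (for example with curl) is left out:

```python
import xml.etree.ElementTree as ET


def linked_gene_ids(elink_xml):
    """Return the IDs listed under <LinkSetDb> blocks that point at 'gene'."""
    root = ET.fromstring(elink_xml)
    ids = []
    for link_set_db in root.iter("LinkSetDb"):
        # Skip link sets that target a database other than gene.
        if link_set_db.findtext("DbTo") != "gene":
            continue
        ids.extend(link_id.text for link_id in link_set_db.findall("Link/Id"))
    return ids


# The response shown above, trimmed to its elements.
sample = """<eLinkResult>
    <LinkSet>
        <DbFrom>protein</DbFrom>
        <IdList>
            <Id>19076072</Id>
        </IdList>
        <LinkSetDb>
            <DbTo>gene</DbTo>
            <LinkName>protein_gene</LinkName>
            <Link>
                <Id>2539562</Id>
            </Link>
        </LinkSetDb>
    </LinkSet>
</eLinkResult>"""

print(linked_gene_ids(sample))  # prints ['2539562']
```

A protein GI that has been retired may come back with no LinkSetDb element at all, in which case this function returns an empty list rather than raising an error.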

The problem: COG has not been updated in years, so many of the protein GIs have since been retired. You'll either have to work around that or, perhaps, not use COG at all; it is badly outdated and unmaintained.