What's the easiest way to download all the nucleotide sequences for proteins in COG?
12.2 years ago
Tianyang Li ▴ 500

Hi,

What's the easiest way to download all the nucleotide sequences for the proteins in COG?

I hope there's a better way than using a web API to download the nucleotide sequences one by one.

Thanks!

protein nucleotide sequence • 3.0k views
12.2 years ago
Neilfws 49k

In theory you could download this file from the COG FTP site:

wget -O cog.txt "ftp://ftp.ncbi.nih.gov/pub/COG/COG/myva=gb"
head -5 cog.txt
# APE0180 14600509
# APE0225 14600543
# APE0277 14600591
# APE0307 14600619
# APE0324 14600631
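
A minimal sketch of pulling the GI column out of that file. It assumes the GI is the last whitespace-separated field on each line, as in the listing above; the function name `extract_gis` and the inline sample lines are only for illustration.

```shell
# Hypothetical helper: print the unique protein GIs from a COG mapping file.
# Assumption: the GI is the last whitespace-separated field and is all digits.
# Reads the named file(s), or standard input when no file is given.
extract_gis() {
    awk '$NF ~ /^[0-9]+$/ { print $NF }' "$@" | sort -u
}

# Stand-in for cog.txt, using lines from the listing above (one duplicated).
gis=$(printf '%s\n' \
    'APE0180 14600509' \
    'APE0225 14600543' \
    'APE0180 14600509' | extract_gis)

printf '%s\n' "$gis"
```

On the real file this would be run as `extract_gis cog.txt > protein_gis.txt`, where `protein_gis.txt` is just an assumed output name.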

The second column contains protein GIs. You could then write a script using NCBI EUtils to link each protein GI to, for example, its Gene ID:

curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=19076072"


<eLinkResult>

    <LinkSet>
        <DbFrom>protein</DbFrom>
        <IdList>
            <Id>19076072</Id>
        </IdList>
        <LinkSetDb>
            <DbTo>gene</DbTo>
            <LinkName>protein_gene</LinkName>
            <Link>
                <Id>2539562</Id>
            </Link>
        </LinkSetDb>
    </LinkSet>
</eLinkResult>

Parse the output, collect the list of Gene IDs, and submit it to Batch Entrez to retrieve the nucleotide sequences.
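
The parsing step might look roughly like the sketch below. It assumes the ELink response is pretty-printed with one element per line, as in the output above, and it keeps only the `<Id>` values that sit inside a `<Link>` element, since the `<Id>` under `<IdList>` is just the query GI echoed back. The function names `fetch_gene_links` and `extract_linked_ids` are hypothetical, and a real XML parser would be more robust than `awk` here.

```shell
# Hypothetical helper: fetch the protein-to-gene links for one protein GI.
# Defined for illustration only; it is not called in this sketch.
fetch_gene_links() {
    curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=protein&db=gene&id=$1"
}

# Hypothetical helper: print the linked IDs found inside <Link> elements.
# Assumption: the XML has one element per line, as in the output above.
extract_linked_ids() {
    awk '
        /<Link>/ { in_link = 1 }
        in_link && match($0, /<Id>[0-9]+<\/Id>/) {
            # Strip the 4-character <Id> and the 5-character </Id>.
            print substr($0, RSTART + 4, RLENGTH - 9)
        }
        /<\/Link>/ { in_link = 0 }
    ' "$@"
}

# Demonstration on the response shown above (trimmed).
gene_ids=$(extract_linked_ids <<'EOF'
<eLinkResult>
    <LinkSet>
        <DbFrom>protein</DbFrom>
        <IdList>
            <Id>19076072</Id>
        </IdList>
        <LinkSetDb>
            <DbTo>gene</DbTo>
            <LinkName>protein_gene</LinkName>
            <Link>
                <Id>2539562</Id>
            </Link>
        </LinkSetDb>
    </LinkSet>
</eLinkResult>
EOF
)

printf '%s\n' "$gene_ids"
```

For a whole list of GIs (one per line in an assumed file `protein_gis.txt`), something like `while read -r gi; do fetch_gene_links "$gi" | extract_linked_ids; sleep 1; done < protein_gis.txt > gene_ids.txt` would collect the IDs while staying well under NCBI's request rate limit. Retired GIs would simply produce no output.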

The problem: COG has not been updated in years, so many of the protein GIs are now retired. You will either have to work around that or, perhaps, not use COG at all: it is very outdated and no longer maintained.
