Question

How To Separate Protein Accessions From Nucleotide Accessions In The Livelists


11.7 years ago

rsingh2083 ▴ 10

Hi all, I just downloaded the livelist ftp://ftp.ncbi.nlm.nih.gov/genbank/livelists/GbAccList.0304.2012.gz. Now I want to separate the protein accessions from the nucleotide accessions and save them in two different text files.

Another problem: I have a question regarding the GbAccList. I just downloaded GbAccList.0304, and uncompressing it produced a file of over 5 GB containing around 266 million gene accession IDs. I have some 20,000 protein accession IDs, which I fed to Batch Entrez to retrieve the FASTA file. What I want now is only the gene sequences / gene IDs corresponding to these 20,000 proteins (the gene sequences coding for these 20,000 proteins).

1.> The file I was using to pick up protein accession IDs has a total of 8,834,087 IDs (lines) and is 118 MB in size, but I don't remember where exactly on NCBI I downloaded it 7 months ago. Do you think this was the number of protein sequences in NCBI in January 2012?

2.> What is the difference in content between ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz and ftp://ftp.ncbi.nih.gov/genbank/livelists/GbAccList.0304.2012.gz?

3.> awk '{print $2}' gene2accession | grep -v -- '-' | sort -u

  awk '{print $6}' gene2accession | grep -v -- '-' | sort -u

The two commands above give 10120254 and 22995911 as output. Does that mean there are 10120254 genes and 22995911 proteins?

I just need all the accession IDs of genes and proteins in the database, separately. Secondly, I need a mapping from each gene accession to the protein accession(s) encoded by that gene.
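To make the mapping request concrete, here is a rough sketch of the kind of output I am after. It assumes gene2accession is tab-delimited with a header line starting with "#", that column 2 is the GeneID and column 6 the protein accession.version (the same columns as in the commands above), and that "-" marks a missing value. The function name gene_to_protein is made up for illustration.

```shell
# Hypothetical helper: print unique "GeneID<TAB>protein accession" pairs.
# Assumes a tab-delimited gene2accession file: column 2 = GeneID,
# column 6 = protein accession.version, "-" = no value.
gene_to_protein() {
    awk -F'\t' '
        /^#/ { next }                     # skip the header line
        $6 != "-" { print $2 "\t" $6 }    # keep rows that have a protein accession
    ' "$1" | sort -u
}
```

It would be called as: gene_to_protein gene2accession > gene2protein.txt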

• 2.9k views

updated 11.7 years ago by Pierre Lindenbaum 161k • written 11.7 years ago by rsingh2083 ▴ 10


You should copy/paste a snippet of your files; it would help us understand your issue and what you need.

11.7 years ago by Manu Prestat 4.1k

score 0 · Answer 1 · 2012-08-12


11.7 years ago

Pierre Lindenbaum 161k

"Now I want to separate the protein accessions from nucleotide accessions and save them in two different text files."

From ftp://ftp.ncbi.nlm.nih.gov/genbank/livelists/README.genbank.livelists:

Protein accessions can be easily distinguished from nucleotide accessions because they have a three-letter prefix, followed by five digits. The remaining accessions are nucleotide accessions, in either a one-letter/five-digit format or a two-letter/six-digit format.
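A minimal sketch of how that rule could be applied to split the file, assuming each line of the livelist is comma-separated with the accession in the first field (as in lines such as EBA53284,1,134307104). The function name split_accessions and the output file names are hypothetical, and anything that is not three letters plus five digits is simply treated as a nucleotide accession.

```shell
# Hypothetical helper: write protein and nucleotide accessions to two files.
# Assumes comma-separated lines with the accession in field 1.
split_accessions() {
    infile=$1
    protein_out=$2
    nucleotide_out=$3
    awk -F',' -v prot="$protein_out" -v nuc="$nucleotide_out" '
        # three letters followed by exactly five digits: protein accession
        $1 ~ /^[A-Za-z][A-Za-z][A-Za-z][0-9][0-9][0-9][0-9][0-9]$/ { print $1 > prot; next }
        # everything else is treated as a nucleotide accession
        { print $1 > nuc }
    ' "$infile"
}
```

For example: split_accessions GbAccList.0304.2012 protein_acc.txt nucleotide_acc.txt. A single awk pass is used so the 5 GB file is read only once.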

11.7 years ago by Pierre Lindenbaum 161k


Thanks, Pierre. So according to you, the livelists contain all the protein "SEQUENCE" accession IDs. I tried this:

sed -n '/^[[:alpha:]][[:alpha:]][[:alpha:]][[:digit:]]/p' GbAccList.0304.2012 | head

Output:

EBA53284,1,134307104
EBA53285,1,134307105
EBA53286,1,134307106
EBA53287,1,134307107
EBA53283,1,134307103
EBA53288,1,134307109
EBA53289,1,134307110
EBA53290,1,134307111
EBA53291,1,134307113
EBA53292,1,134307114

But all of these give the following error on NCBI Protein:

"Database is not supported: protein"

If these are protein accession IDs, as you say, why is there no FASTA record for them?

11.7 years ago by rsingh2083 ▴ 10