How To Separate Protein Accessions From Nucleotide Accessions In The Livelists
11.7 years ago
rsingh2083 ▴ 10

Hi all, I just downloaded the livelist ftp://ftp.ncbi.nlm.nih.gov/genbank/livelists/GbAccList.0304.2012.gz. Now I want to separate the protein accessions from the nucleotide accessions and save them in two different text files.

A second problem concerns the GbAccList itself. Uncompressing GbAccList.0304 produced a file of over 5 GB containing around 266 million gene accession IDs. I have some 20,000 protein accession IDs, which I fed to Batch Entrez to retrieve the FASTA file. What I want now is only the gene sequences/gene IDs corresponding to these 20,000 proteins (the gene sequences coding for these 20,000 proteins).

1.> The file I was using to pick up protein accession IDs has a total of 8,834,087 IDs (lines) and is 118 MB in size, but I don't remember exactly where on NCBI I downloaded it from 7 months ago. Do you think this was the number of protein sequences in NCBI in January 2012?

2.> What is the difference in content between these two files? ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz and ftp://ftp.ncbi.nih.gov/genbank/livelists/GbAccList.0304.2012.gz

3.> awk '{print $2}' gene2accession | grep -v '-' | sort -u

  awk '{print $6}' gene2accession | grep -v '-' | sort -u

The two commands above yield 10120254 and 22995911 unique values, respectively. Does that mean there are 10120254 genes and 22995911 proteins?

I just need all the gene and protein accession IDs in the database, in separate lists. I also need a mapping from each gene accession to the accession of the protein it encodes.
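
The mapping asked for here could be sketched as a lookup against gene2accession. This is a minimal sketch, assuming the file is tab-separated with a header line starting with `#`, the gene identifier in column 2 and the protein accession.version in column 6 (the same columns the commands above use), and `-` for missing values. The sample rows, the accessions in them, and all file names below are made up for illustration.

```shell
# Made-up rows in the assumed gene2accession layout (tab-separated, "-" = missing).
printf '#tax_id\tGeneID\tstatus\tRNA_acc\tRNA_gi\tprotein_acc\n' >  sample_gene2accession.txt
printf '9606\t1\tREVIEWED\tRNA0001.1\t11\tPROT0001.1\n'          >> sample_gene2accession.txt
printf '9606\t2\t-\t-\t-\t-\n'                                   >> sample_gene2accession.txt
printf '9606\t3\tREVIEWED\tRNA0003.2\t33\tPROT0003.2\n'          >> sample_gene2accession.txt

# Hypothetical list of wanted protein accessions, one per line, without version.
printf 'PROT0003\n' > wanted_proteins.txt

awk -F'\t' '
    NR == FNR { want[$1]; next }           # first file: remember wanted accessions
    /^#/ || $6 == "-" { next }             # skip the header and rows with no protein
    {
        acc = $6
        sub(/\.[0-9]+$/, "", acc)          # drop the ".version" suffix before lookup
        if (acc in want) print $2 "\t" $6  # gene identifier, protein accession.version
    }
' wanted_proteins.txt sample_gene2accession.txt | sort -u > gene_to_protein.txt
```

With the real data, `wanted_proteins.txt` would hold the 20,000 protein accessions and `sample_gene2accession.txt` would be the uncompressed gene2accession file. Proteins that have no row in gene2accession simply produce no output line.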


You should copy/paste a snippet of your files; it would help us understand your issue and needs.

11.7 years ago

"Now I want to separate the protein accessions from the nucleotide accessions and save them in two different text files."

From ftp://ftp.ncbi.nlm.nih.gov/genbank/livelists/README.genbank.livelists:

Protein accessions can be easily distinguished from nucleotide accessions because they have a three-letter prefix, followed by five digits. The remaining accessions are nucleotide accessions, in either a one-letter/five-digit format or a two-letter/six-digit format.
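
The README rule quoted above can be applied in a single pass over the file. This is a minimal sketch, assuming each livelist line has the comma-separated form `accession,version,GI`. The sample file and the output file names are hypothetical, and the sample lines are made up except for the first, which appears in this thread.

```shell
# Small stand-in for the uncompressed GbAccList file (made-up lines except the first).
cat > sample_livelist.txt <<'EOF'
EBA53284,1,134307104
U00001,1,111
AB000100,1,222
AAA12345,1,333
EOF

# One pass: field 1 with three letters + five digits goes to the protein file,
# everything else goes to the nucleotide file.
awk -F',' -v prot="protein.txt" -v nuc="nucleotide.txt" '
    $1 ~ /^[A-Za-z][A-Za-z][A-Za-z][0-9][0-9][0-9][0-9][0-9]$/ { print > prot; next }
    { print > nuc }
' sample_livelist.txt
```

The character classes are spelled out instead of using `{3}` and `{5}` so that the pattern also works in awk implementations without interval-expression support. For the real file, `sample_livelist.txt` would be replaced by `GbAccList.0304.2012`.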


Thanks Pierre. So, according to you, the livelists contain all the protein "SEQUENCE" accession IDs. I tried this:

sed -n '/^[[:alpha:]][[:alpha:]][[:alpha:]][[:digit:]]/p' GbAccList.0304.2012 | head

OUTPUT

EBA53284,1,134307104
EBA53285,1,134307105
EBA53286,1,134307106
EBA53287,1,134307107
EBA53283,1,134307103
EBA53288,1,134307109
EBA53289,1,134307110
EBA53290,1,134307111
EBA53291,1,134307113
EBA53292,1,134307114

But all of these give the following error in the NCBI protein database:

"Database is not supported: protein"

But according to you these are protein accession IDs, so why is there no FASTA record for them?
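
The sed pattern above checks only the first digit after the three letters. A stricter sketch, assuming the three comma-separated fields are accession, version and GI, enforces the full three-letter/five-digit format and prints `accession.version` identifiers. The file names are hypothetical, and the third sample line is made up to show a non-matching accession.

```shell
# Two lines from the output above plus one made-up non-protein line.
printf 'EBA53284,1,134307104\nEBA53285,1,134307105\nABCD01000001,1,999\n' > sample_hits.txt

# Keep only three-letter/five-digit accessions and join accession and version.
awk -F',' '
    $1 ~ /^[A-Za-z][A-Za-z][A-Za-z][0-9][0-9][0-9][0-9][0-9]$/ { print $1 "." $2 }
' sample_hits.txt > protein_acc_version.txt
```

Whether the resulting identifiers resolve in Entrez is a separate question from the format check.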
