Question: How To Separate Protein Accessions From Nucleotide Accessions In The Livelists
rsingh2083 wrote, 6.9 years ago:

Hi all, I just downloaded the livelist ftp://ftp.ncbi.nlm.nih.gov/genbank/livelists/GbAccList.0304.2012.gz. Now I want to separate the protein accessions from the nucleotide accessions and save them in two different text files.

I also have a second question regarding the GbAccList. I just downloaded GbAccList.0304, and uncompressing it resulted in a file of over 5 GB. The file has around 266 million gene accession ids. I have some 20,000 protein accession ids, which I fed to Batch Entrez to retrieve the FASTA file. What I want now is only the gene sequences / gene ids corresponding to these 20,000 proteins (that is, the gene sequences coding for these 20,000 proteins).

1.> The file I used to pick up protein accession ids has a total of 8,834,087 ids (one per line) and is 118 MB in size, but I don't remember where exactly on NCBI I downloaded it from 7 months ago. Do you think this was the number of protein sequences in NCBI in Jan 2012?

2.> What is the difference in content between ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz and ftp://ftp.ncbi.nih.gov/genbank/livelists/GbAccList.0304.2012.gz?

3.> awk '{print $2}' gene2accession | grep -v '-' | sort -u

  awk '{print $6}' gene2accession | grep -v '-' | sort -u

The two commands above give 10120254 and 22995911, respectively. Does that mean there are 10120254 genes and 22995911 proteins?

I just need all the gene accession ids and all the protein accession ids in the database, as two separate lists. Secondly, I need a mapping from each gene accession to the accessions of the proteins that gene codes for.
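A mapping of this kind can be sketched from gene2accession itself. The sketch below assumes that the file is tab-separated, that column 2 holds the gene id and column 6 the protein accession (as the two commands above suggest), and that "-" marks a missing value. The sample rows and file names are made up for illustration only.

```shell
# Build a tiny made-up sample in the assumed gene2accession layout:
# tab-separated, column 2 = gene id, column 6 = protein accession, "-" = missing.
workdir=$(mktemp -d)
{
    printf '#tax_id\tGeneID\tstatus\trna_acc\trna_gi\tprotein_acc\n'
    printf '1\t101\t-\tRNA_A.1\t11\tPROT_A.1\n'
    printf '1\t101\t-\tRNA_B.1\t12\tPROT_B.1\n'
    printf '1\t102\t-\t-\t-\t-\n'
    printf '1\t101\t-\tRNA_A.1\t11\tPROT_A.1\n'
} > "$workdir/sample_gene2accession"

# Skip the header line and rows without a protein, then keep each
# gene/protein pair once.
awk -F'\t' '!/^#/ && $6 != "-" { print $2 "\t" $6 }' "$workdir/sample_gene2accession" \
    | LC_ALL=C sort -u > "$workdir/gene_to_protein.tsv"

cat "$workdir/gene_to_protein.tsv"
```

Comparing column 6 against exactly "-" (rather than dropping every line that contains a hyphen) keeps the filter tied to the one column of interest. On the real file, the sample path would be replaced by the uncompressed gene2accession.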


You should copy/paste a snippet of your files, it could help to understand your issue and needs.

written 6.9 years ago by Manu Prestat
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote, 6.9 years ago:

"Now I want to separate the protein accessions from the nucleotide accessions and save them in two different text files."

From ftp://ftp.ncbi.nlm.nih.gov/genbank/livelists/README.genbank.livelists:

Protein accessions can be easily distinguished from nucleotide accessions because they have a three-letter prefix, followed by five digits. The remaining accessions are nucleotide accessions, in either a one-letter/five-digit format or a two-letter/six-digit format.
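The rule quoted above can be turned into a small filter. This is only a sketch: it assumes each line starts with the accession followed by comma-separated fields (as in EBA53284,1,134307104), and the second and third sample lines as well as the output file names are made up for illustration.

```shell
# Sample lines: the first comes from the livelist, the other two are
# made-up nucleotide-style entries (one letter + five digits, two letters + six digits).
workdir=$(mktemp -d)
cat > "$workdir/sample_livelist.txt" <<'EOF'
EBA53284,1,134307104
X00001,1,111
AB000001,1,222
EOF

# Three letters followed by exactly five digits -> protein file;
# everything else -> nucleotide file (per the README rule).
awk -F',' -v prot="$workdir/protein.txt" -v nuc="$workdir/nucleotide.txt" '
    $1 ~ /^[A-Za-z][A-Za-z][A-Za-z][0-9][0-9][0-9][0-9][0-9]$/ { print > prot; next }
    { print > nuc }
' "$workdir/sample_livelist.txt"

cat "$workdir/protein.txt"
```

A single awk pass writes both files at once, which matters for a 5 GB input. The pattern encodes only the rule quoted from the README; any accession format not covered by that rule would land in the nucleotide file.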


Thanks Pierre. So, according to you, the livelists contain all the protein "SEQUENCE" accession ids. I tried this:

cat GbAccList.0304.2012  | sed -n '/^[[:alpha:]][[:alpha:]][[:alpha:]][[:digit:]]/p' | head

OUTPUT

EBA53284,1,134307104
EBA53285,1,134307105
EBA53286,1,134307106
EBA53287,1,134307107
EBA53283,1,134307103
EBA53288,1,134307109
EBA53289,1,134307110
EBA53290,1,134307111
EBA53291,1,134307113
EBA53292,1,134307114
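Each line in the output above holds three comma-separated fields. Assuming these are accession, version, and GI number in that order (an assumption, not confirmed in this thread), a sketch for reducing such lines to accession.version identifiers, one per line, could look like this. The file names are made up for illustration.

```shell
# Two of the lines shown above, saved to a scratch file.
workdir=$(mktemp -d)
printf 'EBA53284,1,134307104\nEBA53285,1,134307105\n' > "$workdir/protein_lines.txt"

# Join field 1 (assumed accession) and field 2 (assumed version) with a dot.
awk -F',' '{ print $1 "." $2 }' "$workdir/protein_lines.txt" > "$workdir/protein_ids.txt"

cat "$workdir/protein_ids.txt"
```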

But all of these give this error on NCBI Protein:

"Database is not supported: protein"

If, according to you, these are protein accession ids, then why is there no FASTA record for them?

written 6.9 years ago by rsingh2083