Hi all, I just downloaded the livelist ftp://ftp.ncbi.nlm.nih.gov/genbank/livelists/GbAccList.0304.2012.gz. Now I want to separate the protein accessions from the nucleotide accessions and save them in two different text files.
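One possible way to do the split is by accession pattern. The sketch below is only that, a sketch: it assumes each line of the uncompressed list looks like `accession,version,gi` (worth checking with `head` first), and that a GenBank protein accession is three letters followed by five digits while everything else is nucleotide. The function name `split_accessions` and the output file names are made up for illustration.

```shell
# Hypothetical helper: split an accession list into protein and nucleotide files.
# Assumption: comma-separated lines with the accession in field 1.
# Assumption: protein accessions are 3 letters + 5 digits; all others are nucleotide.
split_accessions() {
    infile=$1
    prot_out=$2
    nuc_out=$3
    # Create/empty both outputs so they exist even if one class is absent.
    : > "$prot_out"
    : > "$nuc_out"
    awk -F',' -v prot="$prot_out" -v nuc="$nuc_out" '
        $1 == "" { next }                                   # skip blank lines
        $1 ~ /^[A-Z][A-Z][A-Z][0-9][0-9][0-9][0-9][0-9]$/ { print > prot; next }
        { print > nuc }
    ' "$infile"
}
```

Usage would be something like `split_accessions GbAccList.0304.2012 protein_acc.txt nucleotide_acc.txt`. awk streams the file line by line, so the 5 GB size should not be a memory problem.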
A related problem concerns the GbAccList itself. Uncompressing GbAccList.0304 resulted in a file of over 5 GB with around 266 million gene accession IDs. I have some 20,000 protein accession IDs, which I fed to Batch Entrez to retrieve the FASTA file. What I want now is only the gene sequences / gene IDs corresponding to these 20,000 proteins (that is, the gene sequences coding for these 20,000 proteins).
1.> The file I was using to pick protein accession IDs has a total of 8,834,087 IDs (lines) and is 118 MB in size, but I don't remember where exactly on NCBI I downloaded it from 7 months ago. Do you think this was the number of protein sequences at NCBI in January 2012?
2.> What is the difference in content between ftp://ftp.ncbi.nih.gov/gene/DATA/gene2accession.gz and ftp://ftp.ncbi.nih.gov/genbank/livelists/GbAccList.0304.2012.gz?
3.> awk '{print $2}' gene2accession | grep -v '-' | sort -u
awk '{print $6}' gene2accession | grep -v '-' | sort -u
The two commands above give 10120254 and 22995911 results, respectively. Does that mean there are 10,120,254 genes and 22,995,911 proteins?
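To make those counts a bit easier to trust, here is a hedged sketch of the same idea as a small function. It assumes gene2accession is tab-delimited, that its header line starts with `#`, that column 2 is the Entrez GeneID and column 6 is the protein accession.version, and that `-` marks an empty field. The name `count_distinct` is hypothetical.

```shell
# Hypothetical helper: count distinct non-empty values in one column.
# Assumptions: tab-delimited input, header lines start with "#",
# and "-" is the placeholder for a missing value.
count_distinct() {
    file=$1
    col=$2
    awk -F'\t' -v c="$col" '
        /^#/ { next }                         # skip the header line
        $c != "-" && $c != "" { print $c }    # keep only real values
    ' "$file" | sort -u | wc -l | tr -d ' '
}
```

For example, `count_distinct gene2accession 2` and `count_distinct gene2accession 6`. Note that, under these assumptions, column 2 counts GeneIDs (integer identifiers) that have at least one row in the file rather than gene accessions, and column 6 counts distinct protein accession.version strings.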
I just need all the gene and protein accession IDs in the database, in separate lists. Secondly, I need a mapping from each gene accession to the protein accession(s) that the gene codes for.
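For the mapping, a rough sketch of a lookup against gene2accession might look like the following. It assumes a file with one protein accession per line (with or without a version suffix), and it assumes the gene2accession columns are: 2 = GeneID, 4 = RNA nucleotide accession.version, 6 = protein accession.version, 8 = genomic nucleotide accession.version. The function name `map_proteins` and the file names are hypothetical.

```shell
# Hypothetical helper: for each wanted protein accession, print
# protein accession, GeneID, RNA accession and genomic accession.
# Assumptions: tab-delimited gene2accession with the column layout
# described above; versions (".1", ".2", ...) are ignored when matching.
map_proteins() {
    ids=$1
    g2a=$2
    awk -F'\t' '
        NR == FNR {                            # first file: wanted protein IDs
            sub(/\.[0-9]+$/, "", $1)
            if ($1 != "") want[$1] = 1
            next
        }
        /^#/ { next }                          # skip gene2accession header
        {
            acc = $6
            sub(/\.[0-9]+$/, "", acc)          # compare without the version
            if (acc in want) print $6 "\t" $2 "\t" $4 "\t" $8
        }
    ' "$ids" "$g2a"
}
```

Something like `map_proteins my_20000_proteins.txt gene2accession > protein_to_gene.tsv` would then give one line per matching row. Only the 20,000 IDs are held in memory. The ID file must not be empty, otherwise the `NR == FNR` test treats gene2accession as the ID list. Proteins that are not linked to a Gene record will simply not appear in the output.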
You should copy/paste a snippet of your files; it would help us understand your issue and what you need.