How to download all accession IDs of CDS from all viral genomes
12 weeks ago
Kumar ▴ 120

Hi, I am looking to download the NCBI accession IDs of all CDS from all viral genomes into a text file. Please let me know how to do this. I tried the following command, but it downloads the accession IDs of whole viral genomes.

esearch -db nucleotide -query "virus [orgn]" | efetch -format acc >gi_virus_id.txt

genome accession viral NCBI nucleotide • 828 views

I assume you mean CDSs from accession IDs?

You could do something like this, but it would be better to use a more specific query than just "virus".

esearch -db nuccore -query "virus" | efetch -format fasta_cds_na
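
As a rough sketch of what a more specific query could look like, the function below restricts the search by taxonomy instead of by the free-text word "virus". It assumes EDirect is installed and that taxonomy ID 10239 is the Viruses node; the function name `viral_accessions` is made up for illustration. It prints accession numbers rather than sequences, and the result set may be very large.

```shell
# Sketch only: list accessions of nuccore records classified under Viruses.
# Assumes EDirect (esearch/efetch) is on PATH and that txid10239 is Viruses.
viral_accessions() {
    esearch -db nuccore -query "txid10239[Organism:exp]" |
        efetch -format acc
}
```

Usage would be something like `viral_accessions > viral_acc.txt`.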

Thanks for your help. However, I want to download just the GI numbers, not the FASTA sequences. My goal in downloading the GI numbers of genes from all viruses is to check whether our samples (FASTQ files extracted from animals) show any similarity to viral genes, which would give us an idea of whether the samples are pathogenic. Also, I mentioned CDS by mistake; I need GI numbers from all viruses, not CDS. The header information for each GI, such as "Influenza B virus (B/Victoria/4/2012) polymerase PA (PA) gene" or "HIV-1 isolate F7S2CL10 gag protein (gag) gene", shows the virus name.

GI numbers are deprecated for use outside NCBI. You should switch your workflow to use accession numbers instead.

12 weeks ago
GenoMax 112k

NCBI makes a file available here that includes information about genes for all viruses. There is a separate file for retroviruses. You can parse out accession numbers you need from these files.

Hi, thank you. However, is there a command to download nucleotide sequences from all viruses? I am trying to determine which viral species are present in my input data (assembled reads from FASTQ files). I am using the following command, but it shows an error. Please let me know how to improve it.

esearch -db nucleotide -query "virus" | efetch -format acc >acc_ID

Now you are asking a completely different question. For a general download request, such as nucleotide sequences from all viruses, you can simply use Kai Blin's genome download tool (LINK) or the genome_updater tool (LINK).

I am sorry if I caused any confusion. I just want to improve the command that I am using. Kai Blin's genome download tool and genome_updater do not have an option to download only nucleotide accessions. I am not looking to download complete viral and bacterial genomes, just the nucleotide accessions for all viruses or bacteria. I think the command below can do this, but I am not sure it is correct. Anyway, thank you for your help.

esearch -db nucleotide -query "virus" | efetch -format acc >acc_ID

While that command seems to bring back accession numbers, I don't know if they are all guaranteed to be viral. There are also 4704837 records.

If you can work with viral genomes, then this file from NCBI has accession numbers for all viral genomes, according to the viral genomes page at NCBI. It is a plain text file (though it has a .nbr extension), so you can open it in any text editor.

awk -F " " '{print $1}' taxid10239.nbr | sort | uniq > acc_list


should give you a list of viral genome accessions. Looks like there are 14060 at this time.
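
A slightly more defensive variant of the same idea is sketched below. It assumes that any header or comment lines in the .nbr file start with `#` and that the accession is in the first whitespace-separated column; the function name `nbr_accessions` is made up for illustration. Check the top of the file before relying on it.

```shell
# Sketch only: print unique first-column accessions from a neighbors file.
# Assumes comment/header lines start with '#'; blank lines are skipped too.
nbr_accessions() {
    awk '!/^#/ && NF { print $1 }' "$1" | sort -u
}
```

Usage would be something like `nbr_accessions taxid10239.nbr > acc_list`.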

Yes, I downloaded the file and opened it. It shows 14060 unique records. Is this the total number of virus records? You mentioned "4704837 records".

These are unique whole genomes. Your original query is against the entire nucleotide database.

Sounds great. Also, could you please give me the links for other groups, such as bacteria and archaea, in case I want to download those? I am following the page, but I could not find the accession numbers.

You can find representative and reference bacterial genomes in the respective report files. Look for the correct column to get the accession numbers.
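
One way to pull a column out of a tab-separated report without counting columns by hand is sketched below. It assumes the report has a header line naming its columns (possibly prefixed with `#`); the function name `tsv_column` is made up for illustration, and the column name in the usage line is an assumption that should be checked against the actual file.

```shell
# Sketch only: print one column of a tab-separated file, chosen by header name.
# Lines are scanned until one contains the requested name as a field;
# a leading '#' (plus spaces) on that line's first field is ignored.
tsv_column() {
    awk -F '\t' -v name="$2" '
        col == 0 {
            sub(/^#[ ]*/, "", $1)
            for (i = 1; i <= NF; i++) if ($i == name) col = i
            next
        }
        { print $col }
    ' "$1"
}
```

Usage would be something like `tsv_column report.txt assembly_accession | sort -u > acc_list`, where `report.txt` stands for whichever report file you downloaded.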