Question

How to download the accession IDs of all CDS from all viral genomes

2.5 years ago

Kumar ▴ 170

Hi, I would like to download the NCBI accession IDs of all CDS from all viral genomes into a text file. Please let me know how to do this. I tried the following command, but it downloads the accession IDs of whole virus genomes.

esearch -db nucleotide -query "virus [orgn]" | efetch -format acc >gi_virus_id.txt

genome accession viral NCBI nucleotide • 2.4k views

ADD COMMENT • link updated 2.5 years ago by GenoMax 141k • written 2.5 years ago by Kumar ▴ 170

I assume you mean CDSs from accession IDs?

You could do something like this, but it would be better to use a more specific query than just "virus".

esearch -db nuccore -query "virus" | efetch -format fasta_cds_na
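
If only the CDS identifiers are needed rather than the sequences, they can be pulled out of the FASTA headers afterwards. The following is a minimal sketch, assuming the headers carry a `[protein_id=...]` tag; the sample file, the IDs in it, and the helper name `cds_ids` are made up for illustration and should be checked against real `efetch` output.

```shell
# Hypothetical sample standing in for the output of efetch -format fasta_cds_na
cat > cds_sample.fasta <<'EOF'
>lcl|ACC_A.1_cds_PROT_A.1_1 [gene=geneA] [protein_id=PROT_A.1] [gbkey=CDS]
ATGAAATAA
>lcl|ACC_A.1_cds_PROT_B.1_2 [gene=geneB] [protein_id=PROT_B.1] [gbkey=CDS]
ATGCCCTAA
EOF

# Print the unique protein_id values found on FASTA header lines
cds_ids() {
  grep '^>' "$1" | sed -n 's/.*\[protein_id=\([^]]*\)].*/\1/p' | sort -u
}

cds_ids cds_sample.fasta > cds_id_list.txt
cat cds_id_list.txt
```

Header lines without a `protein_id` tag are skipped silently, so it is worth comparing the number of IDs with the number of header lines.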

ADD REPLY • link 2.5 years ago by GenoMax 141k

Thanks for your help. However, I am looking to download just the GI numbers, not the FASTA sequences. My goal in downloading the GI numbers of genes from all viruses is to check whether our samples (fastq files) extracted from animals have any similarity to viral genes, which could give us an idea of whether the samples are pathogenic. Also, I mentioned CDS by mistake; I need the GI numbers from all viruses, not CDS. The header information for each GI, such as "Influenza B virus (B/Victoria/4/2012) polymerase PA (PA) gene" or "HIV-1 isolate F7S2CL10 gag protein (gag) gene", shows the virus name.

ADD REPLY • link 2.5 years ago by Kumar ▴ 170

gi numbers are deprecated for use outside NCBI. You should switch your workflow to use accession numbers instead.

ADD REPLY • link 2.5 years ago by GenoMax 141k

Ok! Could you please let me know the command for downloading the accession numbers of all genes in all viruses?

ADD REPLY • link 2.5 years ago by Kumar ▴ 170

Hi, please let me know if you have an idea for downloading the accession numbers of genes in all virus species.

ADD REPLY • link 2.5 years ago by Kumar ▴ 170

score 0 · Answer 1 · 2021-11-04

2.5 years ago

GenoMax 141k

NCBI makes a file available here that includes information about genes for all viruses. There is a separate file for retroviruses. You can parse out accession numbers you need from these files.

ADD COMMENT • link 2.5 years ago by GenoMax 141k

Hi, thank you. However, is there a command to download nucleotide sequences from all viruses? I am trying to find out which virus species are present in the input data (assembled reads from fastq files). I am using the following command, but it shows an error. Please let me know how to improve it.

esearch -db nucleotide -query "virus" | efetch -format acc >acc_ID

ADD REPLY • link 2.5 years ago by Kumar ▴ 170

Now you are asking a completely different question. For a general download request such as nucleotide sequences from all viruses, you can simply use Kai Blin's genome download tool (LINK) or the genome_updater tool (LINK).

ADD REPLY • link 2.5 years ago by GenoMax 141k

I am sorry if I caused any confusion. However, I just want to improve the command that I am using. Kai Blin's genome download tool and genome_updater do not have an option to download only nucleotide accessions. I am not looking to download complete viral and bacterial genomes, just the accessions of nucleotide records for all viruses or bacteria. I think the command below can do this, but I am not sure it is correct. Anyway, thank you for your help.

esearch -db nucleotide -query "virus" | efetch -format acc >acc_ID

ADD REPLY • link 2.5 years ago by Kumar ▴ 170

While that command seems to bring back accession numbers, I don't know if they are all guaranteed to be viral. There are also 4704837 records.

If you can work with viral genomes, then this file from NCBI has accession numbers for all viral genomes, according to the viral genomes page at NCBI. It is a plain text file (though it has a .nbr extension), so you can open it in any text editor.

awk -F " " '{print $1}' taxid10239.nbr | sort | uniq > acc_list

should give you a list of viral genome accessions. Looks like there are 14060 at this time.
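
As a small illustration of the same idea, here is a sketch that runs on a made-up sample instead of the real download. It assumes the file is tab-separated and may start with `#` header lines, which should be verified against the actual `taxid10239.nbr`; the accessions and the helper name `extract_accessions` are hypothetical.

```shell
# Hypothetical tab-separated sample standing in for taxid10239.nbr
printf '## Columns:\tRepresentative\tNeighbor\tHost\n' > sample.nbr
printf 'ACC_A.1\tNBR_1.1\thostA\n' >> sample.nbr
printf 'ACC_A.1\tNBR_2.1\thostA\n' >> sample.nbr
printf 'ACC_B.1\tNBR_3.1\thostB\n' >> sample.nbr

# Print the unique first-column values, skipping '#' header lines and empty lines
extract_accessions() {
  awk -F '\t' '!/^#/ && NF {print $1}' "$1" | sort -u
}

extract_accessions sample.nbr > acc_list
wc -l < acc_list
```

Skipping the `#` lines keeps a header marker from being counted as an accession, so the count can differ by one from a plain `awk '{print $1}'`.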

ADD REPLY • link 2.5 years ago by GenoMax 141k

Yes, I downloaded the file and opened it. It shows 14060 unique records. Is this the total number of virus records? You mentioned "4704837 records".

ADD REPLY • link 2.5 years ago by Kumar ▴ 170

These are unique whole genomes. Your original query is against the entire nucleotide database.
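
One way to narrow the search is to restrict it by taxonomy and record source. This is only a sketch: the `txid10239[Organism:exp]` and `refseq[filter]` terms are an assumption about what "viral" should mean here, the output file name is made up, and the resulting count may not match the genome list. The snippet does a dry run unless the hypothetical `RUN_QUERY` variable is set to 1.

```shell
# 10239 is the NCBI Taxonomy ID for Viruses (it also appears in the file name taxid10239.nbr).
# [Organism:exp] includes all taxa below that node; refseq[filter] keeps RefSeq records only.
query='txid10239[Organism:exp] AND refseq[filter]'

# Dry run by default: set RUN_QUERY=1 to actually contact NCBI (requires EDirect).
if [ "${RUN_QUERY:-0}" = "1" ] && command -v esearch >/dev/null 2>&1; then
  esearch -db nuccore -query "$query" | efetch -format acc > viral_refseq_acc.txt
else
  echo "Would run: esearch -db nuccore -query \"$query\" | efetch -format acc"
fi
```

Dropping `AND refseq[filter]` would return every viral record in the nucleotide database, which is likely to be a very large list.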

ADD REPLY • link 2.5 years ago by GenoMax 141k

Sounds great. Could you also please give me the links for other groups such as bacteria and archaea, in case I want to download them? I am following the page, but I could not find the accession numbers.

ADD REPLY • link 2.5 years ago by Kumar ▴ 170

You can find representative and reference bacterial genomes in the respective report files. Look for the correct column to get the accession numbers.
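
Picking a column by its header name, rather than by position, can be sketched as follows. The sample report, its column names, and the helper `column_by_name` are all hypothetical; the real report files may use different headers and layouts, so check the header line first.

```shell
# Hypothetical tab-separated report with a single '#'-prefixed header line
printf '#Organism\tTaxID\tAccession\n' > report_sample.txt
printf 'Organism one\t111\tACC_X.1\n' >> report_sample.txt
printf 'Organism two\t222\tACC_Y.1\n' >> report_sample.txt

# $1 = header name to look for, $2 = tab-separated file whose first line is the header
column_by_name() {
  awk -F '\t' -v name="$1" '
    NR == 1 {
      sub(/^#[ ]*/, "", $1)                      # drop the leading "#" from the header
      for (i = 1; i <= NF; i++) if ($i == name) col = i
      next
    }
    col { print $col }                           # prints nothing if the name was not found
  ' "$2"
}

column_by_name Accession report_sample.txt
```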

ADD REPLY • link 2.5 years ago by GenoMax 141k