Sequence Retrieval of SARS-CoV Complete Genomes
2.3 years ago
pfee418

Hi guys, I want to retrieve only complete genomes of SARS-CoV (not SARS-CoV-2) and its strains. I tried searching the NCBI Virus database, but the results mixed SARS-CoV and SARS-CoV-2 genomes together. Is there a way to download only SARS-CoV complete genomes? I'm also okay with using databases other than NCBI Virus.

Thank you in advance for all suggestions and opinions.

2.3 years ago
vkkodali_ncbi

You can use NCBI Datasets for this. Navigate to the Coronavirus genomes page to use the web-based interface or use the command-line tool as follows:

## for sars-cov2 genomes

## for sars genomes


The downloaded data package can include the following files:

• genomic.fna (genomic sequences)
• cds.fna (nucleotide coding sequences)
• protein.faa (protein sequences)
• protein.gpff (protein sequence and annotation in GenPept flat file format)
• protein structures in PDB format
• data_report.jsonl (data report with viral metadata)
• virus_dataset.md (README containing details on sequence file data content and other information)
• dataset_catalog.json (a list of files and file types included in the dataset)
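Once a package is unpacked, genomic.fna can be subset to the accessions you actually want. Here is a minimal Python sketch of that step, assuming the accession is the first whitespace-delimited token of each FASTA header (check this against your file). The helper names `read_fasta` and `keep_accessions` and the `ACC_1`/`ACC_2` records are made up for illustration and are not part of NCBI Datasets.

```python
from typing import Iterable, Iterator, List, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_accessions(records: Iterable[Tuple[str, str]], wanted: Set[str]) -> List[Tuple[str, str]]:
    """Keep records whose first header token (assumed to be the accession) is in `wanted`."""
    return [(h, s) for h, s in records if (h.split() or [""])[0] in wanted]


# Made-up records for illustration only; real input would be the lines of genomic.fna.
example = """>ACC_1 example virus A
ACGT
ACGT
>ACC_2 example virus B
TTTT
""".splitlines()

selected = keep_accessions(read_fasta(example), {"ACC_1"})
```

With a real file you would pass `open("genomic.fna")` instead of `example`, and build the `wanted` set from whichever accessions you identified as SARS-CoV.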

OP does NOT want SARS-CoV-2 genomes. Should your answer change to just sars-cov?

Thank you for noticing that I only want to retrieve SARS-CoV :)

Hi there, thank you for the suggestions and the links. It looks like I have found a way to identify the SARS-CoV genomes through these websites. On the Coronavirus genomes page, the details can be downloaded from the "Taxonomy" section. I downloaded the CSV file and then filtered out all the SARS-CoV-2 strain entries so that only the SARS-CoV strain entries remained. Thank you, the links are useful.
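The manual filtering described above could also be scripted. Here is a minimal Python sketch, assuming the downloaded CSV has a column holding the virus name. The column name `Species`, the matching patterns, and the example rows are all hypothetical, so check the header and naming in your actual file before relying on it.

```python
import csv
import io
import re

# Patterns assumed to mark a row as SARS-CoV-2; adjust to the names used in your file.
# The \b stops "coronavirus 2" from matching names such as "coronavirus 229E".
SARS_COV_2_PATTERN = re.compile(r"sars-cov-2|coronavirus 2\b|2019-ncov", re.IGNORECASE)


def is_sars_cov_2(name: str) -> bool:
    """Return True if the virus name looks like SARS-CoV-2."""
    return SARS_COV_2_PATTERN.search(name) is not None


def drop_sars_cov_2(rows, name_column="Species"):
    """Return only the rows whose name column does not look like SARS-CoV-2."""
    return [row for row in rows if not is_sars_cov_2(row[name_column])]


# Made-up rows for illustration only; real input would be the downloaded CSV file.
example_csv = """Accession,Species
ACC_1,Severe acute respiratory syndrome coronavirus 2
ACC_2,SARS coronavirus Tor2
ACC_3,Bat SARS-like coronavirus
"""

rows = list(csv.DictReader(io.StringIO(example_csv)))
kept = drop_sars_cov_2(rows)
kept_accessions = [row["Accession"] for row in kept]
```

Matching on names is fragile, so it is worth eyeballing the rows that were dropped as well as the ones that were kept.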

Yes, I missed that important detail. That said, the same tool can be used for sars-cov as well. If this is what the OP is looking for, then entering the taxid with the datasets command will do the trick. I will update my response.

2.3 years ago
GenoMax

peifei0418 : Since genome sequencing was not as prevalent in the early 2000s, you are not going to find hundreds of genomes of the original coronavirus. There are only 2 entries for Bat Coronaviruses here.

Oh I see. So this means that there will be very few actual SARS-CoV strain sequences/genomes?

That is likely going to be the case.