NCBI's Influenza Virus Database Programmatic/API Access
12 months ago

I would like to retrieve nucleotide FASTA and metadata files from NCBI's Influenza Virus Database programmatically, e.g. through an API.

I've tried using E-utilities, but the Influenza Virus Database is not available as a database to search. Has anyone worked out the E-utilities parameters (the part after the base URL, e.g. database, search terms, etc.) needed to get results out of the Influenza Virus Database?

If that is possible, I would also appreciate information on how to filter by country/region in the E-utilities URL.

I've also looked at the FTP option on their website; however, the FTP site has not been updated since ~Oct 2020 for some reason. :(

Thank you in advance for any advice and help!

ncbi e-utility API influenza • 1.1k views
12 months ago
MirianT_NCBI ▴ 720

Good morning,

You can access virus data and metadata through the web using the NCBI Virus page, or programmatically using NCBI Datasets. The NCBI Virus data package includes the genomic FASTA sequences plus metadata in JSON Lines format. To download the data package, run the following commands. I'm including one for each type of influenza, but NCBI Virus also has more specific classifications, such as Influenza A virus (A/equine/Newmarket/1/1993(H3N8)).

datasets download virus genome taxon 197911 --filename alphainfluenza.zip
datasets download virus genome taxon 197912 --filename betainfluenza.zip
datasets download virus genome taxon 197913 --filename gammainfluenza.zip
datasets download virus genome taxon 1511083 --filename deltainfluenza.zip

If you want to filter the genomes by country, you can use the --geo-location flag with the country of interest. For example:

datasets download virus genome taxon 1511083 --filename deltainfluenza_USA.zip --geo-location USA
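
The per-genus commands and the --geo-location filter can be combined in a small loop. The following is only a sketch: the function name and the DATASETS_BIN variable are made up for illustration (the variable just lets you swap in `echo` for a dry run), while the taxon IDs, file name pattern, and flags are the ones shown above.

```shell
#!/usr/bin/env bash
# Sketch: download all four influenza genera for one country.
# DATASETS_BIN is a hypothetical override; it defaults to the real `datasets` CLI.
download_influenza_genera() {
    local country="$1"
    local cli="${DATASETS_BIN:-datasets}"
    local pair taxid name
    for pair in 197911:alphainfluenza 197912:betainfluenza \
                197913:gammainfluenza 1511083:deltainfluenza; do
        taxid="${pair%%:*}"   # text before the colon
        name="${pair#*:}"     # text after the colon
        "$cli" download virus genome taxon "$taxid" \
            --filename "${name}_${country}.zip" \
            --geo-location "$country" || return 1
    done
}

# Dry run (prints the arguments instead of downloading):
#   DATASETS_BIN=echo download_influenza_genera USA
# Real run:
#   download_influenza_genera USA
```

Stopping at the first failed download (`|| return 1`) is a deliberate choice here, so that a partial set of archives is not mistaken for a complete one.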

When you unzip those files, you should find all the genomes in the file genomic.fna and all the metadata in data_report.jsonl. NCBI Datasets has a companion tool called dataformat that can convert JSON Lines to TSV. You can read more about the virus-genome fields on this page.

Archive:  deltainfluenza.zip
  inflating: delta/README.md         
  inflating: delta/ncbi_dataset/data/data_report.jsonl  
  inflating: delta/ncbi_dataset/data/genomic.fna  
  inflating: delta/ncbi_dataset/data/virus_dataset.md  
  inflating: delta/ncbi_dataset/data/dataset_catalog.json
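
As a quick sanity check after unzipping, you can list the accessions present in genomic.fna and count them. This is a minimal sketch that assumes each FASTA header starts with the accession followed by whitespace and a description; the function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print the first whitespace-delimited token of every FASTA header,
# with the leading ">" removed (assumed to be the accession).
fasta_accessions() {
    grep '^>' "$1" | sed -e 's/^>//' -e 's/[[:space:]].*$//'
}

# Examples, using the path from the unzip listing above:
#   fasta_accessions delta/ncbi_dataset/data/genomic.fna | head
#   fasta_accessions delta/ncbi_dataset/data/genomic.fna | wc -l
```

The count should agree with the number of lines in data_report.jsonl if the report has one record per sequence, which is worth verifying on your own download.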

Please let me know if you run into any issues. I hope this helps!

12 months ago
GenoMax 142k

Viral genome sequences are available via FTP here (updated last week): https://ftp.ncbi.nih.gov/genomes/Viruses/AllNucleotide/

In the associated metadata file, https://ftp.ncbi.nih.gov/genomes/Viruses/AllNuclMetadata/AllNuclMetadata.csv.gz, I see plenty of Influenza entries:

$ zgrep Infl AllNuclMetadata.csv.gz
NC_036615.1,,"Hause,B.M., Ducatez,M., Collin,E.A., Ran,Z., Liu,R., Sheng,Z., Armien,A., Kaplan,B., Chakravarty,S., Hoppe,A.D., Webby,R.J., Simonson,R.R., Li,F.",2018-01-16T00:00:00Z,Influenza D virus,Deltainfluenzavirus,Orthomyxoviridae,ssRNA(-),2330,RefSeq,complete,,2,1,USA,,Sus scrofa,,2011-03-21,,"Influenza D virus (D/swine/Oklahoma/1334/2011) segment 2 polymerase PB1 (PB1) gene, complete cds"
NC_036616.1,,"Hause,B.M., Ducatez,M., Collin,E.A., Ran,Z., Liu,R., Sheng,Z., Armien,A., Kaplan,B., Chakravarty,S., Hoppe,A.D., Webby,R.J., Simonson,R.R., Li,F.",2018-01-16T00:00:00Z,Influenza D virus,Deltainfluenzavirus,Orthomyxoviridae,ssRNA(-),2364,RefSeq,complete,,1,1,USA,,Sus scrofa,,2011-03-21,,"Influenza D virus (D/swine/Oklahoma/1334/2011) segment 1 polymerase PB2 (PB2) gene, complete cds"
NC_036617.1,,"Hause,B.M., Ducatez,M., Collin,E.A., Ran,Z., Liu,R., Sheng,Z., Armien,A., Kaplan,B., Chakravarty,S., Hoppe,A.D., Webby,R.J., Simonson,R.R., Li,F.",2018-01-16T00:00:00Z,Influenza D virus,Deltainfluenzavirus,Orthomyxoviridae,ssRNA(-),1775,RefSeq,complete,,5,1,USA,,Sus scrofa,,2011-03-21,,"Influenza D virus (D/swine/Oklahoma/1334/2011) segment 5 nucleoprotein (NP) gene, complete cds"
NC_036618.1,,"Hause,B.M., Ducatez,M., Collin,E.A., Ran,Z., Liu,R., Sheng,Z., Armien,A., Kaplan,B., Chakravarty,S., Hoppe,A.D., Webby,R.J., Simonson,R.R., Li,F.",2018-01-16T00:00:00Z,Influenza D virus,Deltainfluenzavirus,Orthomyxoviridae,ssRNA(-),2049,RefSeq,complete,,4,1,USA,,Sus scrofa,,2011-03-21,,"Influenza D virus (D/swine/Oklahoma/1334/2011) segment 4 hemagglutinin-esterase precursor (HEF) gene, complete cds"
NC_036619.1,,"Hause,B.M., Ducatez,M., Collin,E.A., Ran,Z., Liu,R., Sheng,Z., Armien,A., Kaplan,B., Chakravarty,S., Hoppe,A.D., Webby,R.J., Simonson,R.R., Li,F.",2018-01-16T00:00:00Z,Influenza D virus,Deltainfluenzavirus,Orthomyxoviridae,ssRNA(-),2195,RefSeq,complete,,3,1,USA,,Sus scrofa,,2011-03-21,,"Influenza D virus (D/swine/Oklahoma/1334/2011) segment 3 polymerase 3 (P3) gene, complete cds"

$ zgrep Influenza AllNuclMetadata.csv.gz | wc -l
1035808
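
To narrow the metadata down to one genus and one country, a rough text filter may be enough. The sketch below is not a real CSV parser: it assumes the genus and country appear as whole comma-delimited fields (as in the rows above) and that the accession is the first field. A quoted author or title field that happens to contain the same text between commas would produce a false match. The function name and output file name are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: list accessions for one genus and one country from the gzipped metadata CSV.
# Purely textual matching; fields are matched as ",value," substrings.
influenza_accessions() {
    local metadata_gz="$1" genus="$2" country="$3"
    gzip -dc "$metadata_gz" \
        | grep -F ",${genus}," \
        | grep -F ",${country}," \
        | cut -d, -f1
}

# Example:
#   influenza_accessions AllNuclMetadata.csv.gz Deltainfluenzavirus USA > delta_usa.acc
```

The resulting accession list can then be used to pull the matching records out of the sequence files in AllNucleotide/.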
12 months ago
GenoMax 142k

Using Entrez Direct (EDirect):

$ esearch -db nuccore -query "Influenza" | esummary | xtract -pattern DocumentSummary -element Caption,Title,SubName,Strain
NM_001252530    Mus musculus solute carrier organic anion transporter family, member 2b1 (Slco2b1), transcript variant 1, mRNA  C57BL/6|7|7     C57BL/6
XM_055582948    PREDICTED: Bubalus carabanensis influenza virus NS1A binding protein (IVNS1ABP), transcript variant X3, mRNA    K-KA32|Philippines|swamp buffalo|5|female|blood|heifer|2021-10-28|Philippine Carabao Center
XM_055582947    PREDICTED: Bubalus carabanensis influenza virus NS1A binding protein (IVNS1ABP), transcript variant X2, mRNA    K-KA32|Philippines|swamp buffalo|5|female|blood|heifer|2021-10-28|Philippine Carabao Center
XM_055582946    PREDICTED: Bubalus carabanensis influenza virus NS1A binding protein (IVNS1ABP), transcript variant X1, mRNA    K-KA32|Philippines|swamp buffalo|5|female|blood|heifer|2021-10-28|Philippine Carabao Center
XM_029146410    PREDICTED: Betta splendens influenza virus NS1A binding protein a (ivns1abpa), mRNA     4
XM_055534554    PREDICTED: Condylostylus longicornis influenza virus NS1A-binding protein homolog (LOC129619314), transcript variant X3, mRNA   FDR11|Unknown|male|whole body|adult|USA: Fort DeRussy Beach Park in Honolulu, HI|bushes|2020-11-24|Megan Porter|Megan Porter, Fleur Lebhardt
XM_055534553    PREDICTED: Condylostylus longicornis influenza virus NS1A-binding protein homolog (LOC129619314), transcript variant X2, mRNA   FDR11|Unknown|male|whole body|adult|USA: Fort DeRussy Beach Park in Honolulu, HI|bushes|2020-11-24|Megan Porter|Megan Porter, Fleur Lebhardt
XM_055534552    PREDICTED: Condylostylus longicornis influenza virus NS1A-binding protein homolog (LOC129619314), transcript variant X1, mRNA   FDR11|Unknown|male|whole body|adult|USA: Fort DeRussy Beach Park in Honolulu, HI|bushes|2020-11-24|Megan Porter|Megan Porter, Fleur Lebhardt
OQ851654        Influenza A virus (A/Pekin duck/California/T2202390/2022(H5N1)) segment 1 polymerase PB2 (PB2) gene, complete cds       A/Pekin duck/California/T2202390/2022|H5N1|T2202390|Anas platyrhynchos|USA: California|1|oropharyngeal swab|14-Nov-2022    A/Pekin duck/California/T2202390/2022
OQ851653        Influenza A virus (A/Pekin duck/California/T2202390/2022(H5N1)) segment 2 polymerase PB1 (PB1) and PB1-F2 protein (PB1-F2) genes, complete cds  A/Pekin duck/California/T2202390/2022|H5N1|T2202390|Anas platyrhynchos|USA: California|2|oropharyngeal swab|14-Nov-2022    A/Pekin duck/California/T2202390/2022

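For the third column (where present), these are the headers. The set of fields varies by record; the Influenza A rows above carry one additional value (T2202390) between serotype and host.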

strain|serotype|host|country|segment|isolation_source|collection_date
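
Because the set of fields differs between records, it can help to pair each value with its name rather than rely on a fixed position. The sketch below takes two pipe-separated strings, names and values, and prints name=value lines. It assumes the names come from the SubType element of the same DocumentSummary (which you would add to the xtract -element list); please verify that against your own esummary output. The function name is made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: pair pipe-separated names with pipe-separated values.
#   $1: names  (e.g. the SubType string)
#   $2: values (e.g. the SubName string)
pair_subfields() {
    awk -v names="$1" -v values="$2" 'BEGIN {
        n = split(names, k, "[|]")
        split(values, v, "[|]")
        # Missing values are printed as empty strings.
        for (i = 1; i <= n; i++) print k[i] "=" v[i]
    }'
}

# Example:
#   pair_subfields 'strain|serotype|host' 'A/x/1|H5N1|Anas platyrhynchos'
```

To filter by country, you could then grep the paired output for lines beginning with country=.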