Fetching a representative genome, gene and protein IDs from NCBI using a list of spp names
1
0
Entering edit mode
16 months ago
venura ▴ 70

Hi,

I would like to bulk retrieve the following information for a list of virus species from NCBI.

  • Representative genome (e.g Machupo mammarenavirus)
  • Link to the genome (https://www.ncbi.nlm.nih.gov/genome/?term=Machupo+mammarenavirus)
  • replicon info

    ( Type Name RefSeq INSDC Size (Kb) GC% Protein Gene Chr S NC_005078.1 AY129248.1 3.44 43.4 2 2 Chr L NC_005079.1 AY358021.2 7.2 41.0 2 2)

  • refseq IDs for chr (S, M, L etc) (NC_005078.1, NC_005079.1)

  • gene and protein IDs found in each segment/chr (Machupo virus segment S - GeneID:2943093 /locus_tag="MACVsSgp1" /db_xref="GeneID:2943093, /protein_id="NP_899212.1")

Is there a way to bulk retrieve this info? I have used efetch and esearch to retrieve sequences before but having a hard time figuring out how to get the above information. Hope someone can help me. Thank you in advance.

NCBI gene protein • 421 views
ADD COMMENT
1
Entering edit mode

Download the assembly summary file for GenBank genomes. Then parse out top level assembly information for viruses from that file. Take a look at column headers to decide what you need.

$ grep Machupo  assembly_summary_genbank.txt 
GCA_000853725.1                         na      11628   11628   Machupo mammarenavirus  strain=Carvallo         latest  Complete Genome Major   Full    2003/08/24      ViralMultiSegProj14931            GCF_000853725.1 identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/853/725/GCA_000853725.1_ViralMultiSegProj14931           ICTV species exemplar

Similar information also available for RefSeq genomes.

$ grep Machupo  assembly_summary_refseq.txt 
GCF_000853725.1 PRJNA485481                     na      11628   11628   Machupo mammarenavirus  strain=Carvallo         latest  Complete Genome Major   Full    2003/08/30      ViralMultiSegProj14931            GCA_000853725.1 identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/853/725/GCF_000853725.1_ViralMultiSegProj14931           ICTV species exemplar
ADD REPLY
0
Entering edit mode

Thank you. very helpful! Now I can retrieve assembly IDs using the method you suggested.

How can I use the assembly ID to obtain all gene IDs , protein IDs, and locustags included under each assembly ID ? I will be using 200 viral genome assemblies as a query.

ADD REPLY
1
Entering edit mode

If you already have the assembly IDs you can download the feature_table.txt file associated with that assembly. It should have the information you are looking for. For example, the feature_table.txt file for Machupo assembly is here. You should be able to automate this by first using esearch and esummary to get the FTP paths and then using the wget or curl commands to download the feature_table.txt files.

ADD REPLY
1
Entering edit mode

This is what @vkkodali is referring to:

$ esearch -db assembly -query "GCF_000853725" | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/853/725/GCF_000853725.1_ViralMultiSegProj14931

This will get you the feature_table file:

$ esearch -db assembly -query "GCF_000853725" | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq | xargs -n 1 sh -c 'wget -r -nd --no-parent -A "*table.txt.gz" "$0"'

$ ls *.txt.gz
GCF_000853725.1_ViralMultiSegProj14931_feature_table.txt.gz
ADD REPLY
0
Entering edit mode

Thank you @ genomax and @vkkodali

cat virus.txt | while read line; do grep ${line} assembly_summary_refseq.txt >> resultfile.txt ; done

Where I have the following viruses inside virus.txt

Allpahuayo mammarenavirus 
Argentinian mammarenavirus 
Bear Canyon mammarenavirus
Brazilian mammarenavirus
Cali mammarenavirus
  

When I use it I get the following error

grep: mammarenavirus: No such file or directory grep: mammarenavirus: No such file or directory grep: Canyon: No such file or directory

So I removed the term 'mammarenavirus' and reran the command. Now I am getting unrelated results in addition to search terms. What am I doing wrong? :(

ADD REPLY
1
Entering edit mode

Try

cat virus.txt | while read line; do grep -w "${line}" assembly_summary_refseq.txt >> resultfile.txt ; done
ADD REPLY
0
Entering edit mode

Appreciate it! Thank you! I went forward and used the extracted assemblies using the following command.

cat assembly.txt | while read line; do esearch -db assembly -query "${line}" | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq | xargs -n 1 sh -c 'wget -r -nd --no-parent -A "*table.txt.gz" "$0"' ; done

It retrieved only one feature file though assembly.txt contained five assembly IDs. Following is the log after running the command.

--2020-03-26 21:25:04-- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/879/015/GCF_000879015.1_ViralMultiSegProj28321 => '.listing' Resolving ftp.ncbi.nlm.nih.gov ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 2607:f220:41e:250::13 Connecting to ftp.ncbi.nlm.nih.gov ftp.ncbi.nlm.nih.gov)|130.14.250.12|:21... connected. Logging in as anonymous ... Logged in! ==> SYST ... done. ==> PWD ... done. ==> TYPE I ... done. ==> CWD (1) /genomes/all/GCF/000/879/015 ... done. ==> PASV ... done. ==> LIST ... done.

.listing [ <=>
] 96 --.-KB/s in 0.002s

2020-03-26 21:25:04 (40.5 KB/s) - '.listing' saved [96]

Removed '.listing'. --2020-03-26 21:25:04-- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/879/015/GCF_000879015.1_ViralMultiSegProj28321/GCF_000879015.1_ViralMultiSegProj28321 => '.listing' ==> CWD (1) /genomes/all/GCF/000/879/015/GCF_000879015.1_ViralMultiSegProj28321 ... done. ==> PASV ... done. ==> LIST ... done.

.listing [ <=>
] 1.66K --.-KB/s in 0.009s

2020-03-26 21:25:05 (176 KB/s) - '.listing' saved [1698]

Removed '.listing'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_assembly_report.txt'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_assembly_stats.txt'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_cds_from_genomic.fna.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_feature_count.txt.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_genomic.fna.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_genomic.gbff.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_genomic.gff.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_genomic.gtf.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_protein.faa.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_protein.gpff.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_translated_cds.faa.gz'. Rejecting 'README.txt'. Rejecting 'annotation_hashes.txt'. Rejecting 'assembly_status.txt'. Rejecting 'md5checksums.txt'. --2020-03-26 21:25:05-- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/879/015/GCF_000879015.1_ViralMultiSegProj28321/GCF_000879015.1_ViralMultiSegProj28321_feature_table.txt.gz => 'GCF_000879015.1_ViralMultiSegProj28321_feature_table.txt.gz.1' ==> CWD not required. ==> PASV ... done. ==> RETR GCF_000879015.1_ViralMultiSegProj28321_feature_table.txt.gz ... done. Length: 446

GCF_000879015.1_ViralMultiSegProj28321_fe 100%[====================================================================================>] 446 --.-KB/s in 0.02s

2020-03-26 21:25:05 (28.4 KB/s) - 'GCF_000879015.1_ViralMultiSegProj28321_feature_table.txt.gz.1' saved [446]

FINISHED --2020-03-26 21:25:05-- Total wall clock time: 1.1s Downloaded: 1 files, 446 in 0.03s (16.1 KB/s)

ADD REPLY
1
Entering edit mode

Take a look at the 'While Loop' subsection here. Essentially, you need to add < /dev/null to the esearch command as follows:

cat assembly.txt \
  | while read line; do 
      esearch -db assembly -query "${line}"  < /dev/null \
      | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq \
      | xargs -n 1 sh -c 'wget -r -nd --no-parent -A "*table.txt.gz" "$0"' ; 
  done
ADD REPLY
0
Entering edit mode

Sweet! Worked like a charm! Thank you @vkkodali

ADD REPLY
0
Entering edit mode
16 months ago
vkkodali ★ 2.6k

You can use the NCBI Assembly portal for this. If you know the organism name or better yet an NCBI Taxonomy identifier, you can query the assembly database with that and click on the 'Representative genomes' filter on the left hand side. Then, use the 'Download Assemblies' button to download data. I recommend exploring the files that you can download as the data you seek are spread out across a few files. For example, gene and protein IDs would be in the GFF3 files; RefSeq identifiers and chromosome identifiers can be found in the assembly_report.txt file and so on.

If you want to do this using command line options, you can use esearch and esummary to get the FTP paths and then download the files of your interest using wget or curl. See: www.biostars.org/p/428504/ for some ideas.

ADD COMMENT

Login before adding your answer.

Traffic: 1326 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6