Question: Fetching a representative genome, gene and protein IDs from NCBI using a list of spp names
0
gravatar for venura
5 days ago by
venura10
venura10 wrote:

Hi,

I would like to bulk retrieve the following information for a list of virus species from NCBI.

  • Representative genome (e.g Machupo mammarenavirus)
  • Link to the genome (https://www.ncbi.nlm.nih.gov/genome/?term=Machupo+mammarenavirus)
  • replicon info

    ( Type Name RefSeq INSDC Size (Kb) GC% Protein Gene Chr S NC_005078.1 AY129248.1 3.44 43.4 2 2 Chr L NC_005079.1 AY358021.2 7.2 41.0 2 2)

  • refseq IDs for chr (S, M, L etc) (NC_005078.1, NC_005079.1)

  • gene and protein IDs found in each segment/chr (Machupo virus segment S - GeneID:2943093 /locus_tag="MACVsSgp1" /db_xref="GeneID:2943093, /protein_id="NP_899212.1")

Is there a way to bulk retrieve this info? I have used efetch and esearch to retrieve sequences before but having a hard time figuring out how to get the above information. Hope someone can help me. Thank you in advance.

protein gene ncbi • 114 views
ADD COMMENTlink modified 5 days ago by vkkodali1.9k • written 5 days ago by venura10
1

Download the assembly summary file for GenBank genomes. Then parse out top level assembly information for viruses from that file. Take a look at column headers to decide what you need.

$ grep Machupo  assembly_summary_genbank.txt 
GCA_000853725.1                         na      11628   11628   Machupo mammarenavirus  strain=Carvallo         latest  Complete Genome Major   Full    2003/08/24      ViralMultiSegProj14931            GCF_000853725.1 identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/853/725/GCA_000853725.1_ViralMultiSegProj14931           ICTV species exemplar

Similar information also available for RefSeq genomes.

$ grep Machupo  assembly_summary_refseq.txt 
GCF_000853725.1 PRJNA485481                     na      11628   11628   Machupo mammarenavirus  strain=Carvallo         latest  Complete Genome Major   Full    2003/08/30      ViralMultiSegProj14931            GCA_000853725.1 identical       ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/853/725/GCF_000853725.1_ViralMultiSegProj14931           ICTV species exemplar
ADD REPLYlink modified 5 days ago • written 5 days ago by genomax80k

Thank you. very helpful! Now I can retrieve assembly IDs using the method you suggested.

How can I use the assembly ID to obtain all gene IDs , protein IDs, and locustags included under each assembly ID ? I will be using 200 viral genome assemblies as a query.

ADD REPLYlink modified 5 days ago • written 5 days ago by venura10
1

If you already have the assembly IDs you can download the feature_table.txt file associated with that assembly. It should have the information you are looking for. For example, the feature_table.txt file for Machupo assembly is here. You should be able to automate this by first using esearch and esummary to get the FTP paths and then using the wget or curl commands to download the feature_table.txt files.

ADD REPLYlink modified 5 days ago • written 5 days ago by vkkodali1.9k
1

This is what @vkkodali is referring to:

$ esearch -db assembly -query "GCF_000853725" | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/853/725/GCF_000853725.1_ViralMultiSegProj14931

This will get you the feature_table file:

$ esearch -db assembly -query "GCF_000853725" | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq | xargs -n 1 sh -c 'wget -r -nd --no-parent -A "*table.txt.gz" "$0"'

$ ls *.txt.gz
GCF_000853725.1_ViralMultiSegProj14931_feature_table.txt.gz
ADD REPLYlink modified 5 days ago • written 5 days ago by genomax80k

Thank you @ genomax and @vkkodali

cat virus.txt | while read line; do grep ${line} assembly_summary_refseq.txt >> resultfile.txt ; done

Where I have the following viruses inside virus.txt

Allpahuayo mammarenavirus 
Argentinian mammarenavirus 
Bear Canyon mammarenavirus
Brazilian mammarenavirus
Cali mammarenavirus
  

When I use it I get the following error

grep: mammarenavirus: No such file or directory grep: mammarenavirus: No such file or directory grep: Canyon: No such file or directory

So I removed the term 'mammarenavirus' and reran the command. Now I am getting unrelated results in addition to search terms. What am I doing wrong? :(

ADD REPLYlink written 5 days ago by venura10
1

Try

cat virus.txt | while read line; do grep -w "${line}" assembly_summary_refseq.txt >> resultfile.txt ; done
ADD REPLYlink written 5 days ago by genomax80k

Appreciate it! Thank you! I went forward and used the extracted assemblies using the following command.

cat assembly.txt | while read line; do esearch -db assembly -query "${line}" | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq | xargs -n 1 sh -c 'wget -r -nd --no-parent -A "*table.txt.gz" "$0"' ; done

It retrieved only one feature file though assembly.txt contained five assembly IDs. Following is the log after running the command.

--2020-03-26 21:25:04-- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/879/015/GCF_000879015.1_ViralMultiSegProj28321 => '.listing' Resolving ftp.ncbi.nlm.nih.gov ftp.ncbi.nlm.nih.gov)... 130.14.250.12, 2607:f220:41e:250::13 Connecting to ftp.ncbi.nlm.nih.gov ftp.ncbi.nlm.nih.gov)|130.14.250.12|:21... connected. Logging in as anonymous ... Logged in! ==> SYST ... done. ==> PWD ... done. ==> TYPE I ... done. ==> CWD (1) /genomes/all/GCF/000/879/015 ... done. ==> PASV ... done. ==> LIST ... done.

.listing [ <=>
] 96 --.-KB/s in 0.002s

2020-03-26 21:25:04 (40.5 KB/s) - '.listing' saved [96]

Removed '.listing'. --2020-03-26 21:25:04-- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/879/015/GCF_000879015.1_ViralMultiSegProj28321/GCF_000879015.1_ViralMultiSegProj28321 => '.listing' ==> CWD (1) /genomes/all/GCF/000/879/015/GCF_000879015.1_ViralMultiSegProj28321 ... done. ==> PASV ... done. ==> LIST ... done.

.listing [ <=>
] 1.66K --.-KB/s in 0.009s

2020-03-26 21:25:05 (176 KB/s) - '.listing' saved [1698]

Removed '.listing'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_assembly_report.txt'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_assembly_stats.txt'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_cds_from_genomic.fna.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_feature_count.txt.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_genomic.fna.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_genomic.gbff.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_genomic.gff.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_genomic.gtf.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_protein.faa.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_protein.gpff.gz'. Rejecting 'GCF_000879015.1_ViralMultiSegProj28321_translated_cds.faa.gz'. Rejecting 'README.txt'. Rejecting 'annotation_hashes.txt'. Rejecting 'assembly_status.txt'. Rejecting 'md5checksums.txt'. --2020-03-26 21:25:05-- ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/879/015/GCF_000879015.1_ViralMultiSegProj28321/GCF_000879015.1_ViralMultiSegProj28321_feature_table.txt.gz => 'GCF_000879015.1_ViralMultiSegProj28321_feature_table.txt.gz.1' ==> CWD not required. ==> PASV ... done. ==> RETR GCF_000879015.1_ViralMultiSegProj28321_feature_table.txt.gz ... done. Length: 446

GCF_000879015.1_ViralMultiSegProj28321_fe 100%[====================================================================================>] 446 --.-KB/s in 0.02s

2020-03-26 21:25:05 (28.4 KB/s) - 'GCF_000879015.1_ViralMultiSegProj28321_feature_table.txt.gz.1' saved [446]

FINISHED --2020-03-26 21:25:05-- Total wall clock time: 1.1s Downloaded: 1 files, 446 in 0.03s (16.1 KB/s)

ADD REPLYlink modified 5 days ago • written 5 days ago by venura10
1

Take a look at the 'While Loop' subsection here. Essentially, you need to add < /dev/null to the esearch command as follows:

cat assembly.txt \
  | while read line; do 
      esearch -db assembly -query "${line}"  < /dev/null \
      | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq \
      | xargs -n 1 sh -c 'wget -r -nd --no-parent -A "*table.txt.gz" "$0"' ; 
  done
ADD REPLYlink written 4 days ago by vkkodali1.9k

Sweet! Worked like a charm! Thank you @vkkodali

ADD REPLYlink written 4 days ago by venura10
0
gravatar for vkkodali
5 days ago by
vkkodali1.9k
United States
vkkodali1.9k wrote:

You can use the NCBI Assembly portal for this. If you know the organism name or better yet an NCBI Taxonomy identifier, you can query the assembly database with that and click on the 'Representative genomes' filter on the left hand side. Then, use the 'Download Assemblies' button to download data. I recommend exploring the files that you can download as the data you seek are spread out across a few files. For example, gene and protein IDs would be in the GFF3 files; RefSeq identifiers and chromosome identifiers can be found in the assembly_report.txt file and so on.

If you want to do this using command line options, you can use esearch and esummary to get the FTP paths and then download the files of your interest using wget or curl. See: www.biostars.org/p/428504/ for some ideas.

ADD COMMENTlink written 5 days ago by vkkodali1.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1048 users visited in the last hour