NCBI E-eutilitis not working properly inside a while loop
3
1
Entering edit mode
2.3 years ago

Hello everyone

I have the following loop to get the TaxID from a list (aaa.txt) of Assembly Accession.

head aaa.txt
GCF_016924235.1
GCF_003999975.1
GCF_002993365.1
GCF_017084525.1
GCF_006496635.1
GCF_017104785.1

The loop:

cat aaa.txt | while read p; do 
    echo $p; 
    esearch -db assembly -query $p | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession,Taxid > out.txt; 
done;

I can't figure out why the loop stops after the first line iteration.

Thank you

A

E-eutilitis NCBI • 1.9k views
ADD COMMENT
4
Entering edit mode

probably your while loop needs /dev/null redirection as explained here: https://www.ncbi.nlm.nih.gov/books/NBK179288/. Otherwise, it will stop at first line/record.

ADD REPLY
2
Entering edit mode

It works!

cat aaa.txt | while read p; do 
    echo $p; 
    esearch -db assembly -query $p < /dev/null | 
    esummary | 
    xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession,Taxid > $p.txt
    sleep 3s 
done;

I have also added a sleep 3s to avoid any problem with NCBI.

Thank you!

ADD REPLY
0
Entering edit mode

Be careful here. Requests in a loop will hammer the server and could get your IP (or that of your institution) banned. I think now the NCBI is rate limiting access to 3 requests per second to avoid this kind of problem but you could still get banned.

ADD REPLY
0
Entering edit mode

I have thousands of accessions. Perhaps I need to find a different solution.

Thank you for warning me.

ADD REPLY
0
Entering edit mode

It may be simple to just get the assembly summary report file and extract the taxID's. They are in column 6 and 7. You can either search by GCA* or GCF* accessions. There are other reports in that directory if you need RefSeq etc.

$ grep GCF_016924235 assembly_summary_genbank.txt 
GCA_016924235.1 PRJNA575502     SAMN17818122    JAFFZP000000000.1       representative genome   2811233 2811233 Amphritea pacifica      strain=RP18W            latest  Scaffold        Major   Full    2021/02/23 ASM1692423v1     Guangdong Institute of Microbiology     GCF_016924235.1 identical       https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/016/924/235/GCA_016924235.1_ASM1692423v1                   na
ADD REPLY
0
Entering edit mode

Yup, that's what I mentioned in my answer. But one caveat with this file is that it only includes assemblies that are current; the older assemblies are in assembly_summary_{genbank,refseq}_historical.txt file in the same directory. If the input list has a mix of current and old assemblies, the best way to go is to first concatenate the two files and then do the grep.

ADD REPLY
0
Entering edit mode

I did not read all the answers completely. Sorry about that. Moving mine to a comment.

I think downloading this file once and doing the searches locally would save NCBI bandwidth/server load, especially if OP has literally thousands of these to look through.

ADD REPLY
0
Entering edit mode
$ grep -f aaa.txt assembly_summary_genbank.txt | awk -F "\t" -v OFS="\t" '{print $18,$6}'
$ parallel --plus grep {} assembly_summary_genbank.txt :::: aaa.txt |  awk -F "\t" -v OFS="\t" '{print $18,$6}'
ADD REPLY
5
Entering edit mode
2.3 years ago
vkkodali_ncbi ★ 3.7k

As @cpad0112 suggested, adding < /dev/null to your esearch is the way to go if you want to use the while loop. I recommend skipping the while loop altogether and use epost as shown below. It is quicker.

$ time cat accs.txt | while read p ; do esearch -db assembly -query $p < /dev/null | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession,Taxid ; done > out.txt
real    0m5.361s
user    0m1.828s
sys     0m0.553s

$ time epost -db assembly -input accs.txt | esummary | xtract -pattern DocumentSummary -def "NA" -element AssemblyAccession,Taxid > out.txt
real    0m1.567s
user    0m0.375s
sys     0m0.150s

If you have a lot of accessions and you are concerned about hitting the rate limit, I suggest you create an NCBI API key. See https://support.nlm.nih.gov/knowledgebase/article/KA-05316/en-us and the "Programmatic Access" section of https://www.ncbi.nlm.nih.gov/books/NBK179288/ for more details.

Finally, if you don't want to deal with EntrezDirect at all because it still takes a long time to process thousands of accessions you have, you can obtain this information from the assembly report files located on the FTP here: ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS. For example, the file assembly_summary_refseq.txt is a tab-delimited table with 23 fields including assembly accession and taxid.

You should be able to join on the assembly accession to this table:

$ join -j 1 <(sort accs.txt) <(grep -v '^#' assembly_summary_refseq.txt | sort -k1,1 -t $'\t') -t $'\t' | cut -f1,6
GCF_002993365.1 2079529
GCF_003999975.1 2486853
GCF_006496635.1 2027405
GCF_016924235.1 2811233
GCF_017084525.1 2812560
GCF_017104785.1 2703894

Note, assembly_summary_refseq.txt contains only the latest assembly accessions. Older data are in assembly_summary_refseq_historical.txt file located in the same FTP path. If the input list has a mix of current and old assemblies, the best way to go is to first concatenate the two files and then do the join.

ADD COMMENT
6
Entering edit mode
2.3 years ago
MirianT_NCBI ▴ 730

Hi,
Another option would be to use the summary option from NCBI Datasets in combination with jq. Here's the command:

datasets summary genome accession --inputfile aaa.txt | jq -r '.assemblies[].assembly | [.assembly_accession,.org.tax_id] | @tsv' > out.txt

From the list you provided as an example, here's the output:

GCF_003999975.1 2486853
GCF_002993365.1 2079529
GCF_016924235.1 2811233
GCF_006496635.1 2027405
GCF_017104785.1 2703894
GCF_017084525.1 2812560

If you prefer the output as csv, you can change it in the last part of the command (@csv instead of @tsv).

Let me know if you have any questions or if you find any issues. :)

ADD COMMENT
2
Entering edit mode
2.3 years ago

Given that there's a limit of 3 requests per second and your command is already making at least two (i.e. esearch and esummary), you won't complete the second iteration unless a full second has passed after the first.

ADD COMMENT

Login before adding your answer.

Traffic: 2626 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6