I have a large list of NCBI accession numbers (~2000) in a TSV file, and I would like to generate a list of TaxId matches. After trying a couple of options, I settled on the bash loop presented in this post. I'll reproduce my slightly modified version here:
#!/bin/bash
for acc in `cat <file.tsv>` ; do
    efetch -format docsum -db nuccore -id $acc \
    | xtract -pattern DocumentSummary -element AccessionVersion,TaxId \
    >> <outfile.txt>
done
When I run this loop, I get the following error:
ERROR in esummary: invalid uid cat at position=0 XM_022318843.1
When I enter this accession into the NCBI search bar, it turns up a valid entry. All other accessions are successfully matched to TaxIds, but the accession named in the error is missing from the output, so the output rows no longer line up with the accessions in the TSV file. Even weirder, if I run the script multiple times, I get the same error, but for a different accession number each time. On my third try, the script ran with no errors, and from what I can tell the output is exactly what I need.
Technically the problem is solved, but this is still a mystery and I would like an answer. If there's some weird quirk of Entrez Direct or my script, I would like to know in case it might cause issues downstream.
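For reference, here is a more defensive variant of the loop. It is a sketch only, written under the unconfirmed assumption that the failures are transient (for example a momentary server-side hiccup or request throttling) rather than a problem with the accessions themselves. The function names `retry`, `lookup_taxid`, and `lookup_all`, the `RETRY_DELAY` variable, and the `NA` placeholder are made up for illustration; only the `efetch` and `xtract` invocations come from the script above. Each lookup is retried a few times, and an accession that still fails gets a placeholder row, so the output stays in the same order as the input. The `< /dev/null` is there because a command inside a `while read` loop can consume the loop's standard input.

```shell
#!/bin/bash

# Hypothetical helper: run a command up to N times, pausing between attempts.
retry() {
    local max_attempts=$1
    shift
    local attempt
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        "$@" && return 0
        # Pause before the next attempt (RETRY_DELAY is an illustrative knob).
        if (( attempt < max_attempts )); then
            sleep "${RETRY_DELAY:-1}"
        fi
    done
    return 1
}

# Look up a single accession. Returns non-zero if nothing came back,
# so that the caller can tell a failed lookup from a successful one.
lookup_taxid() {
    local acc=$1
    local result
    result=$(efetch -format docsum -db nuccore -id "$acc" < /dev/null \
        | xtract -pattern DocumentSummary -element AccessionVersion,TaxId)
    [ -n "$result" ] && printf '%s\n' "$result"
}

# Read accessions from stdin, one per line. If every attempt fails,
# write "<accession><TAB>NA" so the row order still matches the input.
lookup_all() {
    local acc
    while IFS= read -r acc; do
        [ -z "$acc" ] && continue
        retry 3 lookup_taxid "$acc" || printf '%s\tNA\n' "$acc"
    done
}

# Usage (file names are placeholders):
#   lookup_all < accessions.tsv > outfile.txt
```

Any `NA` rows left in the output can then be re-run or checked by hand, instead of being discovered later as an off-by-one mismatch between the two files.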