Hello,
I have a large list of NCBI accession numbers (~2000) in a TSV file, and I would like to generate a list of TaxId matches. After trying a couple of options, I settled on the bash loop presented in this post. I'll reproduce my slightly modified version here:
#!/bin/bash
for acc in `cat <file.tsv>` ; do
efetch -format docsum -db nuccore -id $acc \
| xtract -pattern DocumentSummary -element AccessionVersion,TaxId \
>> <outfile.txt>;
done
When I run this loop, I get the following error:
ERROR in esummary: invalid uid cat at position=0 XM_022318843.1
When I enter this accession into the NCBI search bar, it turns up a valid entry. All other accessions are successfully matched to TaxIds, but the accession named in the error is left out of the output, so the output no longer lines up with the order of accessions in the TSV file. Even weirder, if I run the script multiple times, I get the same error, but for different accession numbers. On my third try, the script ran with no errors and, from what I can tell, the output is exactly what I need.
Technically the problem is solved, but this is still a mystery and I would like an answer. If there's some weird quirk of Entrez Direct or my script, I would like to know in case it might cause issues downstream.
I ran:
So it works fine - I'm guessing your input is malformed. Try running
grep "XM_022318843.1" file.tsv | cat -te
and see if it prints any unexpected invisible characters.
Hi. As I mentioned in the question, when I ran the script a second time I didn't get an error for that accession, it was matched successfully. I assumed based on this that my file was formatted correctly, but that there's an issue within Entrez Direct that's causing certain accessions to be dropped, seemingly at random.
I apologize for not reading your question well. As you say, it does seem to be a server-side problem. The API key that genomax mentions should solve your problem.
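If the failures are intermittent and server-side, one way to keep the output in the same order as the input is to retry each lookup a few times and write a placeholder line when it still fails. The following is only a rough sketch, not a tested fix for this exact error. It assumes one accession per line in the input file. The function names (retry, lookup_taxid, annotate_file), the MAX_TRIES and RETRY_DELAY variables, the NA placeholder, and the file names are all made up for illustration. Redirecting efetch's stdin from /dev/null is a precaution so that it cannot consume the rest of the list inside the read loop. As far as I know, Entrez Direct picks up an API key from the NCBI_API_KEY environment variable, so that line is shown as an optional, commented-out step.

```shell
#!/bin/bash
# Optional: an API key raises the allowed request rate (placeholder value).
# export NCBI_API_KEY="your_key_here"

# retry MAX DELAY CMD...: run CMD until it succeeds AND prints something,
# at most MAX times, sleeping DELAY seconds between attempts.
retry() {
  local max="$1" delay="$2" attempt=1 out
  shift 2
  while [ "$attempt" -le "$max" ]; do
    if out=$("$@") && [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    attempt=$((attempt + 1))
    # Only pause when another attempt will follow.
    if [ "$attempt" -le "$max" ]; then sleep "$delay"; fi
  done
  return 1
}

# lookup_taxid ACC: same efetch | xtract pipeline as in the question.
# The pipeline's exit status is xtract's, so a failed efetch is detected
# by retry through the empty output rather than the exit code.
lookup_taxid() {
  efetch -format docsum -db nuccore -id "$1" < /dev/null \
    | xtract -pattern DocumentSummary -element AccessionVersion,TaxId
}

# annotate_file IN OUT: one output line per non-empty input line, in order.
annotate_file() {
  local acc cr
  cr=$(printf '\r')
  while IFS= read -r acc || [ -n "$acc" ]; do
    acc=${acc%"$cr"}             # drop a trailing carriage return, if any
    [ -n "$acc" ] || continue    # skip blank lines
    # Accessions that still fail are recorded as NA instead of being dropped.
    retry "${MAX_TRIES:-3}" "${RETRY_DELAY:-2}" lookup_taxid "$acc" \
      || printf '%s\tNA\n' "$acc"
  done < "$1" > "$2"
}

# Usage (hypothetical file names):
#   annotate_file accessions.tsv taxids.txt
```

Afterwards, something like grep -c NA on the output file shows how many accessions never resolved, and those can be rerun on their own.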