Entrez Direct not recognizing specific NCBI Accession numbers
3.9 years ago
Myzus • 0

Hello,

I have a large list of NCBI accession numbers (~2000) in a TSV file, and I would like to generate a list of TaxId matches. After trying a couple of options, I settled on the bash loop presented in this post. I'll reproduce my slightly modified version here:

#!/bin/bash

# <file.tsv> and <outfile.txt> are placeholders for the real paths
for acc in $(cat <file.tsv>); do
  efetch -format docsum -db nuccore -id "$acc" \
        | xtract -pattern DocumentSummary -element AccessionVersion,TaxId \
        >> <outfile.txt>
done
done

When I run this loop, I get the following error:

ERROR in esummary: invalid uid cat at position=0 XM_022318843.1

When I enter this accession into the NCBI search bar, it turns up a valid entry. All other accessions are successfully matched to TaxIds, but the accession named in the error is left out, so the output no longer lines up with the order of accessions in the TSV file. Even weirder, if I run the script multiple times, I get the same error but for different accession numbers. On my third try, the script ran with no errors, and from what I can tell the output is exactly what I need.

Technically the problem is solved, but this is still a mystery and I would like an answer. If there's some weird quirk of Entrez Direct or my script, I would like to know in case it might cause issues downstream.

entrez direct ncbi • 1.7k views

I ran:

$ efetch -format docsum -db nuccore -id XM_022318843.1

So it works fine - I'm guessing your input is malformed. Try running grep "XM_022318843.1" file.tsv | cat -te and see if it prints any unexpected invisible characters.
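
To make that check concrete, here is a minimal sketch of what a stray carriage return looks like and one way to strip it. The file names (acc_demo.tsv, acc_clean.txt) and the sample contents are made up for illustration and are not the poster's actual data.

```shell
#!/bin/sh
# Hypothetical demo file: the second accession carries a Windows-style
# carriage return (\r), which is invisible in most editors.
workdir=$(mktemp -d)
printf 'XM_022318845.1\nXM_022318843.1\r\n' > "$workdir/acc_demo.tsv"

# cat -te marks each line end with $; a stray \r shows up as ^M before it.
grep "XM_022318843.1" "$workdir/acc_demo.tsv" | cat -te

# Strip carriage returns and trailing spaces/tabs before passing IDs to efetch.
tr -d '\r' < "$workdir/acc_demo.tsv" | sed 's/[[:space:]]*$//' > "$workdir/acc_clean.txt"

# After cleaning, every line should end in a bare $.
cat -te "$workdir/acc_clean.txt"
```

If the cleaned file and the original differ, the input had hidden characters; if they are identical, the file is probably not the cause.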

Hi. As I mentioned in the question, when I ran the script a second time I didn't get an error for that accession, it was matched successfully. I assumed based on this that my file was formatted correctly, but that there's an issue within Entrez Direct that's causing certain accessions to be dropped, seemingly at random.

I apologize for not reading your question well. As you say, it does seem to be a server-side problem. The API key that genomax mentions should solve your problem.
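
If the dropouts really are transient, one way to keep the output aligned with the input is to retry each lookup and write a placeholder row when it still fails. The sketch below is only an illustration: `retry` and `RETRY_DELAY` are hypothetical names, not part of Entrez Direct, and the check for empty output is there because I am not assuming efetch returns a non-zero exit status on this error.

```shell
#!/bin/sh
# retry MAX CMD...: run CMD up to MAX times and print its output once it
# succeeds with non-empty output. Hypothetical helper, not an EDirect command.
retry() {
  max=$1; shift
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    if out=$("$@") && [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-1}"   # pause between attempts (seconds)
  done
  return 1
}

# Assumed usage with the loop from the question (needs network access, so it
# is left commented out). A lookup that still fails writes an "NA" row, so
# the output keeps the same order and length as the input:
#
# for acc in $(cat acc.txt); do
#   retry 3 sh -c 'efetch -format docsum -db nuccore -id "$1" |
#     xtract -pattern DocumentSummary -element AccessionVersion,TaxId' _ "$acc" \
#     || printf '%s\tNA\n' "$acc"
# done > outfile.txt
```

Rows marked NA can then be re-run on their own instead of repeating the whole list.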

3.9 years ago
GenoMax 142k

You should have one accession per line in the input file. The epost method is the best option when submitting many queries at once.

$ cat acc.txt

XM_022318843
XM_022318845
XM_022328847

$ epost -db nuccore -input acc.txt -format acc | efetch -format docsum -db nuccore  | xtract -pattern DocumentSummary -element AccessionVersion,TaxId

XM_022328847.1  108931
XM_022318845.1  13164
XM_022318843.1  13164

Thanks, I tried using epost but it removes duplicate entries from my input file, which causes problems for me further down the line. I couldn't figure out why this is or how to fix it, hence the bash loop.
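
A possible workaround, sketched under some assumptions: run the epost pipeline once to get a deduplicated accession-to-TaxId table, then look up every line of the original list in that table locally. Here map.tsv stands in for the tab-separated output of the epost | efetch | xtract command above, and the file names and the repeated entry are invented for the demo.

```shell
#!/bin/sh
workdir=$(mktemp -d)

# Original list with duplicates and order intact (IDs taken from this thread;
# the repeated first entry is added here for illustration).
printf 'XM_022318843.1\nXM_022328847.1\nXM_022318843.1\n' > "$workdir/acc.txt"

# Stand-in for the epost | efetch | xtract output, which comes back
# deduplicated and not necessarily in input order.
printf 'XM_022328847.1\t108931\nXM_022318843.1\t13164\n' > "$workdir/map.tsv"

# Emit one row per original line, in the original order; accessions missing
# from the table get NA so the row count still matches the input.
awk -F '\t' '
  { key = $1; sub(/\.[0-9]+$/, "", key) }   # compare without the .version suffix
  NR == FNR { tax[key] = $2; next }         # first file: build accession -> TaxId
  { print $1 "\t" ((key in tax) ? tax[key] : "NA") }
' "$workdir/map.tsv" "$workdir/acc.txt" > "$workdir/acc_taxid.tsv"

cat "$workdir/acc_taxid.tsv"
```

This keeps the single batched request while restoring the duplicates on the local side.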

Have you signed up for an NCBI_API_KEY? If you have not, you should do that first. Instead of using a tab-separated value file (if that is what .tsv refers to in your script), use one value per line and see if that helps.
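
For reference, a minimal sketch of both suggestions. Entrez Direct picks the key up from the NCBI_API_KEY environment variable, and NCBI allows more requests per second with a key than without one. The key value, the file names, and the assumption that the accession sits in the first column are placeholders for illustration.

```shell
#!/bin/sh
# Placeholder value: substitute the key generated in your NCBI account settings.
export NCBI_API_KEY="your_key_here"

workdir=$(mktemp -d)
# Hypothetical TSV with the accession in column 1 and other data after it.
printf 'XM_022318843.1\tsample_a\nXM_022318845.1\tsample_b\n' > "$workdir/file.tsv"

# Keep only the first column so each line holds exactly one accession.
cut -f 1 "$workdir/file.tsv" > "$workdir/acc.txt"
cat "$workdir/acc.txt"
```

Putting the export line in ~/.bashrc (or the equivalent for your shell) makes the key available in every session.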

I have not signed up for a key. Thanks for pointing this out, I'm very new to the field and the NCBI documentation is a bit dense.
