Question

Entrez Direct not recognizing specific NCBI Accession numbers

3.9 years ago

Myzus • 0

Hello,

I have a large list of NCBI accession numbers (~2000) in a TSV file, and I would like to generate a list of TaxId matches. After trying a couple of options, I settled on the bash loop presented in this post. I'll reproduce my slightly modified version here:

#!/bin/bash

# Loop over every whitespace-separated token in the input file.
for acc in $(cat <file.tsv>); do
  efetch -format docsum -db nuccore -id "$acc" \
        | xtract -pattern DocumentSummary -element AccessionVersion,TaxId \
        >> <outfile.txt>
done

When I run this loop, I get the following error:

ERROR in esummary: invalid uid cat at position=0 XM_022318843.1

When I enter this accession into the NCBI search bar, it returns a valid entry. All other accessions are successfully matched to TaxIds, but the accession named in the error is left out, so the output no longer lines up with the order of accessions in the TSV file. Even weirder, if I run the script multiple times, I get the same error, but for different accession numbers. On my third try, the script ran with no errors and, from what I can tell, the output is exactly what I need.

Technically the problem is solved, but this is still a mystery and I would like an answer. If there's some weird quirk of Entrez Direct or my script, I would like to know in case it might cause issues downstream.

entrez direct ncbi • 1.7k views

ADD COMMENT • link updated 3.9 years ago by GenoMax 142k • written 3.9 years ago by Myzus • 0

I ran:

➜ efetch -format docsum -db nuccore -id XM_022318843.1

That works fine for me, so I'm guessing your input is malformed. Try running grep "XM_022318843.1" file.tsv | cat -te and see if it prints any unexpected invisible characters.

ADD REPLY • link 3.9 years ago by Ram 43k
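
The invisible-character check suggested in the reply above can be sketched as follows. This is a minimal illustration, not a diagnosis of the asker's file: sample.tsv and sample.clean.tsv are hypothetical file names, and the sample deliberately contains one Windows-style line ending so the check has something to find.

```shell
# Hypothetical sample: the first line ends in a carriage return (\r).
printf 'XM_022318843.1\r\nXM_022318845.1\n' > sample.tsv

# Hold a literal carriage return in a variable (portable across POSIX shells).
cr=$(printf '\r')

# Count lines that end in a carriage return.
cr_lines=$(grep -c "${cr}\$" sample.tsv)
echo "lines ending in CR: $cr_lines"

# Write a cleaned copy: strip carriage returns and trailing whitespace.
tr -d '\r' < sample.tsv | sed 's/[[:space:]]*$//' > sample.clean.tsv
```

On a real file, a non-zero count would mean some accessions reach efetch with a trailing carriage return attached, which is one way an otherwise valid accession could be rejected.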

Hi. As I mentioned in the question, when I ran the script a second time I didn't get an error for that accession; it was matched successfully. Based on this, I assumed that my file was formatted correctly and that an issue within Entrez Direct is causing certain accessions to be dropped, seemingly at random.

ADD REPLY • link 3.9 years ago by Myzus • 0

I apologize for not reading your question carefully. As you say, it does seem to be a server-side problem. The API key that GenoMax mentions should solve your problem.

ADD REPLY • link 3.9 years ago by Ram 43k
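
If the failures are transient and server-side, as suggested above, one common workaround is to retry each lookup a few times before giving up. The sketch below assumes a failed lookup shows up as a non-zero exit status or as empty output, which has not been verified for efetch. The names retry and RETRY_DELAY are hypothetical and introduced only for this illustration.

```shell
# retry MAX CMD [ARGS...]: run CMD until it exits 0 with non-empty output,
# giving up after MAX attempts. Hypothetical helper, not part of Entrez Direct.
retry() {
  max=$1
  shift
  attempt=1
  while [ "$attempt" -le "$max" ]; do
    # Wait before every attempt except the first (RETRY_DELAY is an assumed knob, in seconds).
    if [ "$attempt" -gt 1 ]; then
      sleep "${RETRY_DELAY:-1}"
    fi
    # Capture stdout; a non-zero exit or empty output counts as a failure.
    if out=$("$@") && [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    attempt=$((attempt + 1))
  done
  echo "giving up after $max attempts: $*" >&2
  return 1
}

# Intended use inside the loop from the question (not run here):
#   retry 3 efetch -format docsum -db nuccore -id "$acc" \
#     | xtract -pattern DocumentSummary -element AccessionVersion,TaxId
```

Because the helper prints nothing on stdout when every attempt fails, a caller that needs to preserve row order could test its exit status and write a placeholder row for that accession.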

score 1 · Answer 1 · 2020-07-11

3.9 years ago

GenoMax 142k

You should have one accession per line in the input file. Using the epost method is the best option when submitting multiple queries.

$ cat acc.txt

XM_022318843
XM_022318845
XM_022328847

$ epost -db nuccore -input acc.txt -format acc | efetch -format docsum -db nuccore | xtract -pattern DocumentSummary -element AccessionVersion,TaxId

XM_022328847.1  108931
XM_022318845.1  13164
XM_022318843.1  13164

ADD COMMENT • link 3.9 years ago by GenoMax 142k

Thanks, I tried using epost but it removes duplicate entries from my input file, which causes problems for me further down the line. I couldn't figure out why this is or how to fix it, hence the bash loop.

ADD REPLY • link 3.9 years ago by Myzus • 0
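
One possible way around the deduplication, sketched under assumptions: run the batch lookup once, save its output, then use the original list to put the results back in their original order, repeating rows for repeated accessions. The file names below are hypothetical, the TaxIds are the ones shown in the answer above, and the sketch assumes the input list uses versioned accessions (for example XM_022318843.1) so that they match the AccessionVersion column exactly.

```shell
# taxids.txt: stands in for the saved, tab-separated output of the
# epost | efetch | xtract pipeline (deduplicated, arbitrary order).
printf 'XM_022328847.1\t108931\nXM_022318845.1\t13164\nXM_022318843.1\t13164\n' > taxids.txt

# acc_with_dups.txt: the original list, in its original order, with one duplicate.
printf 'XM_022318843.1\nXM_022318845.1\nXM_022318843.1\nXM_022328847.1\n' > acc_with_dups.txt

# First file: build an accession -> TaxId lookup table.
# Second file: print one row per original line, in original order.
awk -F'\t' '
  NR == FNR { taxid[$1] = $2; next }
  { print $1 "\t" (($1 in taxid) ? taxid[$1] : "NA") }
' taxids.txt acc_with_dups.txt > ordered.txt

cat ordered.txt
```

Accessions with no match are written with NA instead of being dropped, so the output has the same number of rows as the input list.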

Have you signed up for an NCBI API key (NCBI_API_KEY)? If you have not, you should do that first. Also, instead of a tab-separated values file (if that is what .tsv refers to in your script), use one value per line and see if that helps.

ADD REPLY • link 3.9 years ago by GenoMax 142k
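
Both suggestions can be sketched in a few lines. Entrez Direct reads the key from the NCBI_API_KEY environment variable; the value below is a placeholder, not a real key. The conversion step assumes the accessions sit in the first tab-separated column, which the question does not state, and example.tsv and acc.txt are hypothetical file names.

```shell
# Placeholder value: replace with the key from your NCBI account settings.
export NCBI_API_KEY="your_api_key_here"

# Hypothetical two-column sample standing in for the real TSV file.
printf 'XM_022318843.1\tnote_a\nXM_022318845.1\tnote_b\n' > example.tsv

# Keep only the first column and drop any carriage returns,
# leaving one accession per line.
cut -f1 example.tsv | tr -d '\r' > acc.txt

cat acc.txt
```

Putting the export line in a shell startup file would make the key available in every session, so scripts do not need to set it themselves.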

I have not signed up for a key. Thanks for pointing this out, I'm very new to the field and the NCBI documentation is a bit dense.

ADD REPLY • link 3.8 years ago by Myzus • 0