Question: Entrez Direct not recognizing specific NCBI Accession numbers
3 months ago by
Myzus0
Canada
Myzus0 wrote:

Hello,

I have a large list of NCBI accession numbers (~2000) in a TSV file, and I would like to generate a list of TaxId matches. After trying a couple of options, I settled on the bash loop presented in this post. I'll reproduce my slightly modified version here:

#!/bin/bash

for acc in `cat <file.tsv>` ; do
  efetch -format docsum -db nuccore -id "$acc" \
        | xtract -pattern DocumentSummary -element AccessionVersion,TaxId \
        >> <outfile.txt>;
done

When I run this loop, I get the following error:

ERROR in esummary: invalid uid cat at position=0 XM_022318843.1

When I enter this accession into the NCBI search bar, it turns up a valid entry. All other accessions are successfully matched to TaxIds, but the accession named in the error is left out, so the output no longer lines up with the order of accessions in the TSV file. Even weirder, if I run the script multiple times, I get the same error, but for different accession numbers. On my third try, the script ran with no errors and, from what I can tell, the output is exactly what I need.

Technically the problem is solved, but this is still a mystery and I would like an answer. If there's some weird quirk of Entrez Direct or my script, I would like to know in case it might cause issues downstream.

ncbi entrez direct • 200 views
modified 3 months ago by genomax91k • written 3 months ago by Myzus0

I ran:

➜ efetch -format docsum -db nuccore -id XM_022318843.1

So it works fine - I'm guessing your input is malformed. Try running grep "XM_022318843.1" file.tsv | cat -te and see if it prints any unexpected invisible characters.

modified 3 months ago • written 3 months ago by RamRS30k
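
A minimal sketch of how the input could be normalised before any lookup, assuming the accession sits in the first tab-separated column and that stray carriage returns or spaces are the kind of invisible characters being checked for. The function name `clean_accessions` and the file names are made up for illustration.

```shell
# Hypothetical helper: normalise an accession list read from stdin.
# Strips carriage returns, keeps only the first tab-separated column,
# trims surrounding whitespace, and drops empty lines.
clean_accessions() {
  tr -d '\r' \
    | cut -f1 \
    | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -v '^$'
}

# Example (file names are placeholders):
#   clean_accessions < file.tsv > acc.txt
```

The result is one accession per line, which is also the layout that `epost -input` expects.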

Hi. As I mentioned in the question, when I ran the script a second time I didn't get an error for that accession; it was matched successfully. I assumed based on this that my file was formatted correctly, but that there's an issue within Entrez Direct that's causing certain accessions to be dropped, seemingly at random.

written 3 months ago by Myzus0

I apologize for not reading your question well. As you say, it does seem to be a server-side problem. The API key that genomax mentions should solve your problem.

written 3 months ago by RamRS30k
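
If the failures are intermittent and server-side, one way to keep the per-accession loop while protecting row order is to retry empty lookups and emit a placeholder when every attempt fails. This is only a sketch: the function names, the `LOOKUP_CMD` and `RETRY_DELAY` variables, and the `NA` placeholder are assumptions, and it treats an empty result as a failure because the exit status of `efetch` on this error is not confirmed in the thread.

```shell
# Hypothetical helper: print "AccessionVersion<TAB>TaxId" for one accession.
# stdin is redirected from /dev/null so the call cannot consume the input
# of an enclosing "while read" loop.
lookup_taxid() {
  efetch -format docsum -db nuccore -id "$1" < /dev/null \
    | xtract -pattern DocumentSummary -element AccessionVersion,TaxId
}

# Hypothetical helper: retry a lookup until it prints something.
# $1 = accession, $2 = maximum attempts (default 3).
# LOOKUP_CMD and RETRY_DELAY are made-up knobs for this sketch.
lookup_with_retry() {
  acc=$1
  tries=${2:-3}
  n=1
  while [ "$n" -le "$tries" ]; do
    result=$("${LOOKUP_CMD:-lookup_taxid}" "$acc") || true
    if [ -n "$result" ]; then
      printf '%s\n' "$result"
      return 0
    fi
    n=$((n + 1))
    # Pause only if another attempt will follow.
    if [ "$n" -le "$tries" ]; then
      sleep "${RETRY_DELAY:-2}"
    fi
  done
  # Every attempt came back empty: emit a placeholder so row order is kept.
  printf '%s\tNA\n' "$acc"
  return 1
}

# Example (file names are placeholders):
#   while IFS= read -r acc; do lookup_with_retry "$acc"; done < acc.txt > taxids.txt
```

Because every input line produces exactly one output line, a dropped accession shows up as `NA` instead of silently shifting the rows that follow.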
3 months ago by
genomax91k
United States
genomax91k wrote:

You should have one accession per line in the input file. Using the epost method is the best option when submitting multiple queries.

$ cat acc.txt

XM_022318843
XM_022318845
XM_022328847

$ epost -db nuccore -input acc.txt -format acc | efetch -format docsum -db nuccore  | xtract -pattern DocumentSummary -element AccessionVersion,TaxId

XM_022328847.1  108931
XM_022318845.1  13164
XM_022318843.1  13164
written 3 months ago by genomax91k

Thanks, I tried using epost but it removes duplicate entries from my input file, which causes problems for me further down the line. I couldn't figure out why this is or how to fix it, hence the bash loop.

written 3 months ago by Myzus0
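
One possible workaround for the de-duplication is to let epost return each unique record once and then expand that table back to the original list, restoring both duplicates and order. The sketch below assumes the lookup table has `AccessionVersion<TAB>TaxId` lines (as in the xtract output above), that the original list has one accession per line, and that matching on the accession with its version suffix removed is acceptable. The name `restore_order` and the `NA` placeholder are made up.

```shell
# Hypothetical helper: expand a de-duplicated lookup table back to the
# original list. $1 = table of "AccessionVersion<TAB>TaxId" lines,
# $2 = original accession list (one per line, duplicates allowed).
# Note: the NR == FNR idiom assumes the lookup table is not empty.
restore_order() {
  awk -F'\t' '
    # First file: index TaxIds by accession with the version suffix removed.
    NR == FNR { key = $1; sub(/\.[0-9]+$/, "", key); taxid[key] = $2; next }
    # Second file: emit one row per input line, in input order.
    {
      key = $1; sub(/\.[0-9]+$/, "", key)
      print $1 "\t" ((key in taxid) ? taxid[key] : "NA")
    }
  ' "$1" "$2"
}

# Example (file names are placeholders):
#   restore_order lookup.tsv acc.txt > taxids_in_order.tsv
```

Accessions that epost did not return appear with `NA`, so gaps are visible rather than silently removed.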

Have you signed up for an NCBI_API_KEY? If not, you should do that first. Instead of using a tab-separated values file (if that is what .tsv refers to in your script), use one value per line and see if that helps.

modified 3 months ago • written 3 months ago by genomax91k
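
For reference, Entrez Direct picks the key up from the `NCBI_API_KEY` environment variable, and NCBI allows a higher request rate with a key than without one. A small sketch of setting it and checking for it before a long run follows; the key value is a placeholder and the function name `require_api_key` is made up.

```shell
# Placeholder value: replace with the key from your own NCBI account settings,
# for example in ~/.bashrc or at the top of the script.
# export NCBI_API_KEY=your_key_here

# Hypothetical helper: warn before a long run if no key is set.
require_api_key() {
  if [ -z "${NCBI_API_KEY:-}" ]; then
    echo "NCBI_API_KEY is not set; requests will use the lower anonymous rate limit" >&2
    return 1
  fi
}
```

Calling `require_api_key || exit 1` at the top of a lookup script is one way to avoid starting a 2000-accession run without the key.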

I have not signed up for a key. Thanks for pointing this out, I'm very new to the field and the NCBI documentation is a bit dense.

written 3 months ago by Myzus0