Extract NCBI's refseq assembly accession number from nuccore IDs
2.9 years ago

Hey guys,

I have a list of nuccore IDs in a text file (let's call it file.txt), and I want to pair each nuccore ID with its NCBI RefSeq assembly accession, like this:

GCF_000006765.1_NC_002516.2

I've tried the following command, but only the RefSeq assembly accession shows up:

for file in $(cat file.txt) ; do esearch -db nuccore -query "$file" | elink -db assembly -target assembly | esummary | xtract -pattern DocumentSummary -element Caption,AssemblyAccession,BioSample >> GCFs_nucl_accessions.txt; done

Can you help me out? Thanks!


What if there was no assembly for a nucleotide sequence?


All the nucleotide IDs I have correspond to either the chromosome or plasmids from complete bacterial genomes, so I expect each ID will have a corresponding assembly accession.
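
If an ID ever did come back without an assembly, the loop would write a line with an empty accession and nothing would flag it. A minimal sketch of a guard, assuming a hypothetical helper `format_pair` and an `NA` placeholder (neither comes from this thread). In real use the first argument would be the output of the esearch pipeline; here it is passed as a plain string so the snippet runs offline:

```shell
# format_pair ASSEMBLY NUCCORE_ID
# Hypothetical helper: prints "ASSEMBLY_NUCCORE_ID", substituting "NA"
# for ASSEMBLY when it is empty (i.e. the lookup returned nothing).
format_pair() {
    printf '%s_%s\n' "${1:-NA}" "$2"
}

format_pair "GCF_000006765.1" "NC_002516.2"
format_pair "" "NC_002516.2"
```

The second call prints NA_NC_002516.2, which makes missing assemblies easy to grep for afterwards.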


OK, your query was fine. I think I have fixed the shell code so that it works as expected now.


This is the assembly database record for the ID included above. Are you looking to get the NC* ID based on the GCF ID? By the way, GCF IDs are RefSeq assembly accessions, not nuccore IDs.


I have a list of NC* IDs, and want to append the corresponding GCF ID to each one.


Please post more than one example. It is always good to do this when you ask questions about IDs. You can simply do this to get the GCF ID:

$ esearch -db assembly -query "NC_002516.2"  | esummary | xtract -pattern DocumentSummary -element RefSeq
GCF_000006765.1

Sure, here are the first five IDs:

NC_002774.1
NC_003140.1
NC_005951.1
NC_006625.1
NC_007790.1
2.9 years ago
GenoMax 141k

Using EntrezDirect:

$ more id

NC_002774.1
NC_003140.1
NC_005951.1
NC_006625.1
NC_007790.1

$ for i in $(cat id); do printf '%s\t' "$i"; esearch -db assembly -query "$i" | esummary | xtract -pattern DocumentSummary -element RefSeq; done
NC_002774.1 GCF_000009665.1
NC_003140.1 GCF_000009645.1
NC_005951.1 GCF_000011525.1
NC_006625.1 GCF_000009885.1
NC_007790.1 GCF_000013465.1

You can change the printed output as needed to concatenate the IDs the way you want them.
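
One possible way to get the GCF_..._NC_... layout from the question is to post-process the tab-separated lines with awk. This is only a sketch: `swap_and_join` is a hypothetical helper name, and it assumes each input line is "nuccore ID, tab, assembly accession" as printed by the loop above. The sample line is taken from the output shown, so the snippet runs without network access:

```shell
# Reads "NUCCORE<TAB>ASSEMBLY" lines on stdin and prints "ASSEMBLY_NUCCORE".
# swap_and_join is a hypothetical name, not part of EntrezDirect.
# Lines with fewer than two fields (no assembly found) are skipped.
swap_and_join() {
    awk -F '\t' 'NF >= 2 { print $2 "_" $1 }'
}

printf 'NC_002774.1\tGCF_000009665.1\n' | swap_and_join
```

Piping the whole for loop into `swap_and_join` and redirecting to a file should then give one joined ID per line.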

2.9 years ago
Michael 54k

Your shell code was almost correct, try the following:

# For each nuccore ID, append "<assembly fields>_<nuccore ID>" to the output file
for file in $(cat file.txt); do
   echo $(esearch -db nuccore -query "$file" | \
   elink -db assembly -target assembly | \
   esummary | xtract -pattern DocumentSummary -element \
   Caption,AssemblyAccession,BioSample)_$file >> GCFs_nucl_accessions.txt
done

Thanks for sharing this, but at least for me the code doesn't work exactly as expected; it outputs

-bash: GCF_000009665.1_NC_002774.1: command not found
-bash: GCF_000009645.1_NC_003140.1: command not found
...
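
Bash prints this kind of message when the text produced by a command substitution ends up in command position, for example if the line starts with `$(...)` rather than `echo $(...)`. Whether that is what happened here is an assumption; the thread does not show the exact command that was run. A small offline sketch, where `fake_lookup` is a hypothetical stub standing in for the esearch pipeline:

```shell
# fake_lookup stands in for the esearch | elink | esummary | xtract pipeline
# (hypothetical stub so the example runs without network access).
fake_lookup() {
    echo "GCF_000009665.1"
}

id="NC_002774.1"

# Without echo, bash tries to execute the substituted text as a command:
#   $(fake_lookup)_$id        -> "command not found"
# With echo, the text is printed instead:
echo "$(fake_lookup)_$id"
```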