Question: Get genomes in genbank that are NOT in RefSeq
7 months ago, rororo wrote:

NCBI has two collections of assemblies: GenBank (all submitted sequences) and RefSeq (curated GenBank sequences).

A list of both is available here: and

Now I want to get a list of assembly (genome) accession numbers that are in GenBank but not in RefSeq. Unfortunately, I could not find any mapping file on NCBI's sites. Does anyone have an idea how to obtain that list?

genbank refseq ncbi genome • 252 views
modified 7 months ago by vkkodali • written 7 months ago by rororo

Unless I'm missing a trick, this should be as simple as something like:

comm -23 assembly_summary_genbank.txt assembly_summary_refseq.txt

I haven't double-checked that this is 100% accurate, though, and it assumes I got the files the right way round!
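One caveat with running comm on the raw files: comm compares whole lines and expects sorted input, and the two summary files list the same assembly under different accessions (GCA_ in GenBank, GCF_ in RefSeq), so the accession columns need to be extracted and sorted first. A minimal sketch of that idea follows. It assumes that column 18 (gbrs_paired_asm) of assembly_summary_refseq.txt holds the paired GenBank accession and that header lines start with "#"; the function name genbank_not_in_refseq is made up for illustration.

```shell
# Sketch: list GenBank accessions that are not paired with any RefSeq assembly.
# Assumptions: tab-separated files, header lines start with "#",
# column 1 = assembly accession, column 18 = gbrs_paired_asm.
genbank_not_in_refseq() {
    # $1 = assembly_summary_genbank.txt, $2 = assembly_summary_refseq.txt
    gb_sorted=$(mktemp)
    rs_sorted=$(mktemp)

    # All GenBank accessions (column 1), sorted for comm.
    awk -F'\t' '!/^#/ { print $1 }' "$1" | LC_ALL=C sort -u > "$gb_sorted"

    # GenBank accessions that RefSeq entries report as their pair (column 18).
    awk -F'\t' '!/^#/ && $18 != "na" { print $18 }' "$2" | LC_ALL=C sort -u > "$rs_sorted"

    # -23 keeps only lines unique to the first (GenBank) list.
    LC_ALL=C comm -23 "$gb_sorted" "$rs_sorted"

    rm -f "$gb_sorted" "$rs_sorted"
}
```

It could be called as: genbank_not_in_refseq assembly_summary_genbank.txt assembly_summary_refseq.txt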

modified 7 months ago • written 7 months ago by Joe
7 months ago, vkkodali (United States) wrote:

The assembly_summary_genbank.txt file has a field, gbrs_paired_asm (column 18), that gives the accession of the paired RefSeq assembly for a given GenBank assembly, or "na" if there is no pair. You should be able to get the entire list of assemblies without a matching RefSeq assembly as follows:

awk 'BEGIN{FS="\t";OFS="\t"}($18=="na")' assembly_summary_genbank.txt
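The command above prints the full summary line for each unpaired assembly. Since the question asks only for accession numbers, a small variation that prints just the first column might look like the sketch below. It assumes the accession is in column 1 and that header lines start with "#"; the function name genbank_only is hypothetical.

```shell
# Sketch: print only the accessions of GenBank assemblies with no RefSeq pair.
# Assumptions: tab-separated file, header lines start with "#",
# column 1 = assembly accession, column 18 = gbrs_paired_asm.
genbank_only() {
    # $1 = path to assembly_summary_genbank.txt
    awk -F'\t' '!/^#/ && $18 == "na" { print $1 }' "$1"
}
```

For example: genbank_only assembly_summary_genbank.txt > genbank_only_accessions.txt (the output file name is arbitrary).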