Question: Get genomes in GenBank that are NOT in RefSeq
NCBI has two collections of assemblies: GenBank (all submitted sequences) and RefSeq (a curated subset of the GenBank sequences).

A list of both is available here: and

Now I want to get a list of assembly (genome) accession numbers that are in GenBank but not in RefSeq. Unfortunately, I could not find any mapping file on NCBI's sites. Does anyone have an idea how to obtain that list?

Unless I'm missing a trick, this should be as simple as something like:

comm -23 assembly_summary_genbank.txt assembly_summary_refseq.txt

I haven't double-checked that this is 100% accurate, though, and it assumes I got the files the right way round (-23 keeps only the lines unique to the first file).
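One caveat with comm: it compares whole lines and expects sorted input, and the two summary files use different accessions for the same assembly (GCA_ in GenBank, GCF_ in RefSeq), so whole lines will not match. A hedged sketch that compares accession keys instead is below. It assumes column 18 (gbrs_paired_asm) of the RefSeq file holds the paired GenBank accession, and the function name genbank_not_in_refseq is made up for illustration.

```shell
# Print GenBank accessions that no RefSeq assembly lists as its paired assembly.
# Assumption: column 1 is the assembly accession and column 18 is gbrs_paired_asm
# in both summary files; lines starting with '#' are header comments.
genbank_not_in_refseq() {
    gb_keys=$(mktemp)
    rs_keys=$(mktemp)
    # GenBank accessions (column 1), sorted for comm
    grep -v '^#' "$1" | cut -f 1 | LC_ALL=C sort -u > "$gb_keys"
    # GenBank accessions that RefSeq assemblies are paired with (column 18)
    grep -v '^#' "$2" | cut -f 18 | LC_ALL=C sort -u > "$rs_keys"
    # -23 keeps only the lines unique to the first (GenBank) list
    LC_ALL=C comm -23 "$gb_keys" "$rs_keys"
    rm -f "$gb_keys" "$rs_keys"
}

# Usage (not run here):
#   genbank_not_in_refseq assembly_summary_genbank.txt assembly_summary_refseq.txt
```

Sorting both key lists under the same locale (LC_ALL=C) matters, because comm silently gives wrong results when its inputs are not in the order it expects.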

The assembly_summary_genbank.txt file has a field, gbrs_paired_asm (column 18), which indicates whether there is a matched RefSeq pair for a given GenBank assembly; it is set to "na" when there is none. You should be able to get the entire list of assemblies without a matching RefSeq assembly as follows:

awk 'BEGIN{FS="\t";OFS="\t"}($18=="na")' assembly_summary_genbank.txt
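The command above prints the full rows. Since the question asks for accession numbers only, a small variation can print just the first column. This is a sketch that assumes column 1 is the assembly accession; the function name unpaired_genbank_accessions is hypothetical.

```shell
# Print only the accessions of GenBank assemblies with no paired RefSeq assembly.
# Assumption: column 1 is the assembly accession, column 18 is gbrs_paired_asm.
unpaired_genbank_accessions() {
    # Skip '#' header lines, keep rows whose 18th field is "na", print column 1
    awk -F '\t' '!/^#/ && $18 == "na" { print $1 }' "$1"
}

# Usage (not run here):
#   unpaired_genbank_accessions assembly_summary_genbank.txt > genbank_only.txt
```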
