Question: Get genomes in genbank that are NOT in RefSeq
rororo wrote, 7 months ago:

NCBI has two sections for assemblies: GenBank (all submitted sequences) and RefSeq (curated GenBank sequences).

A summary list for each is available here: ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt and ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/assembly_summary_genbank.txt.

Now I want to get a list of assembly (genome) accession numbers that are in GenBank but not in RefSeq. Unfortunately, I could not find any mapping file on NCBI's sites. Does anyone have an idea how to obtain that list?

Tags: genbank, refseq, ncbi, genome

Unless I'm missing a trick, this should be as simple as something like:

comm -23 assembly_summary_genbank.txt assembly_summary_refseq.txt

I haven't double-checked that this is 100% accurate, though, and it assumes I got the files the right way round!
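One caveat with the command above: comm expects sorted input and compares whole lines, and the GenBank and RefSeq rows for the same assembly differ (GCA_ versus GCF_ accessions, among other fields). A minimal sketch of the same idea applied to accession columns only is below. It assumes that column 1 is assembly_accession and column 18 is gbrs_paired_asm in both files, and that header lines start with "#"; the function name genbank_only is made up for illustration.

```shell
# Hypothetical helper: print GenBank accessions with no RefSeq counterpart.
# Assumption: column 1 = assembly_accession, column 18 = gbrs_paired_asm,
# and header lines start with "#".
genbank_only() {
    gb_file=$1
    rs_file=$2
    tmp_dir=$(mktemp -d) || return 1
    # All GenBank accessions (GCA_...), sorted as comm requires.
    grep -v '^#' "$gb_file" | cut -f1 | LC_ALL=C sort -u > "$tmp_dir/genbank.ids"
    # GenBank accessions that RefSeq rows list as their paired assembly.
    grep -v '^#' "$rs_file" | cut -f18 | LC_ALL=C sort -u > "$tmp_dir/paired.ids"
    # -23 suppresses columns 2 and 3, keeping lines unique to the first list.
    LC_ALL=C comm -23 "$tmp_dir/genbank.ids" "$tmp_dir/paired.ids"
    rm -rf "$tmp_dir"
}
```

Usage would be something like: genbank_only assembly_summary_genbank.txt assembly_summary_refseq.txt > genbank_only.txt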

(comment written 7 months ago by Joe)
vkkodali (United States) wrote, 7 months ago:

The assembly_summary_genbank.txt file has a field, gbrs_paired_asm (column 18), which holds the accession of the paired RefSeq assembly for a given GenBank assembly, or "na" if there is none. You should be able to get the entire list of assemblies without a matching RefSeq assembly as follows:

awk 'BEGIN{FS="\t";OFS="\t"}($18=="na")' assembly_summary_genbank.txt
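Since the question asks for accession numbers only, a small variation on the awk command above can print just the first column. The sketch below also looks up the gbrs_paired_asm column by name from the "#" header row rather than hard-coding 18, falling back to 18 if no such header is found. The function name unpaired_accessions is hypothetical, and the header handling is an assumption about the file layout rather than something NCBI guarantees.

```shell
# Hypothetical helper: list GenBank accessions with no paired RefSeq assembly.
unpaired_accessions() {
    awk -F '\t' '
        # Header rows start with "#": find gbrs_paired_asm by name if present.
        /^#/ {
            line = $0
            sub(/^#+ */, "", line)
            n = split(line, name, "\t")
            for (i = 1; i <= n; i++)
                if (name[i] == "gbrs_paired_asm") col = i
            next
        }
        # Assumption: fall back to column 18 when no header named the field.
        col == 0 { col = 18 }
        # "na" marks an assembly without a RefSeq pair; print its accession.
        $col == "na" { print $1 }
    ' "$1"
}
```

Usage would be something like: unpaired_accessions assembly_summary_genbank.txt > genbank_only_accessions.txt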