Question: Retrieve all sequence ids from a master record
john70 wrote:

Consider the following genome entry in NCBI.

https://www.ncbi.nlm.nih.gov/nuccore/NZ_ABAX00000000.3

This is a master entry for the assembly project of a bacterium. As you can see at the top, it says that it does not contain any sequence. The genomic sequence is distributed across multiple other entries (NZ_DS499719-NZ_DS499744), one entry per assembled scaffold.

Hence, if I use, for example, the following Entrez query:

esearch -db nuccore -query 'NZ_ABAX00000000.3' | efetch -format fasta

it returns a FASTA file containing just "N".

My questions are the following:

  • How can I identify automatically if an NCBI entry is such a master entry?
  • How can I get automatically all entries for a master entry? (In this case NZ_DS499719-NZ_DS499744)

Tags: assembly, ncbi
modified 14 months ago • written 14 months ago by john70

a.zielezinski wrote:

How can I identify automatically if an NCBI entry is such a master entry?

Master records have distinguishable accession numbers. The accession of a master record consists of a four-letter prefix followed only by zeroes (optionally preceded by a RefSeq prefix such as NZ_ and followed by a version suffix such as .3). The number of zeroes varies; it increases to nine for Whole Genome Shotgun projects with one million or more contigs. To distinguish master records from normal records programmatically, you can use a regular expression. For example, here is a Python function that takes an accession number and returns True if it belongs to a master record.

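import re

def is_master_record(accession):
    """Return True if the accession looks like a WGS master record."""
    # Four uppercase letters, then only zeroes, then an optional ".version".
    # Raw string avoids invalid-escape warnings; \d+ allows versions above 9.
    return bool(re.search(r'[A-Z]{4}0+(\.\d+)?$', accession))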

A little validation:

NZ_ABAX000000000.2 True
NZ_ABAX000000000 True
NZ_ABAX00003200 False
NZ_YYYY00000 True
NZ_ABAX0000.1 True
NZ_DS499731.1 False
NZ_AAAAAA0000 True
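
The list above can be checked with a short loop. This is only a minimal sketch: it carries its own copy of the check so that it runs standalone, and the dictionary name `expected` is introduced here purely for illustration.

```python
import re

def is_master_record(accession):
    # Same idea as above: four letters, only zeroes, optional ".version".
    return bool(re.search(r'[A-Z]{4}0+(\.\d+)?$', accession))

# Accessions from the list above, paired with the results reported there.
expected = {
    'NZ_ABAX000000000.2': True,
    'NZ_ABAX000000000': True,
    'NZ_ABAX00003200': False,
    'NZ_YYYY00000': True,
    'NZ_ABAX0000.1': True,
    'NZ_DS499731.1': False,
    'NZ_AAAAAA0000': True,
}

for accession, result in expected.items():
    # Fails loudly if the pattern disagrees with the reported result.
    assert is_master_record(accession) == result, accession
    print(accession, result)
```

Note that the pattern is anchored only at the end, which is why an accession with a longer letter prefix (the last one in the list) also matches.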

How can I get automatically all entries for a master entry? (In this case NZ_DS499719-NZ_DS499744)

The following command will give you all GenBank and RefSeq records related to the master entry NZ_ABAX00000000.3.

esearch -db genome -query NZ_ABAX00000000.3 | elink -target assembly | elink -target nuccore | efetch -format fasta

If you want RefSeq entries only (NZ_DS499719-NZ_DS499744), you can filter the list using efilter.

esearch -db genome -query NZ_ABAX00000000.3 | elink -target assembly | elink -target nuccore | efilter -query "refseq[Filter]" | efetch -format fasta

As a result, you will get FASTA sequences for the following entries:

NZ_DS499744.1 Anaerostipes caccae DSM 14662 Scfld_03_25, whole genome shotgun sequence
NZ_DS499743.1 Anaerostipes caccae DSM 14662 Scfld_03_24, whole genome shotgun sequence
NZ_DS499742.1 Anaerostipes caccae DSM 14662 Scfld_03_23, whole genome shotgun sequence
NZ_DS499741.1 Anaerostipes caccae DSM 14662 Scfld_03_22, whole genome shotgun sequence
NZ_DS499740.1 Anaerostipes caccae DSM 14662 Scfld_03_21, whole genome shotgun sequence
NZ_DS499739.1 Anaerostipes caccae DSM 14662 Scfld_03_20, whole genome shotgun sequence
NZ_DS499738.1 Anaerostipes caccae DSM 14662 Scfld_03_19, whole genome shotgun sequence
NZ_DS499737.1 Anaerostipes caccae DSM 14662 Scfld_03_18, whole genome shotgun sequence
NZ_DS499736.1 Anaerostipes caccae DSM 14662 Scfld_03_17, whole genome shotgun sequence
NZ_DS499735.1 Anaerostipes caccae DSM 14662 Scfld_03_16, whole genome shotgun sequence
NZ_DS499734.1 Anaerostipes caccae DSM 14662 Scfld_03_15, whole genome shotgun sequence
NZ_DS499733.1 Anaerostipes caccae DSM 14662 Scfld_03_14, whole genome shotgun sequence
NZ_DS499732.1 Anaerostipes caccae DSM 14662 Scfld_03_13, whole genome shotgun sequence
NZ_DS499731.1 Anaerostipes caccae DSM 14662 Scfld_03_12, whole genome shotgun sequence
NZ_DS499730.1 Anaerostipes caccae DSM 14662 Scfld_03_11, whole genome shotgun sequence
NZ_DS499729.1 Anaerostipes caccae DSM 14662 Scfld_03_10, whole genome shotgun sequence
NZ_DS499728.1 Anaerostipes caccae DSM 14662 Scfld_03_9, whole genome shotgun sequence
NZ_DS499727.1 Anaerostipes caccae DSM 14662 Scfld_03_8, whole genome shotgun sequence
NZ_DS499726.1 Anaerostipes caccae DSM 14662 Scfld_03_7, whole genome shotgun sequence
NZ_DS499725.1 Anaerostipes caccae DSM 14662 Scfld_03_6, whole genome shotgun sequence
NZ_DS499724.1 Anaerostipes caccae DSM 14662 Scfld_03_5, whole genome shotgun sequence
NZ_DS499723.1 Anaerostipes caccae DSM 14662 Scfld_03_4, whole genome shotgun sequence
NZ_DS499722.1 Anaerostipes caccae DSM 14662 Scfld_03_3, whole genome shotgun sequence
NZ_DS499721.1 Anaerostipes caccae DSM 14662 Scfld_03_2, whole genome shotgun sequence
NZ_DS499720.1 Anaerostipes caccae DSM 14662 Scfld_03_1, whole genome shotgun sequence
NZ_DS499719.1 Anaerostipes caccae DSM 14662 Scfld_03_0, whole genome shotgun sequence
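
If only the sequence IDs are needed, they can be pulled out of the FASTA headers afterwards. The following is a minimal sketch, assuming each header starts with `>` followed by the accession and a space, as in the list above. The function name `fasta_ids` is hypothetical and the `ACGT` lines are placeholders, not real sequence data.

```python
def fasta_ids(fasta_text):
    """Return the first whitespace-delimited token of every '>' header line."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith('>'):
            fields = line[1:].split()
            if fields:  # skip a bare '>' with no identifier
                ids.append(fields[0])
    return ids

# Two headers copied from the list above; the sequences are placeholders.
example = """>NZ_DS499744.1 Anaerostipes caccae DSM 14662 Scfld_03_25, whole genome shotgun sequence
ACGT
>NZ_DS499743.1 Anaerostipes caccae DSM 14662 Scfld_03_24, whole genome shotgun sequence
ACGT
"""

print(fasta_ids(example))
```

The output of efetch could be saved to a file and its contents passed to this function.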
modified 14 months ago • written 14 months ago by a.zielezinski

Sweet.

That is a nice answer to the second question, but I am still not completely satisfied.

The problem is that some of my entries are fully assembled genomes, like NC_004663.1. Others, like NZ_ABAX00000000.3, are just the record for the assembly project. Hence I can't use both with the same Entrez call.

It seems that all accessions of assembly projects contain eight zeros, but I am not sure if this is consistent. To build a script that handles both, I need a way to discriminate them from each other.
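
A script that handles both kinds of entries could look roughly like the sketch below. It assumes the regular-expression check and the two pipelines quoted in this thread; the helper name `build_fetch_command` is hypothetical, and the function only builds the command string without running it.

```python
import re
import shlex

def is_master_record(accession):
    # Four letters, only zeroes, optional ".version" (check from this thread).
    return bool(re.search(r'[A-Z]{4}0+(\.\d+)?$', accession))

def build_fetch_command(accession):
    """Return a shell pipeline (as a string) suited to the accession type."""
    acc = shlex.quote(accession)  # guard against shell metacharacters
    if is_master_record(accession):
        # Master record: go through the assembly to the scaffold records.
        return (f'esearch -db genome -query {acc} '
                '| elink -target assembly | elink -target nuccore '
                '| efilter -query "refseq[Filter]" | efetch -format fasta')
    # Ordinary record: fetch the sequence directly.
    return f'esearch -db nuccore -query {acc} | efetch -format fasta'

for acc in ('NC_004663.1', 'NZ_ABAX00000000.3'):
    print(build_fetch_command(acc))
```

Whether the accession pattern is consistent across all assembly projects is exactly the open assumption here, so the result of the check should be treated with some caution.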

written 14 months ago by john70

Sorry, somehow I missed your first question. I've just updated my answer.

modified 14 months ago • written 14 months ago by a.zielezinski

Hey, I found a master record (NZ_ARET00000000.1) that does not work with your query.

esearch -db genome -query "NZ_ARET00000000.1" | elink -target assembly | elink -target nuccore | efilter -query "refseq[Filter]" | efetch -format acc

The master record says that only the accessions NZ_KB892637-NZ_KB892704 are the correct scaffolds of the project, but the query returns many more accessions.

NZ_AUUC01000, NZ_KE3922, NZ_JPJF01000, NZ_PQGB0100

Is this problem caused by the NCBI database not being maintained correctly?

written 14 months ago by john70

Hmm, that's interesting. I truly don't know how to interpret this case. I would also guess that this may have something to do with the database maintenance. Can we somehow deduce which of these two results is true: the information in the master record (NZ_KB892637-NZ_KB892704) or the results returned by esearch (NZ_KB892637-NZ_KB892704 plus extra records)? In the worst case, we can contact NCBI about this issue.
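
To at least list which returned records fall outside the range stated in the master record, one could expand the range and compare. The following is a minimal sketch, assuming the range is always written as two accessions with the same prefix and the same number of digits, joined by a hyphen. The helper names `expand_range` and `unexpected` are hypothetical, and `NZ_XX000001.1` is a made-up accession used only for illustration.

```python
import re

def expand_range(accession_range):
    """Expand 'NZ_KB892637-NZ_KB892704' into a set of unversioned accessions."""
    first, last = accession_range.split('-')
    prefix1, start = re.fullmatch(r'(\D+)(\d+)', first).groups()
    prefix2, end = re.fullmatch(r'(\D+)(\d+)', last).groups()
    if prefix1 != prefix2 or len(start) != len(end):
        raise ValueError(f'cannot expand range: {accession_range}')
    width = len(start)  # keep any leading zeroes in the numeric part
    return {f'{prefix1}{n:0{width}d}' for n in range(int(start), int(end) + 1)}

def unexpected(returned, accession_range):
    """Return the accessions that are not covered by the range."""
    allowed = expand_range(accession_range)
    # Drop the ".version" suffix before comparing.
    return [acc for acc in returned if acc.split('.')[0] not in allowed]

# The last entry is a made-up accession standing in for an extra record.
returned = ['NZ_KB892637.1', 'NZ_KB892704.1', 'NZ_XX000001.1']
print(unexpected(returned, 'NZ_KB892637-NZ_KB892704'))
```

This does not tell us which source is correct, only where the two disagree.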

modified 14 months ago • written 14 months ago by a.zielezinski

That sounds really complicated. I will message NCBI and see what they say.

written 14 months ago by john70