Question

How to retrieve whole-genome sequences by GenBank IDs

7.1 years ago

Charles Yin ▴ 180

I have found Biostars very helpful. A new question has been puzzling me lately. I can see the genome features and sequence in NCBI GenBank (link) for accession DS264095, but Entrez does not retrieve the sequence. The code below prints only Ns for the sequence, although the reported length matches the size of the genome (1030563 bp). I also retrieved the corresponding gbk file, but it contains only the features and the CONTIG line, without the actual genome sequence. Do you have any suggestions? Thank you!

from Bio import Entrez
from Bio import SeqIO

# https://www.ncbi.nlm.nih.gov/nuccore/147747968?report=genbank
handle = Entrez.efetch(db='nuccore', rettype='gb', id='DS264095', retmode='text')
for seqRecord in SeqIO.parse(handle, 'genbank'):
    seq = seqRecord.seq
    print('seq:', seq[0:100])
    print('len:', len(seq))

#The outputs:
seq:
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

len: 1030563

The retrieved gbk file is shown below. Its content differs from what the NCBI site displays.

LOCUS       DS264095             1030563 bp    DNA     linear   CON 18-MAY-2007
DEFINITION  Burkholderia mallei FMH scf_1099471655815 genomic scaffold, whole
            genome shotgun sequence.
ACCESSION   DS264095 AAIQ02000000
VERSION     DS264095.1
DBLINK      BioProject: PRJNA13987
            BioSample: SAMN02435848
KEYWORDS    WGS.
SOURCE      Burkholderia mallei FMH
  ORGANISM  Burkholderia mallei FMH
            Bacteria; Proteobacteria; Betaproteobacteria; Burkholderiales;
            Burkholderiaceae; Burkholderia; pseudomallei group.
REFERENCE   1  (bases 1 to 1030563)
  AUTHORS   DeShazer,D., Woods,D.E. and Nierman,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-MAR-2007) The Institute for Genomic Research, 9712
            Medical Center Drive, Rockville, MD 20850, USA
FEATURES             Location/Qualifiers
     source          1..1030563
                     /organism="Burkholderia mallei FMH"
                     /mol_type="genomic DNA"
                     /strain="FMH"
                     /db_xref="taxon:334802"
CONTIG      join(AAIQ02000135.1:1..8136,gap(457),AAIQ02000043.1:1..44311,
            gap(1031),AAIQ02000166.1:1..3120,gap(1729),AAIQ02000194.1:1..761,
            gap(36),AAIQ02000039.1:1..45955,gap(118),AAIQ02000068.1:1..28166,
            gap(192),AAIQ02000195.1:1..749,gap(685),AAIQ02000204.1:1..289,
            gap(154),AAIQ02000163.1:1..3538,gap(375),AAIQ02000123.1:1..10313,
            gap(588),AAIQ02000142.1:1..7218,gap(466),AAIQ02000021.1:1..68386,
            gap(239),AAIQ02000069.1:1..27890,gap(395),AAIQ02000099.1:1..17802,
            gap(481),AAIQ02000038.1:1..45969,gap(717),AAIQ02000152.1:1..6039,
            gap(100),AAIQ02000162.1:1..3813,gap(349),AAIQ02000130.1:1..9302,
            gap(951),AAIQ02000104.1:1..15966,gap(744),AAIQ02000082.1:1..23397,
            gap(2853),AAIQ02000178.1:1..2005,gap(36),AAIQ02000189.1:1..1160,
            gap(36),AAIQ02000120.1:1..10635,gap(36),AAIQ02000184.1:1..1724,
            gap(489),AAIQ02000121.1:1..10540,gap(720),AAIQ02000055.1:1..34907,
            gap(378),AAIQ02000117.1:1..11883,gap(254),AAIQ02000033.1:1..54313,
            gap(288),AAIQ02000137.1:1..7858,gap(863),AAIQ02000115.1:1..12452,
            gap(592),AAIQ02000009.1:1..106604,gap(722),AAIQ02000149.1:1..6242,
            gap(593),AAIQ02000186.1:1..1381,gap(36),AAIQ02000169.1:1..2881,
            gap(468),AAIQ02000148.1:1..6247,gap(437),AAIQ02000164.1:1..3492,
            gap(464),AAIQ02000126.1:1..10017,gap(636),AAIQ02000141.1:1..7280,
            gap(731),AAIQ02000174.1:1..2399,gap(36),AAIQ02000173.1:1..2519,
            gap(246),AAIQ02000013.1:1..98333,gap(237),AAIQ02000168.1:1..2885,
            gap(278),AAIQ02000106.1:1..15267,gap(583),AAIQ02000177.1:1..2034,
            gap(495),AAIQ02000183.1:1..1748,gap(804),AAIQ02000046.1:1..41423,
            gap(357),AAIQ02000167.1:1..3108,gap(36),AAIQ02000171.1:1..2650,
            gap(36),AAIQ02000087.1:1..22096,gap(728),AAIQ02000199.1:1..476,
            gap(199),AAIQ02000180.1:1..1969,gap(36),AAIQ02000205.1:1..262,
            gap(262),AAIQ02000129.1:1..9762,gap(590),AAIQ02000160.1:1..4201,
            gap(473),AAIQ02000150.1:1..6226,gap(1027),AAIQ02000176.1:1..2096,
            gap(279),AAIQ02000032.1:1..57068,gap(491),AAIQ02000094.1:1..18896,
            gap(669),AAIQ02000058.1:1..33027,gap(36),AAIQ02000201.1:1..386,
            gap(625),AAIQ02000125.1:1..10029)
//

sequence • 2.2k views

ADD COMMENT • link updated 7.1 years ago by Ram 43k • written 7.1 years ago by Charles Yin ▴ 180

Please format your post appropriately in the future.

ADD REPLY • link 7.1 years ago by Ram 43k

Sure, I will make sure my future posts are well formatted. Thanks, Ram.

ADD REPLY • link 7.1 years ago by Charles Yin ▴ 180

score 1 · Answer 1 · 2017-03-12

7.1 years ago

GenoMax 141k

esearch -db nuccore -query "DS264095" | efetch -format fasta returns the correct sequence.

Perhaps try gbwithparts instead of just gb, which returns only the scaffold record you posted above.
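
To illustrate why the plain gb record carries a length but no bases: the scaffold is described only by its CONTIG line, a join() of contig accessions and gaps. The sketch below is not part of the original answer. It uses only the Python standard library and a short excerpt of the CONTIG line from the question; the function name parse_contig and the regular expression are assumptions made for illustration, and they only handle the simple "accession:start..end" and "gap(n)" parts seen in this record.

```python
import re

# Excerpt of the CONTIG line from the DS264095 record (first four parts only).
contig_excerpt = (
    "join(AAIQ02000135.1:1..8136,gap(457),AAIQ02000043.1:1..44311,gap(1031))"
)

# Matches either gap(N) or ACCESSION.VERSION:START..END (illustrative pattern).
PART_PATTERN = re.compile(r"gap\((\d+)\)|([A-Z0-9_.]+):(\d+)\.\.(\d+)")


def parse_contig(contig):
    """Return (label, length) pairs for each part of a CONTIG join()."""
    parts = []
    for match in PART_PATTERN.finditer(contig):
        if match.group(1) is not None:
            # A gap contributes unknown bases of the stated length.
            parts.append(("gap", int(match.group(1))))
        else:
            start, end = int(match.group(3)), int(match.group(4))
            parts.append((match.group(2), end - start + 1))
    return parts


parts = parse_contig(contig_excerpt)
total = sum(length for _, length in parts)
for label, length in parts:
    print(label, length)
print("total length of excerpt:", total)
```

The record's total length is the sum of these parts, so a parser can report the length without having any bases. The bases themselves live in the referenced AAIQ02... contig records, which is what gbwithparts (or the FASTA format) expands.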

ADD COMMENT • link 7.1 years ago by GenoMax 141k

Yes, it works when 'gbwithparts' is used as the value for rettype. The following code returns the full genome sequence.

handle = Entrez.efetch(db='nuccore', rettype='gbwithparts', id='DS264095', retmode='text')
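
For anyone who wants the same request without Biopython, here is a minimal sketch using only the standard library. It builds the NCBI E-utilities efetch URL with the same db, id, rettype and retmode parameters as the call above. The helper names efetch_url and fetch_text are made up for this example, and the fetch itself is left as a commented usage line because it needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the NCBI E-utilities efetch endpoint.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(accession, rettype="fasta"):
    """Build an efetch URL for a nuccore record (hypothetical helper)."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": rettype,  # e.g. "fasta" or "gbwithparts"
        "retmode": "text",
    }
    return EFETCH_BASE + "?" + urlencode(params)


def fetch_text(accession, rettype="fasta"):
    """Download the record as text (hypothetical helper, needs network)."""
    with urlopen(efetch_url(accession, rettype)) as response:
        return response.read().decode("utf-8")


print(efetch_url("DS264095", "gbwithparts"))

# Usage (requires network access):
# record_text = fetch_text("DS264095", "gbwithparts")
```

With rettype set to "fasta" this should correspond to the efetch -format fasta command in the answer, and with "gbwithparts" to the Biopython call above.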

ADD REPLY • link 7.1 years ago by Charles Yin ▴ 180