Question

Getting DNA sequence from a genome knowing the start and end position

0

5.1 years ago

PaSua • 0

Python newbie here.

I was wondering whether there is a way to fetch part of a genome sequence from NCBI given a start and an end position. For instance, I'm working with the genome NC_011375.1 and would like to obtain the sequence between bases 259882 and 259896. So far, I have this:

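from Bio import Entrez
from Bio import SeqIO

# NCBI asks for a contact address with every request
Entrez.email = "my@email.org"

# Download the full GenBank record for the accession
handle = Entrez.efetch(db="nuccore",
                       id="NC_011375.1",
                       rettype="gb",
                       retmode="text")

whole_sequence = SeqIO.read(handle, "genbank")

# Slice the record between the two positions
print(whole_sequence[259882:259896])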

And this is the output I get:

ID: NC_011375.1
Name: NC_011375
Description: Streptococcus pyogenes NZ131, complete genome.
Number of features: 0
UnknownSeq(14, alphabet = IUPACAmbiguousDNA(), character = 'N')

As you can see, it's not working: the slice comes back as 14 unknown bases ('N') instead of the real sequence. Since I don't know how to proceed, any help would be appreciated.

Thank you in advance.

start end sequence genbank position • 1.0k views

ADD COMMENT • link updated 5.1 years ago by Pierre Lindenbaum 161k • written 5.1 years ago by PaSua • 0

1

I don't know the syntax for this command, but keep in mind that Python uses 0-based indexing, so the first base is at position 0, not 1. You must adjust the coordinates accordingly.

ADD REPLY • link 5.1 years ago by jean.elbers ★ 1.7k
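The indexing adjustment described above can be sketched in a few lines. This is only an illustration, assuming the positions are 1-based and inclusive at both ends; the helper name `one_based_to_slice` and the toy sequence are hypothetical and not part of Biopython.

```python
def one_based_to_slice(start, end):
    """Convert 1-based, inclusive coordinates to a 0-based Python slice."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    # Python slices are 0-based and exclude the end index, so only the
    # start needs shifting: bases start..end become [start - 1:end].
    return slice(start - 1, end)


# Toy sequence standing in for a genome (hypothetical data).
genome = "ACGTACGTAC"
region = genome[one_based_to_slice(3, 6)]  # bases 3 to 6, inclusive
print(region)  # GTAC
```

For the coordinates in the question, `one_based_to_slice(259882, 259896)` gives `slice(259881, 259896)`, which covers 15 bases.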

0

Solved. I wasn't using the correct ID: it worked with the CP accession instead of the NC_ one. Thank you anyway, because I also needed to adjust the positions to Python's indexing, as you said.

I'm posting the solution here, hoping someone will find it useful:

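from Bio import Entrez
from Bio import SeqIO

# NCBI asks for a contact address with every request
Entrez.email = "my@email.org"

# Download the full GenBank record for the CP accession
handle = Entrez.efetch(db="nuccore",
                       id="CP000829",
                       rettype="gb",
                       retmode="text")

whole_sequence = SeqIO.read(handle, "genbank")

# Bases 259882..259896 (1-based, inclusive) are the 0-based slice [259881:259896]
print(whole_sequence[259881:259896])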

output:

ID: CP000829.1
Name: CP000829
Description: Streptococcus pyogenes NZ131, complete genome.
Number of features: 0
Seq('AATATTCAGATAATT', IUPACAmbiguousDNA())

ADD REPLY • link 5.1 years ago by PaSua • 0

score 4 · Accepted Answer · 2019-03-05

4

5.1 years ago

Pierre Lindenbaum 161k

$ wget -q -O -  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_011375.1&rettype=fasta&seq_start=259882&seq_stop=259896" 

>NC_011375.1:259882-259896 Streptococcus pyogenes NZ131, complete genome
AATATTCAGATAATT

ADD COMMENT • link 5.1 years ago by Pierre Lindenbaum 161k
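For anyone who prefers to stay in Python, the same request can be sketched with the standard library alone. This is a minimal sketch, not a tested client: the helper names `build_efetch_url` and `fetch_region` are hypothetical, the parameters are copied from the wget command above, and the network call is left commented out. Biopython's `Entrez.efetch` should accept the same `seq_start` and `seq_stop` keywords, since it forwards extra keyword arguments to the request.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, start, stop):
    """Build an efetch URL for a sub-sequence in FASTA format.

    start and stop are 1-based and inclusive, as in the wget example.
    """
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "seq_start": start,
        "seq_stop": stop,
    }
    return EFETCH_URL + "?" + urlencode(params)


def fetch_region(accession, start, stop):
    """Download the region and return the sequence without the FASTA header."""
    with urlopen(build_efetch_url(accession, start, stop)) as response:
        lines = response.read().decode("ascii").splitlines()
    return "".join(line for line in lines if line and not line.startswith(">"))


url = build_efetch_url("NC_011375.1", 259882, 259896)
print(url)
# Requires network access, so it is left commented out here:
# print(fetch_region("NC_011375.1", 259882, 259896))
```

Because the server does the slicing, no 0-based adjustment is needed and only the requested bases are downloaded rather than the whole genome.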