Getting DNA sequence from a genome knowing the start and end position
2.8 years ago
PaSua • 0

Python newbie here.

I was wondering whether there is a way to fetch part of a genome sequence from NCBI given a start and an end position. For instance, I'm working with the genome accession NC_011375.1 and I would like to obtain the sequence between bases 259882 and 259896. So far, I have this:

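from Bio import Entrez
from Bio import SeqIO

Entrez.email = "my@email.org"

handle = Entrez.efetch(db="nuccore",
                       id="NC_011375.1",
                       rettype="gb",
                       retmode="text")
# Parse the GenBank record returned by efetch into a SeqRecord
whole_sequence = SeqIO.read(handle, "gb")
handle.close()

print(whole_sequence[259882:259896])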


And this is the output I get:

ID: NC_011375.1
Name: NC_011375
Description: Streptococcus pyogenes NZ131, complete genome.
Number of features: 0
UnknownSeq(14, alphabet = IUPACAmbiguousDNA(), character = 'N')


As you can see, it's not working. I don't know how to proceed, so any help would be appreciated.

I don't know the syntax for this command, but keep in mind that Python uses 0-based indexing, so the first base is at position 0, not 1. You must adjust your coordinates accordingly.
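To make the adjustment concrete, here is a minimal sketch of converting 1-based, inclusive coordinates (as used by NCBI) into a Python slice. The helper name `to_slice` and the toy sequence are made up for illustration; they are not part of Biopython.

```python
def to_slice(start, end):
    """Convert 1-based inclusive coordinates to a 0-based Python slice.

    Python slices are 0-based and exclude the end index, so only the
    start has to be shifted down by one.
    """
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return slice(start - 1, end)


# Toy sequence (hypothetical), just to show the arithmetic
genome = "ACGTACGTAC"

# Bases 3 to 6 in 1-based coordinates are G, T, A, C
print(genome[to_slice(3, 6)])
```

With the coordinates from the question, `to_slice(259882, 259896)` gives `slice(259881, 259896)`, which covers 15 bases.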

Solved. I wasn't using the correct ID (it needs to be a CP accession, not an NC_ one). Anyway, thank you, because I did need to adjust the positions to Python indexing, as you said.

I'm putting the solution here, hoping someone will find it useful:

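from Bio import Entrez
from Bio import SeqIO

Entrez.email = "my@email.org"

handle = Entrez.efetch(db="nuccore",
                       id="CP000829",
                       rettype="gb",
                       retmode="text")
# Parse the GenBank record returned by efetch into a SeqRecord
whole_sequence = SeqIO.read(handle, "gb")
handle.close()

# 1-based bases 259882..259896 correspond to the 0-based slice 259881:259896
print(whole_sequence[259881:259896])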


output:

ID: CP000829.1
Name: CP000829
Description: Streptococcus pyogenes NZ131, complete genome.
Number of features: 0
Seq('AATATTCAGATAATT', IUPACAmbiguousDNA())

2.8 years ago
$ wget -q -O -  "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_011375.1&rettype=fasta&seq_start=259882&seq_stop=259896"

>NC_011375.1:259882-259896 Streptococcus pyogenes NZ131, complete genome
AATATTCAGATAATT
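
For anyone who prefers to stay in Python, the same request can be sketched with the standard library alone. This is only an illustration: the function names `build_efetch_url`, `parse_fasta_sequence` and `fetch_region` are hypothetical, and the query parameters are simply the ones from the wget command above (where `seq_start` and `seq_stop` are 1-based and inclusive).

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, start, stop):
    """Build the efetch URL for a sub-region, mirroring the wget call."""
    params = {
        "db": "nuccore",
        "id": accession,
        "rettype": "fasta",
        "seq_start": start,  # 1-based, inclusive
        "seq_stop": stop,    # 1-based, inclusive
    }
    return EFETCH_URL + "?" + urlencode(params)


def parse_fasta_sequence(fasta_text):
    """Return the sequence of a single-record FASTA string, header dropped."""
    lines = [line.strip() for line in fasta_text.strip().splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


def fetch_region(accession, start, stop):
    """Download the region from NCBI (needs network access)."""
    with urlopen(build_efetch_url(accession, start, stop)) as response:
        return parse_fasta_sequence(response.read().decode("ascii"))


# Example call (not executed here because it contacts NCBI):
# fetch_region("NC_011375.1", 259882, 259896)
```

Because the server does the slicing, no 0-based adjustment is needed and only the requested bases are downloaded rather than the whole genome.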