Question: Extracting 100bp flanking sites around aligned sequences from BLAST results using Biopython or Bioperl?
gravatar for rmartson
3.3 years ago by
rmartson30 wrote:

Hello, I'm very new to programming for bioinformatics but have a task I need to be able to automate. I need to blastn search a specific nucleotide sequence (X) in the human genome. Then wherever it shows up, I need to retrieve the 100bp sequences up- and downstream of its location in the genome.

I've already learned how to run a BLAST search via Python, and how Seq objects can behave as strings. But I don't know anything about BLAST data and what kind of information is included in an xml file.

I think I could achieve this task if I had three things from BLAST output:

  1. The complete sequence for every region X is found in (let's say it's called "full_seq")

  2. The location of X in the genome (a-b)

  3. The location of the surrounding region in the genome (c-d)

Then the location of the 100bp flanking sequences would be something like full_seq[a-c-99:a-c-1] and full_seq[b-c+1:b-c+99], right?

Can these three pieces of information be retrieved from an XML file or any other format of BLAST output and how? And does Python have any way of carrying out BLAST in specific organisms? I've only seen the option to change the database i.e. non-redundant, human genomic+transcript. And I've seen an answer to this question say that you can enter the entrez query ID but entrez query and organism are two separate boxes in the BLAST search menu so I don't think that's it.

I've also heard that UCSC BLAT already returns the flanking sequences but I've never seen it.

biopython bioperl • 1.9k views
ADD COMMENTlink modified 3.3 years ago by Joe18k • written 3.3 years ago by rmartson30
gravatar for Joe
3.3 years ago by
United Kingdom
Joe18k wrote:

If you want to try this with BioPython, you can reverse engineer my script below. It acts on Genbank files in this case, but it internally converts them to FASTAs, blasts them, and then 'cuts' the sequence out based on the BLAST indices, which is what you want to do. You can also specify an 'offset' to capture sequence either side of your query.

Specifically, take a look at the notation used in these sections:

def getIndices(resultHandle):

    blast_result =, 'blast-tab')

    start = blast_result[0][0].hit_start  # This is how SearchIO from biopython accesses the first blast result [0]
    end = blast_result[0][0].hit_end

    return start, end

and then:

def slice(start, end, genbank, FPoffset, TPoffset):
    '''Subset the provided genbank to return the sub record.'''

    seqObj =, 'genbank')
    subRecord = seqObj[start-FPoffset:end+TPoffset]

    return subRecord

So here, the script will take a start and end variable, and then sub-sets the seqObj by those indices seqObj[start:end], you can see that if you minus an offset from the start (5-prime) (FPoffset) and add an offset at the 3-prime end (TPoffset), you get a small, specifiable window, either side of your blast hit. (The script defaults this to 0, so ordinarily you'd just get the query you asked for exactly.)

ADD COMMENTlink modified 3.3 years ago • written 3.3 years ago by Joe18k

Thanks, sounds like just what I need. I'll be reading through until I understand

ADD REPLYlink written 3.3 years ago by rmartson30

Happy to explain any points that might not be clear. If you already have a blast tabular file with the sequence you'd like, we can easily come up with a function that gives you back all the sequences plus an offset, but have a go first so you can learn! :)

ADD REPLYlink written 3.3 years ago by Joe18k

Hello, I've checked it out. So I don't think I need the convert() and runBlast() functions since I'm beginning with a blast output in an .xml format. I imported all the functions from Genbank_slicer and then did this:

qresults = SearchIO.parse('out.xml', 'blast-xml') 
SearchIO.write(qresults, '', 'blast-tab') 
blast_result ='', 'blast-tab')

So this will get return the start and end indices of the very first hit in the file. In my case the first hit is just the same sequence as my query so I don't get any flanks. In order to get the start and end for other hits I need to change the function so it returns blast_result[2][0].hit_start, blast_result[3][0].hit_start and so on. I guess I could use some kind of loop to do this and return the start and end for every hit. The numbers here just stand for rows and columns, right? I'm not worried so much about this part right now, although I can't figure out how to convert out.xml to a tabular format object without writing to a file and reading that first.

I'm stuck currently at the next function slice(). I don't want to slice parts out of the original genbank file, but the sequences in the blast output I got later. If I try it out on my blast output in genbank format: #This is after reading the blast results as tabular data getIndices('') slice(start, end, "", 100, 100) I get the result "NameError: name 'start' is not defined". This is kinda confusing because the variables returned by getIndices() and the arguments for slice() have the same name. But shouldn't "start" and "end" already be defined? And this function only looks at one sequence while I need to be able to load and slice every single sequence in a large genbank file.

For slice() would this work to return 100bp flanks on either side put together?

subRecord = seqObj[start-offset:start]+seqObj[end:end:+offset]

Where I've replaced FPoffset and TP offset with just offset (which is equal to 100).

ADD REPLYlink modified 3.3 years ago • written 3.3 years ago by rmartson30
gravatar for Pierre Lindenbaum
3.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum131k wrote:

Can these three pieces of information be retrieved from an XML file or any other format of BLAST output and how?

yes, XSLT

once you get the region extends it and call blastdbcmd to extract the region.

ADD COMMENTlink written 3.3 years ago by Pierre Lindenbaum131k

Thanks, but I'm getting a strange format using that:

ADD REPLYlink written 3.3 years ago by rmartson30

xslt is a generic technolgy. You can't take any xslt-program and expect that it will do what you want. here, You have used a xslt transforming a blast to html....

ADD REPLYlink written 3.3 years ago by Pierre Lindenbaum131k

the following xslt stylesheet with extract the hit id/start-100/end+100 from your XML:

ADD REPLYlink written 3.3 years ago by Pierre Lindenbaum131k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1432 users visited in the last hour