Question: Extracting 100bp flanking sites around aligned sequences from BLAST results using Biopython or Bioperl?
0
gravatar for rmartson
21 months ago by
rmartson20
rmartson20 wrote:

Hello, I'm very new to programming for bioinformatics but have a task I need to be able to automate. I need to blastn search a specific nucleotide sequence (X) in the human genome. Then wherever it shows up, I need to retrieve the 100bp sequences up- and downstream of its location in the genome.

I've already learned how to run a BLAST search via Python, and how Seq objects can behave as strings. But I don't know anything about BLAST data and what kind of information is included in an xml file.

I think I could achieve this task if I had three things from BLAST output:

  1. The complete sequence for every region X is found in (let's say it's called "full_seq")

  2. The location of X in the genome (a-b)

  3. The location of the surrounding region in the genome (c-d)

Then the location of the 100bp flanking sequences would be something like full_seq[a-c-99:a-c-1] and full_seq[b-c+1:b-c+99], right?

Can these three pieces of information be retrieved from an XML file or any other format of BLAST output and how? And does Python have any way of carrying out BLAST in specific organisms? I've only seen the option to change the database i.e. non-redundant, human genomic+transcript. And I've seen an answer to this question say that you can enter the entrez query ID but entrez query and organism are two separate boxes in the BLAST search menu so I don't think that's it.

I've also heard that UCSC BLAT already returns the flanking sequences but I've never seen it.

biopython bioperl • 998 views
ADD COMMENTlink modified 21 months ago by jrj.healey11k • written 21 months ago by rmartson20
1
gravatar for jrj.healey
21 months ago by
jrj.healey11k
United Kingdom
jrj.healey11k wrote:

If you want to try this with BioPython, you can reverse engineer my script below. It acts on Genbank files in this case, but it internally converts them to FASTAs, blasts them, and then 'cuts' the sequence out based on the BLAST indices, which is what you want to do. You can also specify an 'offset' to capture sequence either side of your query.

Specifically, take a look at the notation used in these sections:

def getIndices(resultHandle):

    blast_result = SearchIO.read(resultHandle, 'blast-tab')

    print(blast_result[0][0])
    start = blast_result[0][0].hit_start  # This is how SearchIO from biopython accesses the first blast result [0]
    end = blast_result[0][0].hit_end

    return start, end

and then:

def slice(start, end, genbank, FPoffset, TPoffset):
    '''Subset the provided genbank to return the sub record.'''

    seqObj = SeqIO.read(genbank, 'genbank')
    subRecord = seqObj[start-FPoffset:end+TPoffset]

    return subRecord

So here, the script will take a start and end variable, and then sub-sets the seqObj by those indices seqObj[start:end], you can see that if you minus an offset from the start (5-prime) (FPoffset) and add an offset at the 3-prime end (TPoffset), you get a small, specifiable window, either side of your blast hit. (The script defaults this to 0, so ordinarily you'd just get the query you asked for exactly.)

ADD COMMENTlink modified 21 months ago • written 21 months ago by jrj.healey11k

Thanks, sounds like just what I need. I'll be reading through until I understand

ADD REPLYlink written 21 months ago by rmartson20

Happy to explain any points that might not be clear. If you already have a blast tabular file with the sequence you'd like, we can easily come up with a function that gives you back all the sequences plus an offset, but have a go first so you can learn! :)

ADD REPLYlink written 21 months ago by jrj.healey11k

Hello, I've checked it out. So I don't think I need the convert() and runBlast() functions since I'm beginning with a blast output in an .xml format. I imported all the functions from Genbank_slicer and then did this:

qresults = SearchIO.parse('out.xml', 'blast-xml') 
SearchIO.write(qresults, 'results.tab', 'blast-tab') 
blast_result = SearchIO.read('results.tab', 'blast-tab')
getIndices('results.tab')

So this will get return the start and end indices of the very first hit in the results.tab file. In my case the first hit is just the same sequence as my query so I don't get any flanks. In order to get the start and end for other hits I need to change the function so it returns blast_result[2][0].hit_start, blast_result[3][0].hit_start and so on. I guess I could use some kind of loop to do this and return the start and end for every hit. The numbers here just stand for rows and columns, right? I'm not worried so much about this part right now, although I can't figure out how to convert out.xml to a tabular format object without writing to a file and reading that first.

I'm stuck currently at the next function slice(). I don't want to slice parts out of the original genbank file, but the sequences in the blast output I got later. If I try it out on my blast output in genbank format: #This is after reading the blast results as tabular data getIndices('results.tab') slice(start, end, "sequence.gb", 100, 100) I get the result "NameError: name 'start' is not defined". This is kinda confusing because the variables returned by getIndices() and the arguments for slice() have the same name. But shouldn't "start" and "end" already be defined? And this function only looks at one sequence while I need to be able to load and slice every single sequence in a large genbank file.

For slice() would this work to return 100bp flanks on either side put together?

subRecord = seqObj[start-offset:start]+seqObj[end:end:+offset]

Where I've replaced FPoffset and TP offset with just offset (which is equal to 100).

ADD REPLYlink modified 21 months ago • written 21 months ago by rmartson20
0
gravatar for Pierre Lindenbaum
21 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum118k wrote:

Can these three pieces of information be retrieved from an XML file or any other format of BLAST output and how?

yes, XSLT https://www.google.fr/search?q=site%3Abiostars.org+blast+xslt

once you get the region extends it and call blastdbcmd https://www.ncbi.nlm.nih.gov/books/NBK279689/ to extract the region.

ADD COMMENTlink written 21 months ago by Pierre Lindenbaum118k

Thanks, but I'm getting a strange format using that: https://pastebin.com/HbQyqZgZ

ADD REPLYlink written 21 months ago by rmartson20

xslt is a generic technolgy. You can't take any xslt-program and expect that it will do what you want. here, You have used a xslt transforming a blast to html....

ADD REPLYlink written 21 months ago by Pierre Lindenbaum118k

the following xslt stylesheet with extract the hit id/start-100/end+100 from your XML:

ADD REPLYlink written 21 months ago by Pierre Lindenbaum118k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1416 users visited in the last hour