Extracting 100bp flanking sites around aligned sequences from BLAST results using Biopython or Bioperl?
2
1
Entering edit mode
6.8 years ago
rmartson ▴ 30

Hello, I'm very new to programming for bioinformatics but have a task I need to be able to automate. I need to blastn search a specific nucleotide sequence (X) in the human genome. Then wherever it shows up, I need to retrieve the 100bp sequences up- and downstream of its location in the genome.

I've already learned how to run a BLAST search via Python, and how Seq objects can behave as strings. But I don't know anything about BLAST data and what kind of information is included in an xml file.

I think I could achieve this task if I had three things from BLAST output:

  1. The complete sequence for every region X is found in (let's say it's called "full_seq")

  2. The location of X in the genome (a-b)

  3. The location of the surrounding region in the genome (c-d)

Then the location of the 100bp flanking sequences would be something like full_seq[a-c-99:a-c-1] and full_seq[b-c+1:b-c+99], right?

Can these three pieces of information be retrieved from an XML file or any other format of BLAST output and how? And does Python have any way of carrying out BLAST in specific organisms? I've only seen the option to change the database i.e. non-redundant, human genomic+transcript. And I've seen an answer to this question say that you can enter the entrez query ID but entrez query and organism are two separate boxes in the BLAST search menu so I don't think that's it.

I've also heard that UCSC BLAT already returns the flanking sequences but I've never seen it.

Biopython Bioperl • 3.3k views
ADD COMMENT
2
Entering edit mode
6.8 years ago
Joe 21k

If you want to try this with BioPython, you can reverse engineer my script below. It acts on Genbank files in this case, but it internally converts them to FASTAs, blasts them, and then 'cuts' the sequence out based on the BLAST indices, which is what you want to do. You can also specify an 'offset' to capture sequence either side of your query.

Specifically, take a look at the notation used in these sections:

def getIndices(resultHandle):

    blast_result = SearchIO.read(resultHandle, 'blast-tab')

    print(blast_result[0][0])
    start = blast_result[0][0].hit_start  # This is how SearchIO from biopython accesses the first blast result [0]
    end = blast_result[0][0].hit_end

    return start, end

and then:

def slice(start, end, genbank, FPoffset, TPoffset):
    '''Subset the provided genbank to return the sub record.'''

    seqObj = SeqIO.read(genbank, 'genbank')
    subRecord = seqObj[start-FPoffset:end+TPoffset]

    return subRecord

So here, the script will take a start and end variable, and then sub-sets the seqObj by those indices seqObj[start:end], you can see that if you minus an offset from the start (5-prime) (FPoffset) and add an offset at the 3-prime end (TPoffset), you get a small, specifiable window, either side of your blast hit. (The script defaults this to 0, so ordinarily you'd just get the query you asked for exactly.)

ADD COMMENT
0
Entering edit mode

Thanks, sounds like just what I need. I'll be reading through until I understand

ADD REPLY
0
Entering edit mode

Happy to explain any points that might not be clear. If you already have a blast tabular file with the sequence you'd like, we can easily come up with a function that gives you back all the sequences plus an offset, but have a go first so you can learn! :)

ADD REPLY
0
Entering edit mode

Hello, I've checked it out. So I don't think I need the convert() and runBlast() functions since I'm beginning with a blast output in an .xml format. I imported all the functions from Genbank_slicer and then did this:

qresults = SearchIO.parse('out.xml', 'blast-xml') 
SearchIO.write(qresults, 'results.tab', 'blast-tab') 
blast_result = SearchIO.read('results.tab', 'blast-tab')
getIndices('results.tab')

So this will get return the start and end indices of the very first hit in the results.tab file. In my case the first hit is just the same sequence as my query so I don't get any flanks. In order to get the start and end for other hits I need to change the function so it returns blast_result[2][0].hit_start, blast_result[3][0].hit_start and so on. I guess I could use some kind of loop to do this and return the start and end for every hit. The numbers here just stand for rows and columns, right? I'm not worried so much about this part right now, although I can't figure out how to convert out.xml to a tabular format object without writing to a file and reading that first.

I'm stuck currently at the next function slice(). I don't want to slice parts out of the original genbank file, but the sequences in the blast output I got later. If I try it out on my blast output in genbank format: #This is after reading the blast results as tabular data getIndices('results.tab') slice(start, end, "sequence.gb", 100, 100) I get the result "NameError: name 'start' is not defined". This is kinda confusing because the variables returned by getIndices() and the arguments for slice() have the same name. But shouldn't "start" and "end" already be defined? And this function only looks at one sequence while I need to be able to load and slice every single sequence in a large genbank file.

For slice() would this work to return 100bp flanks on either side put together?

subRecord = seqObj[start-offset:start]+seqObj[end:end:+offset]

Where I've replaced FPoffset and TP offset with just offset (which is equal to 100).

ADD REPLY
0
Entering edit mode
6.8 years ago

Can these three pieces of information be retrieved from an XML file or any other format of BLAST output and how?

yes, XSLT https://www.google.fr/search?q=site%3Abiostars.org+blast+xslt

once you get the region extends it and call blastdbcmd https://www.ncbi.nlm.nih.gov/books/NBK279689/ to extract the region.

ADD COMMENT
0
Entering edit mode

Thanks, but I'm getting a strange format using that: https://pastebin.com/HbQyqZgZ

ADD REPLY
0
Entering edit mode

xslt is a generic technolgy. You can't take any xslt-program and expect that it will do what you want. here, You have used a xslt transforming a blast to html....

ADD REPLY
0
Entering edit mode

the following xslt stylesheet with extract the hit id/start-100/end+100 from your XML:

ADD REPLY

Login before adding your answer.

Traffic: 2802 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6