Question: Extracting Flanking Regions using Biopython
gravatar for LindoNkambule
2.3 years ago by
South Africa
LindoNkambule0 wrote:

Hi everyone,

I have written a bash script, with a python script embedded in it, to extract flanking regions from a BED using hg19 as the reference and save these flanking regions into separate FASTA files using their record IDs. I have a FASTA file with all the flanking regions sequences upstream and downstream of my target site, my problem is when I try to save these flanking regions into separate FASTA files I only get the upstream sequences for every record id. How can I get the flanking regions sequences with the same record ID saved into different FASTA files?

Here's part of the code:

bedtools flank -i input.bed -g hg19_Ensembl.bed -b 150 > output.bed 

bedtools getfasta -fi hg19.fa -bed output.bed -fo output.fa -name  

python - <<END
from Bio import SeqIO

for seq in SeqIO.parse("output.fa", "fasta"):

    SeqIO.write(seq, + ".fa","fasta")
biopython bedtools • 881 views
ADD COMMENTlink modified 2.3 years ago by Joe18k • written 2.3 years ago by LindoNkambule0

Tool type posts are for announcements of new, finished applications. This is a question. I have changed it for you, but please bear in mind for the future.

Please post an example of your input file. My understanding is that you want to separate out each pair of sequences in to a new file, or you want every single flank as a separate fasta?

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Joe18k

I want every single flank as a separate fasta file. For every target region on my BED, I want to extract flanking regions both upstream and downstream and the first two lines of the script do this very well. The problem arises when I want to write these flanking regions into separate files, and this is because for every target region the upstream and downstream sequences have the same record ID.

Parts of the input file

chr1    2985731 3350906 PRDM16__ENST00000511072.1   0   +   2985823 3350253 0   16  129,350,51,138,103,208,148,154,1417,88,170,78,170,175,237,666,  0,116957,174919,315981,327323,333623,335571,336327,342216,345392,348660,349499,356413,356883,361704,364509,
chr1    2985741 3355183 PRDM16__ENST00000442529.2   0   +   2985823 3350375 0   17  119,350,51,135,103,208,148,154,1417,85,170,78,170,175,237,175,4886, 0,116947,174909,315974,327313,333613,335561,336317,342206,345385,348650,349489,356403,356873,361694,362788,364556,
chr1    2985741 3355183 PRDM16__ENST00000378398.3   0   +   2985823 3350375 0   18  119,350,51,135,3,103,208,148,154,1417,85,170,78,170,175,237,175,4943,   0,116947,174909,315974,325314,327313,333613,335561,336317,342206,345385,348650,349489,356403,356873,361694,362788,364499,
chr1    2985741 3355183 PRDM16__ENST00000441472.2   0   +   2985823 3350375 0   17  119,350,51,135,103,208,148,154,1417,85,170,78,170,175,237,175,4943, 0,116947,174909,315974,327313,333613,335561,336317,342206,345385,348650,349489,356403,356873,361694,362788,364499,
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by LindoNkambule0

I meant your input fasta for python, but your fasta file from bedtools has the same ID for multiple sequences?

In that case you can simply append some iterator to the output file name, else what will happen is the .write call will overwrite the file with the existing name. Something like:

for seq, i in enumerate(SeqIO.parse("output.fa", "fasta")):
    SeqIO.write(seq, + str(i) + ".fa","fasta")

Alternatively perhaps it would be more useful to demarcate which sequence is which. Assuming your fasta has the structure:


You might want to do something like:

import os

for seq in SeqIO.parse("output.fa", "fasta"):
    if not + ".fa"):
        SeqIO.write(seq, + "upstream" + ".fa","fasta")
        SeqIO.write(seq, + "downstream" + ".fa", "fasta")

This necessarily presumes an upstream sequence will always be encountered before a downstream sequence for any given ID though.

EDIT: watch out for the dodgy syntax markdown on the forum, there is meant to be a ( between .exists and

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Joe18k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1374 users visited in the last hour