Question

Extracting Flanking Regions using Biopython

0

Entering edit mode

5.5 years ago

LindoNkambule • 0

Hi everyone,

I have written a bash script, with a python script embedded in it, to extract flanking regions from a BED using hg19 as the reference and save these flanking regions into separate FASTA files using their record IDs. I have a FASTA file with all the flanking regions sequences upstream and downstream of my target site, my problem is when I try to save these flanking regions into separate FASTA files I only get the upstream sequences for every record id. How can I get the flanking regions sequences with the same record ID saved into different FASTA files?

Here's part of the code:

bedtools flank -i input.bed -g hg19_Ensembl.bed -b 150 > output.bed 

bedtools getfasta -fi hg19.fa -bed output.bed -fo output.fa -name  

python - <<END
#!usr/bin/python
from Bio import SeqIO

for seq in SeqIO.parse("output.fa", "fasta"):

    SeqIO.write(seq, seq.id + ".fa","fasta")

Biopython bedtools • 1.9k views

ADD COMMENT • link updated 5.5 years ago by Joe 21k • written 5.5 years ago by LindoNkambule • 0

0

Entering edit mode

Tool type posts are for announcements of new, finished applications. This is a question. I have changed it for you, but please bear in mind for the future.

Please post an example of your input file. My understanding is that you want to separate out each pair of sequences in to a new file, or you want every single flank as a separate fasta?

ADD REPLY • link 5.5 years ago by Joe 21k

0

Entering edit mode

I want every single flank as a separate fasta file. For every target region on my BED, I want to extract flanking regions both upstream and downstream and the first two lines of the script do this very well. The problem arises when I want to write these flanking regions into separate files, and this is because for every target region the upstream and downstream sequences have the same record ID.

Parts of the input file

chr1    2985731 3350906 PRDM16__ENST00000511072.1   0   +   2985823 3350253 0   16  129,350,51,138,103,208,148,154,1417,88,170,78,170,175,237,666,  0,116957,174919,315981,327323,333623,335571,336327,342216,345392,348660,349499,356413,356883,361704,364509,
chr1    2985741 3355183 PRDM16__ENST00000442529.2   0   +   2985823 3350375 0   17  119,350,51,135,103,208,148,154,1417,85,170,78,170,175,237,175,4886, 0,116947,174909,315974,327313,333613,335561,336317,342206,345385,348650,349489,356403,356873,361694,362788,364556,
chr1    2985741 3355183 PRDM16__ENST00000378398.3   0   +   2985823 3350375 0   18  119,350,51,135,3,103,208,148,154,1417,85,170,78,170,175,237,175,4943,   0,116947,174909,315974,325314,327313,333613,335561,336317,342206,345385,348650,349489,356403,356873,361694,362788,364499,
chr1    2985741 3355183 PRDM16__ENST00000441472.2   0   +   2985823 3350375 0   17  119,350,51,135,103,208,148,154,1417,85,170,78,170,175,237,175,4943, 0,116947,174909,315974,327313,333613,335561,336317,342206,345385,348650,349489,356403,356873,361694,362788,364499,

ADD REPLY • link 5.5 years ago by LindoNkambule • 0

0

Entering edit mode

I meant your input fasta for python, but your fasta file from bedtools has the same ID for multiple sequences?

In that case you can simply append some iterator to the output file name, else what will happen is the .write call will overwrite the file with the existing name. Something like:

for seq, i in enumerate(SeqIO.parse("output.fa", "fasta")):
    SeqIO.write(seq, seq.id + str(i) + ".fa","fasta")

Alternatively perhaps it would be more useful to demarcate which sequence is which. Assuming your fasta has the structure:

>Region1_upstream_flank
ATCACGG...
>Region1_downstream_flank
ATCACGG...
>Region2_upstream_flank
ATCACGG...
>Region2_downstream_flank
ATCACGG...

You might want to do something like:

import os

for seq in SeqIO.parse("output.fa", "fasta"):
    if not os.path.existsseq.id + ".fa"):
        SeqIO.write(seq, seq.id + "upstream" + ".fa","fasta")
    else:
        SeqIO.write(seq, seq.id + "downstream" + ".fa", "fasta")

This necessarily presumes an upstream sequence will always be encountered before a downstream sequence for any given ID though.

EDIT: watch out for the dodgy syntax markdown on the forum, there is meant to be a `(` between `.exists` and `seq.id`)

ADD REPLY • link 5.5 years ago by Joe 21k

EDIT: watch out for the dodgy syntax markdown on the forum, there is meant to be a ( between .exists and seq.id)

EDIT: watch out for the dodgy syntax markdown on the forum, there is meant to be a `(` between `.exists` and `seq.id`)