Entering edit mode
5.5 years ago
LindoNkambule
•
0
Hi everyone,
I have written a bash script, with a python script embedded in it, to extract flanking regions from a BED using hg19 as the reference and save these flanking regions into separate FASTA files using their record IDs. I have a FASTA file with all the flanking regions sequences upstream and downstream of my target site, my problem is when I try to save these flanking regions into separate FASTA files I only get the upstream sequences for every record id. How can I get the flanking regions sequences with the same record ID saved into different FASTA files?
Here's part of the code:
bedtools flank -i input.bed -g hg19_Ensembl.bed -b 150 > output.bed
bedtools getfasta -fi hg19.fa -bed output.bed -fo output.fa -name
python - <<END
#!usr/bin/python
from Bio import SeqIO
for seq in SeqIO.parse("output.fa", "fasta"):
SeqIO.write(seq, seq.id + ".fa","fasta")
Tool
type posts are for announcements of new, finished applications. This is a question. I have changed it for you, but please bear in mind for the future.Please post an example of your input file. My understanding is that you want to separate out each pair of sequences in to a new file, or you want every single flank as a separate fasta?
I want every single flank as a separate fasta file. For every target region on my BED, I want to extract flanking regions both upstream and downstream and the first two lines of the script do this very well. The problem arises when I want to write these flanking regions into separate files, and this is because for every target region the upstream and downstream sequences have the same record ID.
Parts of the input file
I meant your input fasta for python, but your fasta file from
bedtools
has the same ID for multiple sequences?In that case you can simply append some iterator to the output file name, else what will happen is the
.write
call will overwrite the file with the existing name. Something like:Alternatively perhaps it would be more useful to demarcate which sequence is which. Assuming your fasta has the structure:
You might want to do something like:
This necessarily presumes an upstream sequence will always be encountered before a downstream sequence for any given ID though.
EDIT: watch out for the dodgy syntax markdown on the forum, there is meant to be a
(
between.exists
andseq.id
)