Here's a slightly dodgy bash-only script, if you don't want to/can't install BioAwk/Perl/Python etc. It assumes the input files are perfectly formatted FASTA files with headers as per the OP, i.e. >xxxx|yyyyy|zzzzzzzzzz
(I haven't tested it with FASTA files that weren't already linearised, but it should work...)
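#!/bin/bash
# Usage: script.sh sequences.fasta
while IFS= read -r line ; do
    if [ "${line:0:1}" == ">" ] ; then
        # Header line: pick out each '|'-separated field with cut
        desc1=$(echo "$line" | cut -d '|' -f 1)
        desc2=$(echo "$line" | cut -d '|' -f 2)
        desc3=$(echo "$line" | cut -d '|' -f 3)
        # Print the header once, with fields 2 and 3 swapped
        echo "$desc1|$desc3|$desc2"
    else
        # Sequence line: re-emit unchanged
        echo "$line"
    fi
done < "$1"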
In this script, the FASTA file is looped over line by line. If it hits a header line (i.e. one that starts with a ">"), which in bash is tested here with ${line:0:1} (the substring starting at offset 0 with length 1, AKA the first character), the string is split on the | (pipe symbol) delimiter, and cut then picks out each section of the header and assigns it to a variable.
If the line didn't match the ">", it must be a sequence line, and can just be re-emitted as it is.
echo then prints each variable, separated by | again, in the order you want (1, 2, 3 becomes 1, 3, 2).
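To see what those two pieces do on their own, here is a minimal sketch run against a made-up header. The value >sp|P12345|EXAMPLE is purely illustrative and not taken from the OP's data.

```bash
#!/bin/bash
# Hypothetical header, for illustration only
line='>sp|P12345|EXAMPLE'

# Substring expansion: offset 0, length 1, i.e. the first character
first_char="${line:0:1}"

# cut splits on '|' and -f picks one field (fields are numbered from 1)
desc1=$(echo "$line" | cut -d '|' -f 1)
desc2=$(echo "$line" | cut -d '|' -f 2)
desc3=$(echo "$line" | cut -d '|' -f 3)

# Reassemble with fields 2 and 3 swapped
reordered="$desc1|$desc3|$desc2"

echo "$first_char"   # prints: >
echo "$reordered"    # prints: >sp|EXAMPLE|P12345
```

Note that the first field keeps its leading ">", so the reordered line is still a valid header.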
EDIT: a slightly neater version
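#!/bin/bash
# Usage: script.sh sequences.fasta
# Note: the header is emitted together with each sequence line, so this
# version assumes linearised FASTA (one sequence line per record).
while IFS= read -r line ; do
    if [ "${line:0:1}" == ">" ] ; then
        # Split the header on '|' into the array "header"
        IFS='|' read -r -a header <<< "$line"
    else
        seq="$line"
        echo -e "${header[0]}|${header[2]}|${header[1]}\n$seq"
    fi
done < "$1"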
Same process basically, except this time we can make use of the inbuilt "IFS" (internal field separator), and tell it to use | instead. In conjunction with read -a we can turn a string, in this case the header line, into an array, saving us a few lines of code.
We also don't have to handle multiple variable names explicitly, since bash array syntax lets you address individual items with "${array[1]}", for example (indices are zero-based, so this is actually the second element). "${array[@]}" would refer to every element.
If we include a "\n", we can also emit the FASTA record in a single echo line, though I personally would argue this is less neat (certainly less readable).
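As a quick illustration of the array handling described above, this sketch splits a made-up header (>sp|P12345|EXAMPLE is only a placeholder, not real data) and shows indexing plus "${header[@]}". It needs bash, since here-strings and arrays are not available in plain sh.

```bash
#!/bin/bash
# Hypothetical header, for illustration only
line='>sp|P12345|EXAMPLE'

# IFS is set only for this one read command, so the shell-wide IFS is untouched
IFS='|' read -r -a header <<< "$line"

echo "${#header[@]}"   # number of elements: 3
echo "${header[1]}"    # second element (indices start at 0): P12345

# "${header[@]}" expands to every element, one word each
for field in "${header[@]}"; do
    echo "field: $field"
done

# Reassemble with the second and third fields swapped
swapped="${header[0]}|${header[2]}|${header[1]}"
echo "$swapped"        # prints: >sp|EXAMPLE|P12345
```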
That's trivial using Biopython SeqIO. Do you have any programming experience?
But soon enough someone will post an awk one-liner for this I guess.
Try something. If it doesn't work, post the code you used in your original question with any error messages. Then, hopefully someone can help. If it does work, answer your question to show what you did.
Talking to me? It's trivial.
No, WouterDeCoster. Sorry, I hit reply on your comment instead of comment on the OP.
My programming experience is fairly minimal. I'm trying to do this as part of a master's thesis. I have come across Biopython before but never used it. An awk-based solution was really what I had in mind for this.