Question

Rearranging order of descriptors in fasta file header

1

6.9 years ago

m.rhodes ▴ 50

Hi,

I'm trying to reformat my FASTA file headers in the terminal on my Mac. Currently they look like this:

tpg|Magnaporthiopsis_incrustans|JF414846
tpg|Pyricularia_pennisetigena|AB818016
tpg|Inocybe_sororia|EU525947

However, I need them to look like this, with the species last:

tpg|JF414846|Magnaporthiopsis_incrustans
tpg|AB818016|Pyricularia_pennisetigena
tpg|EU525947|Inocybe_sororia

Could anyone advise on how best to do this?

Thanks in advance!

fasta unix • 3.9k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 6.9 years ago by m.rhodes ▴ 50

0

That's trivial using Biopython SeqIO. Do you have any programming experience?

But soon enough someone will post an awk one-liner for this I guess.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

0

Try something. If it doesn't work, post the code you used in your original question with any error messages. Then, hopefully someone can help. If it does work, answer your question to show what you did.

ADD REPLY • link 6.9 years ago by st.ph.n ★ 2.7k

0

Talking to me? It's trivial.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

0

No, WouterDeCoster. Sorry, I hit reply on your comment instead of comment on the OP.

ADD REPLY • link 6.9 years ago by st.ph.n ★ 2.7k

0

My programming experience is fairly minimal. I'm trying to do this as part of a master's thesis. I have come across Biopython before but never used it. An awk-based solution was really what I had in mind for this.

ADD REPLY • link 6.9 years ago by m.rhodes ▴ 50

2

6.9 years ago

Ram 43k

Use bioawk. Pick out the sequence header with bioawk, then use another awk to split it on | and rearrange the fields into the order $1, $3, $2.

ADD COMMENT • link 6.9 years ago by Ram 43k

1

6.9 years ago

WouterDeCoster 47k

I wrote a small untested (Bio)python script, let me know if something doesn't work as expected.

ADD COMMENT • link 6.9 years ago by WouterDeCoster 47k

0

Quick question, Wouter: I was under the impression that FASTA headers are in the format >identifier description, where the string immediately following the > is the identifier and the description is separated from the identifier by a delimiter, which is usually a white space. Am I mistaken?

ADD REPLY • link 6.9 years ago by Ram 43k

0

I don't think there's any standard delimiter. Certainly whitespace wouldn't be standard: if a FASTA header contains a GenBank-style product descriptor, it will have lots of whitespace. GenBank typically uses ':' or '|', I think.

ADD REPLY • link 6.9 years ago by Joe 21k

0

I don't know if there is a convention. In hindsight I'm not sure I chose the right attribute in my code, but I haven't received any feedback from the OP yet.

ADD REPLY • link 6.9 years ago by WouterDeCoster 47k

0

I have a similar problem. My FASTA file headers look like this:

 >PF3D7_0100100.1-p1 | transcript=PF3D7_0100100.1 | gene=PF3D7_0100100 | organism=Plasmodium_falciparum_3D7 | gene_product=erythrocyte membrane protein 1, PfEMP1 | transcript_product=erythrocyte membrane protein 1, PfEMP1 | location=Pf3D7_01_v3:29510-37126(+) | protein_length=2163 | sequence_SO=chromosome | SO=protein_coding | is_pseudo=false.

However I need them to look like genome_id|protein_id ONLY to be used in blast+ and orthoMCL inputs. Problem is my FASTA headers do NOT contain genome IDs (or do they?) so how can I format it?

ADD REPLY • link updated 3.8 years ago by GenoMax 141k • written 3.8 years ago by Bioinfo_learner ▴ 40

1

You've already opened a question for this. Please do not spam other posts with your question or add comments asking users to look at your question. This is not good etiquette and repeating it will result in your user account being suspended.

ADD REPLY • link 3.8 years ago by Ram 43k

0

Thank you for letting me know. Apologies, I am new to Biostars!

ADD REPLY • link 3.8 years ago by Bioinfo_learner ▴ 40

1

3.8 years ago

JMMM ▴ 10

With awk

awk -F'|' '/^>/{print $1"|"$3"|"$2; next}{print}' foo.fasta

ADD COMMENT • link 3.8 years ago by JMMM ▴ 10

0

Hi! I liked this quick awk one-liner and it worked perfectly (I was rearranging my FASTA header from 1, 2, 3, 4 as delimited by "|" to be 2, 1, 3, 4)... except it also moved the ">" with the first chunk of text so that my headers no longer started with ">". Any suggestions on how to tweak this so my ">" is still at the beginning of my FASTA headers?

ADD REPLY • link 16 months ago by marina.good • 0

0

Add this before the header print statement: gsub(/^>/,"",$1) and then print like so: print ">"$1"|"$3"|"$2;. This removes the > from the first field and adds it back manually at the beginning of each header line.
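As a minimal sketch of that tweak, applied to the 2, 1, 3, 4 order asked about above: the function name reorder_headers and the header >a|b|c|d are made up for illustration, and the sketch assumes every header has exactly four |-separated fields (anything beyond the fourth would be dropped).

```shell
# Hypothetical helper: reads FASTA on stdin, writes reordered FASTA to stdout.
reorder_headers() {
    awk -F'|' '
        /^>/ {
            gsub(/^>/, "", $1)                   # strip ">" from the first field
            print ">" $2 "|" $1 "|" $3 "|" $4    # re-add ">" and swap fields 1 and 2
            next
        }
        { print }                                # sequence lines pass through unchanged
    '
}

# Made-up two-line record, just to show the effect
printf '>a|b|c|d\nACGT\n' | reorder_headers
```

For the order in the original question, the print statement would instead be print ">" $1 "|" $3 "|" $2.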

ADD REPLY • link 16 months ago by Ram 43k

Ram · Accepted Answer · 2017-05-12

1

6.9 years ago

Joe 21k

Here's a slightly dodgy bash-only script, if you don't want to/can't install BioAwk/Perl/Python etc. It assumes the input files are perfectly formatted FASTA files with headers as per the OP, i.e. >xxxx|yyyyy|zzzzzzzzzz

(I haven't tested it with FASTA files that weren't already linearised but it should work...)

#!/bin/bash
# Usage: script.sh sequences.fasta

while IFS= read -r line ; do
    if [[ "${line:0:1}" == ">" ]] ; then
        # Header line: split on '|' and keep the three fields
        desc1=$(echo "$line" | cut -d '|' -f 1)
        desc2=$(echo "$line" | cut -d '|' -f 2)
        desc3=$(echo "$line" | cut -d '|' -f 3)
    else
        # Sequence line: print the reordered header, then the sequence
        seq=$line
        echo "$desc1|$desc3|$desc2"
        echo "$seq"
    fi
done < "$1"

This script loops over the FASTA file line by line. A header line is one that starts with ">", which is tested here with ${line:0:1} (the substring of length 1 starting at offset 0, AKA the first character). When a header is found, cut splits it on the | (pipe symbol) delimiter, and each section of the header is assigned to its own variable.

If the string didn't match the ">", it must be a sequence line, and can just be re-emitted as it is.

echo then prints each variable, separated by | again, in the order you want (1, 2, 3 becomes 1, 3, 2).

EDIT: a slightly neater version

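#!/bin/bash
# Usage: script.sh sequences.fasta

while IFS= read -r line ; do
    if [[ "${line:0:1}" == ">" ]] ; then
        # Header line: split on '|' into the array "header"
        IFS='|' read -r -a header <<< "$line"
    else
        # Sequence line: print reordered header and sequence in one echo
        seq="$line"
        echo -e "${header[0]}|${header[2]}|${header[1]}\n$seq"
    fi
done < "$1"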

Same process basically, except this time we can make use of the inbuilt "IFS" (internal field separator), and tell it to use | instead. In conjunction with read -a we can turn a string, in this case the header line, into an array, saving us a few lines of code.

We also don't have to handle multiple variable names explicitly, since bash array syntax lets you address individual items with "${array[1]}" for example (which is zero-based, so is actually the second element). "${array[@]}" would refer to every element.

If we include a "\n", we can also emit the fasta in a single echo line, though I personally would argue this is less neat (certainly less readable).
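Both scripts above print the stored header again before every sequence line, so they rely on each record having its sequence on a single line. A hedged sketch of a variant that prints the reordered header as soon as it is read, and should therefore also cope with wrapped sequences, is below. The function name swap_fields and the short made-up sequence are for illustration only, and the headers are still assumed to have exactly three |-separated fields.

```shell
#!/bin/bash
# Hypothetical helper: reads FASTA on stdin, writes reordered FASTA to stdout.
swap_fields() {
    local line
    local -a header
    while IFS= read -r line || [[ -n "$line" ]]; do
        if [[ "$line" == ">"* ]]; then
            # Header: split on '|' into an array and print fields as 1, 3, 2
            IFS='|' read -r -a header <<< "$line"
            printf '%s|%s|%s\n' "${header[0]}" "${header[2]}" "${header[1]}"
        else
            # Sequence line: pass through untouched
            printf '%s\n' "$line"
        fi
    done
}

# Made-up record with a sequence wrapped over two lines
swap_fields <<'EOF'
>tpg|Inocybe_sororia|EU525947
ACGT
TTGA
EOF
```

To run it on a file, redirect the input, e.g. swap_fields < sequences.fasta > reordered.fasta.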

ADD COMMENT • link 6.9 years ago by Joe 21k

0

Hi! Unfortunately this didn't quite work. This is the sort of output it gave me:

tpg|Alternaria_sennae|KJ718230
||
ATCATTACACAAATATGAAGGCGGGCTGGCACCTCTCGGGGTGGCCAGCCTTGCTGAATTATTCCACCCGTGTCTTTTGCGTACTTCTTGTTTCCTTGGTGGGCTCGCCCACCACAAGGACCAACCCATAAACCTTTTTGTAATGGCAATCAGCGTCAGTAACAATGTAATAATTACAACTTTCAACAACGGATCTCTTGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAGTGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCTTTGGTATTCCAAAGGGCATGCCTGTTCGAGCGTCATTTGTACCCTCAAGCTTTGCTTGGTGTTGGGCGTCTTTTTGTCCCCCCCTTTGCGGGGAGACTCGCCTTAAAGTCATTGGCAGCCGGCCTACTGGTTTCGGAGCGCAGCACAAGTCGCGCTCTCTTCCAGCCCCAAGGTCTAGCATCCAACAAGCCTCTTTTTTTCAACT
||
tpg|Phacidium_grevilleae|KR476718
||
ATGAGATCATGCCCTTCGGGGTAGACCTCCCACCCTCTGTATACAATACCTTTGTTGCTTTGGCGGCCCCGTCGCAAGACAACCGGCTCCGGCTGGTCAGCGGCCGCCAGAGGAATCAAAACTCATATTATTATTGTCGTCTGAGTACTATATAATAGTTAAAACTTTCAACAACGGATCTCTTGGTTCTGGCATCGATGAAGAACGCAGCGAAATGCGATAAGTAATGTGAATTGCAGAATTCAGTGAATCATCGAATCTTTGAACGCACATTGCGCCCCCTGGTATTCCGGGGGGCATGCCTGTTCGAGCGTCATTACAACCCTCAAGCTCTGCTTGGTATTGGGTGTCACCCCCGGGTGCGCCTTAAAATCAGTGGCGGTGCCGTCTGGCTTCAAGCGTAGTAATACTTCTCGCTTTGGAGTCCGGGCGAGCGTCTTGCCAAAACCCCCATATTTTTTCAG

ADD REPLY • link updated 3.8 years ago by Ram 43k • written 6.9 years ago by m.rhodes ▴ 50

0

Do your FASTA sequence headers not have the standard ">"? The script will break if they don't. (I realise you didn't put them in your opening post, but I thought you may simply have neglected to include them.) You should really make sure you include them, as it makes life much easier: you can easily separate the header lines from the sequence.

I've edited my post with a neater script FYI, but it should function pretty much the same, so won't solve your exact problem unless your files follow strict FASTA format.
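A quick way to check for this, as a rough sketch: count the lines that start with ">". The helper names count_headers and add_marker are made up here, and the sed fix assumes that every header (and no sequence line) starts with tpg|, as in the opening post.

```shell
# Hypothetical helper: print the number of lines on stdin that start with ">".
count_headers() {
    grep -c '^>' || true    # grep exits non-zero when the count is 0
}

# Hypothetical helper: prepend ">" to lines starting with "tpg|"
# (assumes only header lines start that way).
add_marker() {
    sed 's/^tpg|/>&/'
}

# Made-up record without the ">" marker
printf 'tpg|Inocybe_sororia|EU525947\nACGT\n' | count_headers
printf 'tpg|Inocybe_sororia|EU525947\nACGT\n' | add_marker | count_headers
```

A count of 0 on a file that should contain sequences suggests the ">" markers are missing.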

ADD REPLY • link 6.9 years ago by Joe 21k

1

Ahh yes, that appeared to be the problem! Thanks for your help.

ADD REPLY • link 6.9 years ago by m.rhodes ▴ 50

0

If this or any of the answers here has solved your problem, make sure you accept an answer so that the thread doesn't remain open-ended.

ADD REPLY • link 6.9 years ago by Joe 21k

0

I have a similar problem. My FASTA file headers look like this:

PF3D7_0100100.1-p1 | transcript=PF3D7_0100100.1 | gene=PF3D7_0100100 | organism=Plasmodium_falciparum_3D7 | gene_product=erythrocyte membrane protein 1, PfEMP1 | transcript_product=erythrocyte membrane protein 1, PfEMP1 | location=Pf3D7_01_v3:29510-37126(+) | protein_length=2163 | sequence_SO=chromosome | SO=protein_coding | is_pseudo=false.

However I need them to look like genome_id|protein_id ONLY to be used in blast+ and orthoMCL inputs. Problem is my FASTA headers do NOT contain genome IDs (or do they?) so how can I format it?

ADD REPLY • link 3.8 years ago by Bioinfo_learner ▴ 40