Question

How to remove fasta headers in a multifasta file and write file name as a fasta header?

1

3.5 years ago

Kumar ▴ 120

I have a FASTA file named 119XCA.fasta, shown below:

>cellulase
ATGCTA
>gyrase
TGATGCT
>16s
TAGTATG

I need to remove all the FASTA headers, keep the sequences one after another, and write the file name as the single FASTA header. The expected output is shown below:

>119XCA
ATGCTA
TGATGCT
TAGTATG

I used the command sed '/^>/d' foo.fa > out.fa, which removes the FASTA headers, but I do not know how to write the file name as the header. Please help me do that.

gene sequence genome alignment next-gen • 2.4k views

ADD COMMENT • link updated 7 months ago by Joe 21k • written 3.5 years ago by Kumar ▴ 120

GenoMax · Accepted Answer · 2020-10-12

3

3.5 years ago

Joe 21k

Not the prettiest code in the world, but this will work.

Run it like so: bash scriptname.sh /path/to/files/*.fasta

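# Loop over every file the shell passes in; "$1" alone would only see the first match of the glob.
for file in "$@" ; do
    # File name without directory or .fasta extension, used for the header and the output file.
    name=$(basename "$file" .fasta)
    sed -e '1!{/^>.*/d;}' "$file" | \
        sed ':a;N;$!ba;s/\n//2g' | \
        sed '1!s/.\{80\}/&\n/g' | \
        sed "1s|^>.*$|>${name}|" > "${name}.fa"
done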

You can also do it as a one-liner for a single file if needed. Set file first, because the header and the output name are both derived from that variable:

file=filename.fasta; name=$(basename "$file" .fasta); sed -e '1!{/^>.*/d;}' "$file" | sed ':a;N;$!ba;s/\n//2g' | sed '1!s/.\{80\}/&\n/g' | sed "1s|^>.*$|>${name}|" > "${name}.fa"
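
For readers who find the sed loop idiom hard to follow, here is a rough sketch of the same three steps (drop headers, join into one line, wrap at 80 columns) using grep, tr and fold. It assumes Bash; concat_fasta is a hypothetical helper name, not something from the answer above, and it strips whatever the last extension is rather than only .fasta.

```shell
# Hypothetical helper: concat_fasta FILE [WIDTH]
# Prints one FASTA record whose header is the file name without its extension.
concat_fasta() {
    local file=$1 width=${2:-80}
    local name
    name=$(basename "$file")   # drop the directory part
    name=${name%.*}            # drop the last extension (.fasta, .fa, ...)
    printf '>%s\n' "$name"
    # Remove header lines, delete all line breaks, then re-wrap to $width columns.
    grep -v '^>' "$file" | tr -d '\n\r' | fold -w "$width"
    echo                       # fold's last line has no newline, so add one
}
```

Usage would be along the lines of concat_fasta 119XCA.fasta > 119XCA.fa, with the output going wherever you redirect it.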

ADD COMMENT • link 3.5 years ago by Joe 21k

0

(Note the first 3 sed calls are useful for concatenating any fasta)

ADD REPLY • link 3.5 years ago by Joe 21k

0

I know this is super old, dunno if anyone will see but I'll give it a try.

I liked this one-liner, tried it, and it works in the sense that it deletes all headers in the multifasta and concatenates the sequences into one big sequence with a single header at the beginning. It's just that for some reason the header and the file name are authorized_keys.fa instead of the original file name. Does anyone know why?

This is what I'm working on: a multifasta file of 8 genes. Every sequence has the same header (the species name), and this is also the name of the multifasta file. So the file name is Lhookeri.fasta and it looks like:

>Lhookeri
RRKVN...
>Lhookeri
STLGKLLP...
>Lhookeri
VKEFG...
>Lhookeri
LIRMDACIA...
>Lhookeri
RRKVN...
>Lhookeri
STLGKLLP...
>Lhookeri
VKEFG...
>Lhookeri
LIRMDACIA...

ADD REPLY • link updated 7 months ago by GenoMax 141k • written 7 months ago by Lada ▴ 30

0

How are you calling the oneliner?

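There's no part of the command which could create the string authorized_keys.fa de novo, so it must be coming from your local environment (authorized_keys is part of the SSH config). The header and output name are built from the shell variable file, so if file is not set to your FASTA file in that terminal session, whatever value it already holds gets used instead.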
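
A small sketch of how that could happen, assuming (hypothetically) that the variable file was left pointing at an SSH file from earlier work in the same shell. The path below is made up for illustration; the two expansions are the ones a command of the form > $(basename "${file##*/}" ".fasta").fa relies on.

```shell
# Hypothetical stale value left over in the shell session
file="/home/user/.ssh/authorized_keys"

# ${file##*/} strips everything up to the last slash
header=">${file##*/}"

# basename only removes ".fasta" if the name ends with it, so nothing is stripped here
outfile="$(basename "${file##*/}" ".fasta").fa"

echo "$header"    # >authorized_keys
echo "$outfile"   # authorized_keys.fa
```

Running echo "$file" before the one-liner shows what the variable currently holds.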

ADD REPLY • link 7 months ago by Joe 21k

0

thank you for getting back to me :) I ran the command from the terminal exactly as you wrote it, while in the same folder as my multifasta. I checked the number of amino acid residues in my new mono-fasta file (called authorized keys) and it is the same as in multifasta so it works, and I changed the name of the header and file name manually, it's just bugging me what is wrong. :) I am not a Linux expert so can't figure this out on my own but at least it's working.

ADD REPLY • link 7 months ago by Lada ▴ 30

0

Did you save the code as a file and then run it like bash scriptname.sh /path/to/files/*.fasta?

ADD REPLY • link 7 months ago by WouterDeCoster 47k

0

$ more te.fa
>Lhookeri
RRKVN...
>Lhookeri
STLGKLLP...
>Lhookeri
VKEFG...
>Lhookeri
LIRMDACIA...
>Lhookeri
RRKVN...
>Lhookeri
STLGKLLP...

You only need the first sed command from @Joe's example to get the result. To save the output to a new file, add > new.fa at the end of the command below.

$ cat te.fa | sed -e '1!{/^>.*/d;}' 
>Lhookeri
RRKVN...
STLGKLLP...
VKEFG...
LIRMDACIA...
RRKVN...
STLGKLLP...
VKEFG...
LIRMDACIA...

ADD REPLY • link 7 months ago by GenoMax 141k

0

Hi, thank you for the suggestion. It looks easier indeed. I tried your command and yes, it concatenates all the records in the file into one big sequence and leaves just one header at the top. The difference is that there are gaps left at the lines where the end of one sequence used to be in the original file, and I'm not sure if that will interfere with my downstream analysis (aligning with other sequences). Ignore the asterisks in the screenshot.

ADD REPLY • link 7 months ago by Lada ▴ 30

0

Yeah this is what the other elements of the subsequent sed commands deal with (linearising, and then wrapping back to 80 chars).

Just FYI, I don't think that command will deal with the stop codons, so they may persist in the final sequence.

ADD REPLY • link 7 months ago by Joe 21k

0

Just to belabour the point - I rechecked this and the code definitely works as intended. For piping, it can be simplified so that it does not create the file, like so:

$ cat test.fa
>Header_1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>Header_2
CCCCCCDDDDDD
>Header_3
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

$ cat test.fa | gsed -e '1!{/^>.*/d;}' | gsed ':a;N;$!ba;s/\n//2g' | gsed '1!s/.\{80\}/&\n/g'  

>Header_1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
CCCCCCDDDDDDEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG

Note that the command above uses gsed (GNU sed), which is what you will need on systems where the default sed is not GNU sed, such as macOS.
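
If the same command has to run on both kinds of system, one option is to pick the binary at run time. This is only a sketch: SED is a variable name chosen here for illustration, and the fallback assumes the plain sed on the machine is GNU sed, which is not guaranteed.

```shell
# Prefer gsed when it is installed, otherwise fall back to plain sed
# (assumption: on systems without gsed, sed itself is GNU sed).
if command -v gsed >/dev/null 2>&1; then
    SED=gsed
else
    SED=sed
fi

# Same first two steps as above: keep only the first header, then join the sequence lines.
joined=$(printf '>Header_1\nAAAA\n>Header_2\nCCCC\nDDDD\n' |
    "$SED" -e '1!{/^>.*/d;}' |
    "$SED" ':a;N;$!ba;s/\n//2g')
printf '%s\n' "$joined"
```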

ADD REPLY • link 7 months ago by Joe 21k

score 2 · Accepted Answer · 2020-10-12

2

3.5 years ago

Shred ★ 1.4k

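Assuming you're using Bash, use basename to get the file name without the path, and cut to drop the extension. Like:

filename=$(basename your.fasta | cut -d'.' -f1)

Then you could replace the headers using sed (the > is not escaped, because \> is a word boundary in GNU sed, and it is kept in the replacement so the line stays a valid header):

sed -i "s/^>.*$/>$filename/" your.fasta

Remember to use double quotes so that the shell expands variables inside the sed expression.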

ADD COMMENT • link 3.5 years ago by Shred ★ 1.4k

0

I don't think this will concatenate the sequence?

ADD REPLY • link 3.5 years ago by Joe 21k

0

He said he's already got the concatenated file.

ADD REPLY • link 3.5 years ago by Shred ★ 1.4k

score 2 · Accepted Answer · 2020-10-12

2

3.5 years ago

cpad0112 21k

try this:

$ cat test.fa
>cellulase
ATGCTA
>gyrase
TGATGCT
>16s
TAGTATG

$  awk 'BEGIN {print ">"ARGV[1]};!/^>/{print}' test.fa

>test.fa
ATGCTA
TGATGCT
TAGTATG

$ cat <(echo ">$(basename test.fa)") <(grep -v "^>" test.fa)
>test.fa
ATGCTA
TGATGCT
TAGTATG
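
Both commands above put the full file name, extension included, in the header. If the header should be the bare name as in the question (>119XCA), a possible variation of the awk approach is sketched below. header_from_filename is a hypothetical function name, and the two sub() calls assume the name has a single extension to strip.

```shell
# Hypothetical helper: print each file with one header built from its name.
header_from_filename() {
    awk '
        FNR == 1 {                        # first line of each input file
            name = FILENAME
            sub(/.*\//, "", name)         # drop any leading directory
            sub(/\.[^.]*$/, "", name)     # drop the last extension
            print ">" name
        }
        !/^>/                             # print every non-header line unchanged
    ' "$@"
}
```

Called as header_from_filename 119XCA.fasta > out.fa, this keeps the sequences on separate lines rather than joining them.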

ADD COMMENT • link 3.5 years ago by cpad0112 21k