I have a FASTA file named 119XCA.fasta, as shown below:
>cellulase
ATGCTA
>gyrase
TGATGCT
>16s
TAGTATG
I need to remove all the FASTA headers, keep the sequences in their original order, and write the file name as a single FASTA header. The expected output is shown below:
>119XCA
ATGCTA
TGATGCT
TAGTATG
I have used the following command:
sed '/^>/d' foo.fa > out.fa
This removes the FASTA headers, but I do not know how to write the file name as the header. Please help me do this.
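One possible way to get the expected output is to print the header yourself and then append the header-stripped sequences. This is only a minimal sketch: the function name merge_fasta is made up for illustration, and it assumes the header should be the file name with its extension removed.

```shell
# Hypothetical helper (the name merge_fasta is an assumption, not from the thread):
# prints one header built from the file name, then every non-header line.
merge_fasta() {
    infile=$1
    name=$(basename "$infile")      # drop any leading directories
    name=${name%.*}                 # drop the extension: 119XCA.fasta -> 119XCA
    printf '>%s\n' "$name"          # the single new header
    sed '/^>/d' "$infile"           # same header-stripping sed as in the question
}

# Usage: merge_fasta 119XCA.fasta > 119XCA_merged.fasta
```

Writing to a differently named output file avoids overwriting the input while it is still being read.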
(Note the first 3 sed calls are useful for concatenating any fasta.)

I know this is super old, dunno if anyone will see but I'll give it a try.
I liked this one-liner and tried it. It works in the sense that it deletes all the headers in the multi-FASTA, concatenates the sequences into one big sequence, and puts a single header at the beginning. It's just that, for some reason, the header and the file name are authorized_keys.fa instead of the original file name. Does anyone know why?

This is what I am working on: a multi-FASTA file of 8 genes. Every sequence has the same header (the species name), which is also the name of the multi-FASTA file. So the file name is Lkooheri.fasta and it looks like:

How are you calling the one-liner?
There's no part of the command which could create the string authorized_keys.fa de novo, so it must be coming from files in your local environment (authorized_keys is part of the SSH config).

Thank you for getting back to me :) I ran the command from the terminal exactly as you wrote it, while in the same folder as my multi-FASTA. I checked the number of amino acid residues in my new mono-FASTA file (called authorized_keys) and it is the same as in the multi-FASTA, so it works, and I changed the header and the file name manually. It's just bugging me what is wrong. :) I am not a Linux expert so I can't figure this out on my own, but at least it's working.
Did you save the code as a file and then run it like
bash scriptname.sh /path/to/files/*.fasta
?

You only need the first sed command from @Joe's example to get the result. Save the output to a new file by adding > new.fa at the end of the command below.

Hi, thank you for the suggestion. It looks easier indeed. I tried your command and yes, it concatenates all the sequences in the file into one big sequence and leaves just one header at the top. The difference is that blank lines are left where the end of each sequence used to be in the original file, and I'm not sure if that will interfere with my downstream analysis (aligning with other sequences). Ignore the asterisks.
Yeah, this is what the other elements of the subsequent sed commands deal with (linearising, and then wrapping back to 80 chars).

Just FYI, I don't think that command will deal with the stop codons, so they may persist in the final sequence.
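For anyone who wants the linearising and re-wrapping step spelled out, here is a rough sketch using tr and fold instead of the sed chain discussed above. It is not @Joe's command; the function name merge_and_wrap is made up, and deleting every * (including internal stop codons) is an assumption that may not suit every analysis.

```shell
# Hypothetical helper (not the command from this thread): header from the
# file name, sequence joined into one string and re-wrapped at 80 columns.
merge_and_wrap() {
    infile=$1
    name=$(basename "$infile")
    name=${name%.*}                 # Lkooheri.fasta -> Lkooheri
    printf '>%s\n' "$name"
    # Drop header lines, then delete newlines, carriage returns, tabs, spaces
    # and '*' characters so the sequence is one continuous string.
    # ASSUMPTION: every '*' is unwanted; remove it from the tr set to keep them.
    sed '/^>/d' "$infile" | tr -d '\n\r\t *' | fold -w 80
    printf '\n'                     # fold leaves the last line unterminated here
}

# Usage: merge_and_wrap Lkooheri.fasta > Lkooheri_merged.fasta
```

Because all line breaks are removed before wrapping, the blank lines between the original records disappear as well.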
Just to belabour the point: I rechecked this and the code definitely works as intended. For piping, it can be simplified so that it does not create the file, like so:
Note that the command above uses gsed, which is what you will need if you are on a non-GNU system (there, GNU sed is typically installed under the name gsed).
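If the same script has to run on both kinds of system, one option is to pick the binary at run time. This is just a sketch; the variable name SED is an assumption for illustration.

```shell
# Prefer gsed when it is installed (e.g. GNU sed from Homebrew on macOS),
# otherwise fall back to whatever "sed" is on the PATH.
if command -v gsed >/dev/null 2>&1; then
    SED=gsed
else
    SED=sed
fi

# Usage: "$SED" '/^>/d' input.fasta
```

On a GNU/Linux machine this normally resolves to plain sed, so the rest of the command can stay unchanged.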