I have a FASTA file with hundreds of sequences and their respective headers. All of the headers are in this format:
>ABCD [id_123] (gene_XYZ) [protein_ijk] [protein_id=qqq] [123..899]
.......sequence............
>EFGH [id_999] (gene_PQR) [protein_tre] [protein_id=trs] [573..789]
......sequence............
and so on.....
In each header, the bracketed fields follow one another and are separated by a single space each (just as written above). All I want to do is keep "ABCD" (the very first field) in the header of every sequence. I want to loop through all the headers in the file and get something like this:
>ABCD
.....sequence.....
>EFGH
.....sequence.......
and so on. Any help is most appreciated. I am working with Bash and Perl.
Thanks and regards!
With reformat.sh from the BBMap suite:
reformat.sh in=your.fa out=new.fa trd=t
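If BBMap is not installed, the same trimming can probably be done with standard tools, since the question mentions Bash. The sketch below is one possible approach, not part of the answer above. It assumes the ID to keep never contains whitespace and that every header line starts with ">". The function name trim_fasta_headers is made up for illustration.

```shell
# Hypothetical helper: keep only the first whitespace-delimited word
# of each FASTA header line.
trim_fasta_headers() {
    # On lines starting with '>', delete everything from the first
    # space or tab to the end of the line. Sequence lines do not match
    # the address and are printed unchanged.
    sed -E '/^>/ s/[[:space:]].*$//' "$@"
}

# Example usage (file names are placeholders):
#   trim_fasta_headers your.fa > new.fa
```

The function reads from standard input when no file is given, so it can also sit at the end of a pipeline. Writing to a new file rather than editing in place keeps the original headers available in case the extra fields are needed later.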
I don't know why the sequence was showing next to the header when I posted this here! Of course it is a FASTA file, so the sequences are directly below the headers.
I have reformatted your post to show the correct format of fasta files.
Okay. Thanks for that...