Entering edit mode
10.5 years ago
radha.jg
•
0
Hi,
I'm a newbie so please be patient.
I have a fasta file like this:
>gi|820716087|gb|AKG62099.1| eIF-2 alpha kinase [Leishmania donovani]
MAKKKNECHSCRLVQAYNTCENDEIKDEIDIIVNTYENVRVSGKSAAHYRVLVPLTSESHPSRRVTLEIR
VVPGYPYVVPAINLLFPPGLQPGCEGTLSEYEVKQMAKEVLNNIQPCLPSGMPCMMQIVSTVASIVECSI
DPPSQQQNGKAQGEPKVLSAGQSSSLTPVPLKAKEALKLSLFAFHLLKKCCHMKNPESNEEAASNFDWLV
KYLLDSVRIFPEAARSFFPWNGISSSRAFAANIESALALPPDQQGLPKWLWEDEGRNPRIQQGSEGRYRN
>gi|820957452|pdb|4WZH|B Chain B, Dihydroorotate Dehydrogenase From Leishmania Viannia Braziliensis
MGSSHHHHHHSSGLVPRGSHMASMTGGGQMGRGSMSLQVGILGNTFANPFMNAAGVMCSTEEELAAMTES
TSGSLITKSCTPALREGNPAPRYYTLPLGSINSMGLPNKGFDFYLAYSARHHDYSRKPLFISISGFSAEE
NAEMCKRLAPVAAEKGVILELNLSCPNVPGKPQVAYDFDAMRRYLAAISEAYPHPFGVKMPPYFDFAHFD
AAAEILNQFPKVQFITCINSIGNGLVIDVETESVVIKPKQGFGGLGGRYVFPTALANVNAFYRRCPGKLI
FGCGGVYTGEDAFLHVLAGASMVQVGTALHEEGAAIFERLTAELLDVMAKKGYKALDEFRGKVKAMD
How do I transform the names to have something like this:
>gb_AKG62099.1
MAKKKNECHSCRLVQAYNTCENDEIKDEIDIIVNTYENVRVSGKSAAHYRVLVPLTSESHPSRRVTLEIR
VVPGYPYVVPAINLLFPPGLQPGCEGTLSEYEVKQMAKEVLNNIQPCLPSGMPCMMQIVSTVASIVECSI
DPPSQQQNGKAQGEPKVLSAGQSSSLTPVPLKAKEALKLSLFAFHLLKKCCHMKNPESNEEAASNFDWLV
KYLLDSVRIFPEAARSFFPWNGISSSRAFAANIESALALPPDQQGLPKWLWEDEGRNPRIQQGSEGRYRN
>pdb_4WZH
MGSSHHHHHHSSGLVPRGSHMASMTGGGQMGRGSMSLQVGILGNTFANPFMNAAGVMCSTEEELAAMTES
TSGSLITKSCTPALREGNPAPRYYTLPLGSINSMGLPNKGFDFYLAYSARHHDYSRKPLFISISGFSAEE
NAEMCKRLAPVAAEKGVILELNLSCPNVPGKPQVAYDFDAMRRYLAAISEAYPHPFGVKMPPYFDFAHFD
AAAEILNQFPKVQFITCINSIGNGLVIDVETESVVIKPKQGFGGLGGRYVFPTALANVNAFYRRCPGKLI
FGCGGVYTGEDAFLHVLAGASMVQVGTALHEEGAAIFERLTAELLDVMAKKGYKALDEFRGKVKAMD
The idea is to have just the genebank id, or, if it's not in the name, one of the ids and where is it from
Saludos :)
Excellent. Thank you very much. I seriously need to learn how to program in awk.
In case of
sp_, how do I use an unique identifier like gi? I assume that$1and$2will doYou could check the number of fields (
NF) in a more elaborate way and use$1and$2as needed.Hey Devon Ryan,
Could you please help me with some modification of your code for my problem?
I also want to shorten the fasta file sequence header, which looks like this:
And I want the header to be this:
And I tried your code with
It gave me the results:
SO How can I cut the
[locus_tag=and]?I would be really appreciated for any help.
Thanks!
Yanfang
I think I figured this out by adapting two codes together.
Thanks all the help!
Yanfang