Question: Make FASTA sequence names short
radha.jg (Uruguay) wrote, 5.5 years ago:

Hi,

I'm a newbie, so please be patient.

I have a fasta file like this:

>gi|820716087|gb|AKG62099.1| eIF-2 alpha kinase [Leishmania donovani]
MAKKKNECHSCRLVQAYNTCENDEIKDEIDIIVNTYENVRVSGKSAAHYRVLVPLTSESHPSRRVTLEIR
VVPGYPYVVPAINLLFPPGLQPGCEGTLSEYEVKQMAKEVLNNIQPCLPSGMPCMMQIVSTVASIVECSI
DPPSQQQNGKAQGEPKVLSAGQSSSLTPVPLKAKEALKLSLFAFHLLKKCCHMKNPESNEEAASNFDWLV
KYLLDSVRIFPEAARSFFPWNGISSSRAFAANIESALALPPDQQGLPKWLWEDEGRNPRIQQGSEGRYRN

>gi|820957452|pdb|4WZH|B Chain B, Dihydroorotate Dehydrogenase From Leishmania Viannia Braziliensis
MGSSHHHHHHSSGLVPRGSHMASMTGGGQMGRGSMSLQVGILGNTFANPFMNAAGVMCSTEEELAAMTES
TSGSLITKSCTPALREGNPAPRYYTLPLGSINSMGLPNKGFDFYLAYSARHHDYSRKPLFISISGFSAEE
NAEMCKRLAPVAAEKGVILELNLSCPNVPGKPQVAYDFDAMRRYLAAISEAYPHPFGVKMPPYFDFAHFD
AAAEILNQFPKVQFITCINSIGNGLVIDVETESVVIKPKQGFGGLGGRYVFPTALANVNAFYRRCPGKLI
FGCGGVYTGEDAFLHVLAGASMVQVGTALHEEGAAIFERLTAELLDVMAKKGYKALDEFRGKVKAMD

How do I transform the names to get something like this:

>gb_AKG62099.1
MAKKKNECHSCRLVQAYNTCENDEIKDEIDIIVNTYENVRVSGKSAAHYRVLVPLTSESHPSRRVTLEIR
VVPGYPYVVPAINLLFPPGLQPGCEGTLSEYEVKQMAKEVLNNIQPCLPSGMPCMMQIVSTVASIVECSI
DPPSQQQNGKAQGEPKVLSAGQSSSLTPVPLKAKEALKLSLFAFHLLKKCCHMKNPESNEEAASNFDWLV
KYLLDSVRIFPEAARSFFPWNGISSSRAFAANIESALALPPDQQGLPKWLWEDEGRNPRIQQGSEGRYRN

>pdb_4WZH
MGSSHHHHHHSSGLVPRGSHMASMTGGGQMGRGSMSLQVGILGNTFANPFMNAAGVMCSTEEELAAMTES
TSGSLITKSCTPALREGNPAPRYYTLPLGSINSMGLPNKGFDFYLAYSARHHDYSRKPLFISISGFSAEE
NAEMCKRLAPVAAEKGVILELNLSCPNVPGKPQVAYDFDAMRRYLAAISEAYPHPFGVKMPPYFDFAHFD
AAAEILNQFPKVQFITCINSIGNGLVIDVETESVVIKPKQGFGGLGGRYVFPTALANVNAFYRRCPGKLI
FGCGGVYTGEDAFLHVLAGASMVQVGTALHEEGAAIFERLTAELLDVMAKKGYKALDEFRGKVKAMD

The idea is to keep just the GenBank ID or, if the name doesn't contain one, one of the other IDs together with the database it comes from.

Regards :)


Excellent. Thank you very much.

I seriously need to learn how to program in awk.

written 5.5 years ago by radha.jg

In the case of sp_, how do I use a unique identifier like gi?

written 5.5 years ago by radha.jg

I assume that $1 and $2 will do?

written 5.5 years ago by radha.jg
Devon Ryan (Freiburg, Germany) wrote, 5.5 years ago:
awk 'BEGIN{FS="|"}{if(NF>1) {printf(">%s_%s\n", $3, $4)}else{print $0}}' foo.fa > fixed.fa
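
To see why $3 and $4 are the fields that matter in the one-liner above, it can help to print how awk splits a header when the field separator is "|". The following is only a small sketch: show_fields is a made-up helper name, and the input is the first header from the question.

```shell
# Hypothetical helper: list every "|"-separated field of each header line
# together with its awk field number.
show_fields() {
  awk 'BEGIN { FS = "|" }
       /^>/ { for (i = 1; i <= NF; i++) printf("$%d = [%s]\n", i, $i) }'
}

# Demo with the first header from the question.
printf '%s\n' '>gi|820716087|gb|AKG62099.1| eIF-2 alpha kinase [Leishmania donovani]' | show_fields
# $1 = [>gi]
# $2 = [820716087]
# $3 = [gb]
# $4 = [AKG62099.1]
# $5 = [ eIF-2 alpha kinase [Leishmania donovani]]
```

Sequence lines contain no "|", so they have a single field (NF is 1) and the one-liner prints them unchanged.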

In the case of sp_, how do I use a unique identifier like gi? I assume that $1 and $2 will do?

written 5.5 years ago by radha.jg

You could check the number of fields (NF) in a more elaborate way and use $1 and $2 as needed. 

written 5.5 years ago by Devon Ryan
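
One way the NF check suggested above might look is sketched here. It is an illustration and not code from the thread. It assumes that headers with four or more "|"-separated fields are gi-style (database in $3, accession in $4) and that shorter headers such as sp ones carry the database in $1 and the ID in $2. The name shorten_headers and the sp header in the demo are made up for the example.

```shell
# Hypothetical helper: shorten FASTA headers according to their field count.
shorten_headers() {
  awk 'BEGIN { FS = "|" }
  /^>/ && NF >= 2 {
    if (NF >= 4) { db = $3; id = $4 }           # gi|number|db|accession|...
    else         { db = substr($1, 2); id = $2 } # db|id|... (drop the ">")
    sub(/[ \t].*$/, "", id)                      # cut any trailing description
    print ">" db "_" id
    next
  }
  { print }                                      # sequence and other lines
  '
}

# Demo: one gi-style header from the question and one made-up sp header.
printf '%s\n' \
  '>gi|820716087|gb|AKG62099.1| eIF-2 alpha kinase [Leishmania donovani]' \
  'MAKKKNECHS' \
  '>sp|P12345|EXAMPLE_LEIDO made-up example entry' \
  'MGSSHHHHHH' | shorten_headers
```

On a real file this would be run as shorten_headers < foo.fa > fixed.fa. Headers without any "|" are passed through untouched.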