Make FASTA sequence names short
6.1 years ago
radha.jg • 0

Hi,

I'm a newbie, so please be patient.

I have a FASTA file like this:

>gi|820716087|gb|AKG62099.1| eIF-2 alpha kinase [Leishmania donovani]
MAKKKNECHSCRLVQAYNTCENDEIKDEIDIIVNTYENVRVSGKSAAHYRVLVPLTSESHPSRRVTLEIR
VVPGYPYVVPAINLLFPPGLQPGCEGTLSEYEVKQMAKEVLNNIQPCLPSGMPCMMQIVSTVASIVECSI
DPPSQQQNGKAQGEPKVLSAGQSSSLTPVPLKAKEALKLSLFAFHLLKKCCHMKNPESNEEAASNFDWLV
KYLLDSVRIFPEAARSFFPWNGISSSRAFAANIESALALPPDQQGLPKWLWEDEGRNPRIQQGSEGRYRN

>gi|820957452|pdb|4WZH|B Chain B, Dihydroorotate Dehydrogenase From Leishmania Viannia Braziliensis
MGSSHHHHHHSSGLVPRGSHMASMTGGGQMGRGSMSLQVGILGNTFANPFMNAAGVMCSTEEELAAMTES
TSGSLITKSCTPALREGNPAPRYYTLPLGSINSMGLPNKGFDFYLAYSARHHDYSRKPLFISISGFSAEE
NAEMCKRLAPVAAEKGVILELNLSCPNVPGKPQVAYDFDAMRRYLAAISEAYPHPFGVKMPPYFDFAHFD
AAAEILNQFPKVQFITCINSIGNGLVIDVETESVVIKPKQGFGGLGGRYVFPTALANVNAFYRRCPGKLI
FGCGGVYTGEDAFLHVLAGASMVQVGTALHEEGAAIFERLTAELLDVMAKKGYKALDEFRGKVKAMD

How do I transform the names to get something like this:

>gb_AKG62099.1
MAKKKNECHSCRLVQAYNTCENDEIKDEIDIIVNTYENVRVSGKSAAHYRVLVPLTSESHPSRRVTLEIR
VVPGYPYVVPAINLLFPPGLQPGCEGTLSEYEVKQMAKEVLNNIQPCLPSGMPCMMQIVSTVASIVECSI
DPPSQQQNGKAQGEPKVLSAGQSSSLTPVPLKAKEALKLSLFAFHLLKKCCHMKNPESNEEAASNFDWLV
KYLLDSVRIFPEAARSFFPWNGISSSRAFAANIESALALPPDQQGLPKWLWEDEGRNPRIQQGSEGRYRN

>pdb_4WZH
MGSSHHHHHHSSGLVPRGSHMASMTGGGQMGRGSMSLQVGILGNTFANPFMNAAGVMCSTEEELAAMTES
TSGSLITKSCTPALREGNPAPRYYTLPLGSINSMGLPNKGFDFYLAYSARHHDYSRKPLFISISGFSAEE
NAEMCKRLAPVAAEKGVILELNLSCPNVPGKPQVAYDFDAMRRYLAAISEAYPHPFGVKMPPYFDFAHFD
AAAEILNQFPKVQFITCINSIGNGLVIDVETESVVIKPKQGFGGLGGRYVFPTALANVNAFYRRCPGKLI
FGCGGVYTGEDAFLHVLAGASMVQVGTALHEEGAAIFERLTAELLDVMAKKGYKALDEFRGKVKAMD

The idea is to keep just the GenBank ID or, if the name doesn't contain one, one of the other IDs together with the database it comes from.

Regards :)

sequence • 2.6k views

Excellent. Thank you very much.

I seriously need to learn how to program in awk.


In the case of sp_, how do I use a unique identifier like gi?


I assume that $1 and $2 will do?

6.1 years ago
awk 'BEGIN{FS="|"}{if(NF>1) {printf(">%s_%s\n", $3, $4)}else{print $0}}' foo.fa > fixed.fa

In the case of sp_, how do I use a unique identifier like gi? I assume that $1 and $2 will do.


You could check the number of fields (NF) in a more elaborate way and use $1 and $2 as needed. 
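
A minimal sketch of what that field-count check might look like, under some assumptions that are not from the thread: it assumes that the "sp_" case means the accession field ($4) is empty, or that the header has only three "|"-separated fields (for example a header of the form ">sp|ACCESSION|NAME"). The function name shorten_ids is made up for illustration.

```shell
# Hypothetical helper: shorten FASTA headers, falling back to $1 and $2
# when the database accession in $4 is missing.
shorten_ids() {
  awk 'BEGIN { FS = "|" }
  /^>/ {
      if (NF >= 4 && $4 != "") {
          # Usual case: >gi|number|db|accession|...  ->  >db_accession
          printf(">%s_%s\n", $3, $4)
      } else if (NF >= 2) {
          # Fallback: $1 still carries the leading ">" (e.g. ">gi"),
          # so no extra ">" is printed here.
          printf("%s_%s\n", $1, $2)
      } else {
          print            # header without any "|": leave it unchanged
      }
      next
  }
  { print }                # sequence lines and blank lines pass through
  ' "$@"
}
```

It would be called the same way as the one-liner, e.g. shorten_ids foo.fa > fixed.fa. Unlike the original command, this version only rewrites lines that start with ">", which should not matter for protein sequences since they contain no "|".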


Hey Devon Ryan,

Could you please help me modify your code for my problem? I also want to shorten the sequence headers of a FASTA file, which looks like this:

lcl|VSMA01000001.1_prot_KAB5584702.1_1 [locus_tag=GE09DRAFT_1165795] [db_xref=InterPro:IPR002198,JGIDB:Conioc1_1165795] [protein=tetrahydroxynaphthalene reductase] [protein_id=KAB5584702.1] [location=join(1826..1931,1988..2458,2736..2863,2927..3064)] [gbkey=CDS]
MPGLTTNTGKYDQIPGPLGLASASLEGKVALVTGAGRGIGREMAQELGRRGAKVIVNYANSQESAEEVVQAIKKSGSDAA
SIKANVSDVDQIVRMFDEAVKVFGKLDIVCSNSGVVSFGHVKDVTPEEFDRVFNINTRGQFFVAREAYKHLEVGGRLILM
GSITGQAKGVPKHAVYSGSKGTIETFVRCMAIDFGDKKITVNAVAPGGIKTDMYHAVCREYIPNGINLTDDEVDEYACTW
SPLHRVGLPIDIARVVCFLASQDGEWINGKVLGIDGAACM
lcl|VSMA01000001.1_prot_KAB5584705.1_4 [locus_tag=GE09DRAFT_52] [db_xref=InterPro:IPR010730,JGIDB:Conioc1_52] [protein=heterokaryon incompatibility protein-domain-containing protein] [protein_id=KAB5584705.1] [location=10796..11233] [gbkey=CDS]
MPTRLLEIDPQANSRHIRLVSDTGILLKERYAALSHCWGKSPTNTTTKAVFVSHTQGIDILSLSKTFQHTIFVTRELGIR
YLWIDSLCIIQDDEDDWKREAENMADVFANAFVTIAASASTDGDGGLFYPRALETERSGTVRWTI

And I want the headers to be:

GE09DRAFT_1165795

GE09DRAFT_52

I tried your code as awk 'BEGIN{FS=" "}{if(NF>1) {printf(">%s\n", $2)}else{print $0}}' in.fasta > out.fasta, and it gave me this result:

[locus_tag=GE09DRAFT_1165795] ...

So how can I remove the "[locus_tag=" and "]"?

I would really appreciate any help. Thanks! Yanfang


I think I figured this out by adapting two pieces of code together.

awk 'BEGIN{FS=" "}{if(NF>1) {split($2,a,"="); split(a[2],b,"]"); printf(">%s\n",b[1])}else{print $0}}' in.fasta > out.fasta

Thanks for all the help! Yanfang
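
For readers with similar headers, here is a hedged alternative sketch that searches the whole header for the locus_tag instead of relying on it being the second space-separated field. It assumes header lines start with ">" as in standard FASTA, and the function name locus_tag_headers is made up for illustration; it is not the code used in the thread.

```shell
# Hypothetical helper: replace each FASTA header by its locus_tag value.
locus_tag_headers() {
  awk '/^>/ {
      # Look for "[locus_tag=...]" anywhere in the header line.
      if (match($0, /\[locus_tag=[^]]*\]/)) {
          # Skip the 11 characters of "[locus_tag=" and drop the closing "]".
          print ">" substr($0, RSTART + 11, RLENGTH - 12)
      } else {
          print            # no locus_tag found: keep the header unchanged
      }
      next
  }
  { print }                # sequence lines pass through untouched
  ' "$@"
}
```

Usage would mirror the command above: locus_tag_headers in.fasta > out.fasta. Headers without a locus_tag are kept as they are rather than being replaced by an empty name.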
