How to rename duplicate fasta headers after the first underscore '_'?
2.6 years ago
MB ▴ 30

Hi,

I have a fasta file input.fa which has duplicate fasta headers.

    >UID584_Org_name_strain
    JGTTQEWEILYNMCCDASSHHGTQEFGRKKQLAADBNGTAVQEGFN
    >JGH1236_Org_name
    HFHGTSNNIGLTTREKKQLAADBNGTAVQEGFNMENDLALFVTYGAHLVDGNVLTDRLSIGGKTALTGVDP
    >JGH1236_Org_name
    FHGTSNNIGLTTREKKQLAADBNGTAVQEGFNMENDLALFVTYGAHLVDGNVLTDRLSIGGKTALTGV
    >KIL563.2_Org_name
    TTFWLMAFDSCVIIPPTREWWQQLGTGTSNNIGLTTREKKQLAADBN
    >KIL563.2_Org_name
    TTFWLMAFDSCVIIPPTREWWQQLGTGTSEKKQLAADBNGTAVQEGFNMENDLALFVTYGAHLV
    >KIL563.2_Org_name
    TTFWLMAFDSCVIIPPTREWWQQLGTGTSEKKQLAADBNGTAVQEGFNMENDLALFVTY
    >GTK584_Org_name_str
    KKQLAADBNGTAVQEGFNMENDLALFVTYGHFHGTSNNIGLTTREKKQLAAWLMAFDSCVIIPPTRE
    >GTK584_Org_name_str
    DBNGTAVQEGFNMENDLALFVTYGHFHGTSNNIGLTTREKKQLAAWLMAFDSCVIIPPTREADBNGTAVQEGFNME
    >EAD5624_Org_nam
    LTTREKKQLAADBNGTAVQEGFNMENDLALFVTYBNGTAVQEGFNMENDLALFVTYGAAWLMAFDSCVIIPPT
    >EAD5624_Org_nam
    LTTREKKQLAADBNGTAVQEGFNMENDLALFVTYBNGTAVQVIIPPTREWWQQLGTGTSEKKQLAADBNGTAVQEGF

I want to make every header unique by appending a running number (.1, .2, ...) to the ID before the first underscore '_', like this:

    >UID584.1_Org_name_strain
    JGTTQEWEILYNMCCDASSHHGTQEFGRKKQLAADBNGTAVQEGFN
    >JGH1236.1_Org_name
    HFHGTSNNIGLTTREKKQLAADBNGTAVQEGFNMENDLALFVTYGAHLVDGNVLTDRLSIGGKTALTGVDP
    >JGH1236.2_Org_name
    FHGTSNNIGLTTREKKQLAADBNGTAVQEGFNMENDLALFVTYGAHLVDGNVLTDRLSIGGKTALTGV
    >KIL563.2.1_Org_name
    TTFWLMAFDSCVIIPPTREWWQQLGTGTSNNIGLTTREKKQLAADBN
    >KIL563.2.2_Org_name
    TTFWLMAFDSCVIIPPTREWWQQLGTGTSEKKQLAADBNGTAVQEGFNMENDLALFVTYGAHLV
    >KIL563.2.3_Org_name
    TTFWLMAFDSCVIIPPTREWWQQLGTGTSEKKQLAADBNGTAVQEGFNMENDLALFVTY
    >GTK584.1_Org_name_str
    KKQLAADBNGTAVQEGFNMENDLALFVTYGHFHGTSNNIGLTTREKKQLAAWLMAFDSCVIIPPTRE
    >GTK584.2_Org_name_str
    DBNGTAVQEGFNMENDLALFVTYGHFHGTSNNIGLTTREKKQLAAWLMAFDSCVIIPPTREADBNGTAVQEGFNME
    >EAD5624.1_Org_nam
    LTTREKKQLAADBNGTAVQEGFNMENDLALFVTYBNGTAVQEGFNMENDLALFVTYGAAWLMAFDSCVIIPPT
    >EAD5624.2_Org_nam
    LTTREKKQLAADBNGTAVQEGFNMENDLALFVTYBNGTAVQVIIPPTREWWQQLGTGTSEKKQLAADBNGTAVQEGF

Could anybody suggest a sed/awk/grep command for this? Any help would be appreciated. Thanks!

fasta header duplicate sed awk • 1.2k views
2.6 years ago
5heikki 9.9k

This only works if duplicate headers are adjacent, as they are in your example; otherwise the counter restarts and the same name can be produced twice:

    awk 'BEGIN{OFS=FS="_"}{if(/^>/){CUR=$1;{if(CUR==PRE){NUM++}else{NUM=1}};$1="";print CUR"."NUM $0;PRE=CUR}else{print $0}}' in > out

Thanks! worked perfectly!
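
If the duplicates are not guaranteed to be adjacent, the same idea can be sketched in awk with a per-ID counter instead of comparing against the previous header. This is only a minimal sketch, not the answer's original command: the function name rename_dup_headers is made up for illustration, and it assumes the ID is everything before the first '_' on a header line.

```shell
# Hypothetical helper (name chosen for illustration only).
# Reads FASTA from the files given as arguments, or from stdin if none are given.
rename_dup_headers() {
    awk '
        BEGIN { FS = OFS = "_" }
        /^>/ {
            id = $1                        # text before the first "_", including ">"
            $1 = id "." (++count[id])      # append a running number kept per ID
        }
        { print }                          # sequence lines are printed unchanged
    ' "$@"
}
```

It would be called like this:

    rename_dup_headers input.fa > output.fa

Because the counter lives in an array keyed by ID, numbering continues correctly even when records with the same ID are scattered through the file.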

2.6 years ago
Carambakaracho ★ 2.7k

This works, too, and it does not require the sequences to be in any particular order. It's Perl, though...

    perl -ne 'if (/^>/){@a=split /_/; $h{$a[0]}++; $a[0].= ".".$h{$a[0]}; $s=join "_", @a; print $s;}else{print $_;}' <in >out

Thanks, worked like a charm!
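
Whichever command is used, it may be worth confirming that no duplicate headers remain in the output. A rough sketch of such a check follows; the function name list_dup_headers is hypothetical, and it compares whole header lines, not just the ID before the first '_'.

```shell
# Hypothetical helper (name chosen for illustration only).
# Prints each header line that occurs more than once, one time each.
list_dup_headers() {
    # seen[$0]++ == 1 is true exactly on the second occurrence of a header
    awk '/^>/ && seen[$0]++ == 1' "$@"
}
```

Running it on the renamed file should print nothing if every header is unique:

    list_dup_headers out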
