Question

Edit header of multifasta file

0

5.2 years ago

fec2 ▴ 50

Hi, I have a multi-FASTA file and I need to delete part of the header of every record. For example:

>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>
GCGAATGAAGTTATTGGCCTAGTAACAGAAAGGGATATAAAAAACGCCCTTCCATCTTCC
CTGCTC------AAA

I need to delete everything from the first "-" onward in each header, i.e. "-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>" and "-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>".

I tried

cut -d '-' -f 1 your_file.fasta > new_file.fasta

and

awk '{split($0,a,"-"); if(a[1]) print ">"a[1]; else print; }' my_file.fasta > new_file.fasta

but this is an alignment file, so the "-" gap characters in my sequences were removed as well, which of course I don't want.

Thanks for your help!

Best regards,

Felix

sequence alignment • 2.7k views

ADD COMMENT • link updated 5.2 years ago by lakhujanivijay 5.8k • written 5.2 years ago by fec2 ▴ 50

1

Try out the solutions in this thread (modify as needed): A: Fasta header trimming

There are multiple other threads that refer to fasta header manipulation. Please use google to do an external search on biostars.

ADD REPLY • link 5.2 years ago by GenoMax 141k

0

Thanks. I am trying to use the "cut" command. However, if I use cut -d '-' -f1 your_file.fasta > new_file.fasta, it also removes the "-" in my sequences. Is there an option to make cut apply only to the FASTA headers?

ADD REPLY • link 5.2 years ago by fec2 ▴ 50

0

Apologies. I did not realize that you also have "-" in your sequences.

ADD REPLY • link 5.2 years ago by GenoMax 141k

0

5.2 years ago

lakhujanivijay 5.8k

Using seqkit

seqkit replace -p '(^[^-]+).*' -r '${1}'  <your_fasta_file>

output

>Viridibacillus_arenosi_FSL_R5_0213
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317
GCGAATGAAGTTATTGGCCTAGTAAC

ADD COMMENT • link 5.2 years ago by lakhujanivijay 5.8k

score 2 · Accepted Answer · 2019-02-13

2

5.2 years ago

Pierre Lindenbaum 161k

try

 sed '/^>/s/-.*//' input.fa

"For the lines starting with '>', substitute everything from the first "-" onward with an empty string."
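
The same header-only restriction can also be written in awk, which may be useful because the awk attempt in the question ran on every line, including the gapped sequence lines. This is only a minimal sketch: the file names example.fa and trimmed.fa are placeholders, and the records are the two copied from the question.

```shell
# Build a small alignment file to test against (records copied from the question).
cat > example.fa <<'EOF'
>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>
GCGAATGAAGTTATTGGCCTAGTAACAGAAAGGGATATAAAAAACGCCCTTCCATCTTCC
CTGCTC------AAA
EOF

# On header lines only, delete from the first "-" to the end of the line.
# The trailing 1 prints every line, so sequence lines pass through unchanged.
awk '/^>/ { sub(/-.*/, "") } 1' example.fa > trimmed.fa

cat trimmed.fa
```

Note that this assumes the part of the header you want to keep contains no "-" itself. If a species or strain name does contain one, the pattern would need to be anchored on something more specific than the first "-".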