Edit header of multifasta file
5.2 years ago
fec2 ▴ 50

Hi, I have a multi-FASTA file and I need to delete part of the header of every record. For example:

>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>
GCGAATGAAGTTATTGGCCTAGTAACAGAAAGGGATATAAAAAACGCCCTTCCATCTTCC
CTGCTC------AAA

I need to delete everything from the first "-" onward in each header, i.e. "-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>" and "-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>".

I tried

cut -d '-' -f 1 your_file.fasta > new_file.fasta

and

awk '{split($0,a,"-"); if(a[1]) print ">"a[1]; else print; }' my_file.fasta > new_file.fasta

but this is an alignment file, so both commands also cut the sequence lines at the first gap character "-", which of course I don't want.

Thanks for your help!

Best regards,

Felix

sequence alignment • 2.7k views

Try the solutions in this thread (modify as needed): A: Fasta header trimming

Several other threads cover FASTA header manipulation. An external Google search restricted to Biostars will find them.


Thanks. I am trying to use the "cut" command. However, if I use: cut -d '-' -f1 your_file.fasta > new_file.fasta, it also removes the "-" in my sequences. Is there an option to make cut apply only to the FASTA headers?


Apologies. Did not realize that you have - elsewhere in your sequences.
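
On the cut question: cut has no option to act only on selected lines, so the header/sequence distinction has to be made outside of it. A minimal sketch in plain POSIX shell, assuming headers start with ">" and should be trimmed at the first "-". The function name trim_headers and the sample record are made up for illustration:

```shell
# Hypothetical helper: trim each header at the first "-", pass other lines through.
trim_headers() {
  while IFS= read -r line || [ -n "$line" ]; do
    case $line in
      '>'*) printf '%s\n' "${line%%-*}" ;;  # header: drop from the first "-" onward
      *)    printf '%s\n' "$line" ;;        # sequence: keep gap characters untouched
    esac
  done
}

# Made-up two-line record to show the behaviour.
printf '%s\n' '>seqA-extra-info <unknown description>' 'CTGCTC------AAA' | trim_headers
```

For a real file this would be called as trim_headers < my_file.fasta > new_file.fasta. A shell read loop is slow on large alignments, so sed or awk is likely the better choice there.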

5.2 years ago

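Try:

 sed '/^>/s/-.*//' input.fa

"On lines starting with '>', replace everything from the first '-' to the end of the line with an empty string." Sequence lines do not match /^>/, so their gap characters are left alone.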
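
The same header-only restriction can also be written in awk, which is closer to the attempt in the question. This is only a sketch: the file names demo.fa and demo.trimmed.fa are made up, and the sequences are shortened from the example in the question.

```shell
# Build a tiny test alignment (hypothetical file name, sequences abbreviated).
printf '%s\n' \
  '>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>' \
  'CTGCTC------AAA' \
  '>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>' \
  'CTGCTC------AAA' > demo.fa

# On header lines only, delete from the first "-" to the end of the line.
# The trailing 1 prints every line, so sequence lines pass through unchanged.
awk '/^>/ { sub(/-.*/, "") } 1' demo.fa > demo.trimmed.fa

cat demo.trimmed.fa
```

The /^>/ guard is what the original awk attempt was missing: without it, the split on "-" is applied to sequence lines as well.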


Worked well! Thanks for that.

5.2 years ago

Using seqkit

seqkit replace -p '(^[^-]+).*' -r '${1}'  <your_fasta_file>

output

>Viridibacillus_arenosi_FSL_R5_0213
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317
GCGAATGAAGTTATTGGCCTAGTAAC