3.8 years ago
fec2 ▴ 40

Hi, I have a multi-FASTA file and I need to delete part of the header of every sequence in it. For example:

>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>
GCGAATGAAGTTATTGGCCTAGTAACAGAAAGGGATATAAAAAACGCCCTTCCATCTTCC
CTGCTC------AAA


I need to delete everything from the first "-" onward in each header, i.e. "-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>" and "-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>".

I tried

cut -d '-' -f 1 your_file.fasta > new_file.fasta


and

awk '{split($0,a,"-"); if(a[1]) print ">"a[1]; else print; }' my_file.fasta > new_file.fasta

but this is an alignment file, so the "-" in my sequences was removed as well, which of course I don't want.

Thanks for your help!

Best regards,
Felix

Comment: Try the solutions in this thread (modify as needed): "A: Fasta header trimming". There are multiple other threads that cover FASTA header manipulation. Please use Google to do an external search on Biostars.

Reply: Thanks. I am trying to use the "cut" command. However, if I use

cut -d '-' -f1 your_file.fasta > new_file.fasta

it also removes the "-" in my sequences. Is there an option to make cut apply only to the FASTA headers?

Reply: Apologies. I did not realize that you have "-" elsewhere in your sequences.

Answer (3.8 years ago):

Try

sed '/^>/s/\-.*//' input.fa

That is: for the lines starting with '>', substitute everything from the first "-" onward with an empty string.

Reply: Worked well! Thanks for that.

Answer (3.8 years ago):

Using seqkit:

seqkit replace -p '(^[^-]+).*' -r '${1}' <your_fasta_file>
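For comparison, the awk attempt from the question can be restricted to header lines in the same way as the two answers above. This is only a minimal sketch, not something posted in the thread: the file names are placeholders, and the sample input is a shortened copy of the first record from the question. On that example it should give the same trimmed header as the other two commands.

```shell
# Hypothetical sample input, shortened from the question (file names are placeholders).
printf '%s\n' \
  '>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>' \
  'CTGCTC------AAA' > my_file.fasta

# On header lines only (those starting with ">"), delete everything from the
# first "-" to the end of the line; every line, edited or not, is then printed.
awk '/^>/ { sub(/-.*/, "") } { print }' my_file.fasta > new_file.fasta
```

Because the `sub()` call is guarded by the `/^>/` pattern, gap characters in the aligned sequence lines are never touched.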


output

>Viridibacillus_arenosi_FSL_R5_0213
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317
GCGAATGAAGTTATTGGCCTAGTAAC