Question: Edit header of multifasta file
fec2 wrote (9 months ago):

Hi, I have a multi-FASTA file and I need to delete part of the header of every record. For example:

>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>
GCGAATGAAGTTATTGGCCTAGTAACAGAAAGGGATATAAAAAACGCCCTTCCATCTTCC
CTGCTC------AAA

I need to delete everything from the first "-" onward in each header, i.e. "-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>" and "-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>".

I tried

cut -d '-' -f 1 your_file.fasta > new_file.fasta

and

awk '{split($0,a,"-"); if(a[1]) print ">"a[1]; else print; }' my_file.fasta > new_file.fasta

but because this is an alignment file, both commands also act on the "-" gap characters in my sequence lines, which of course I don't want.

Thanks for your help!

Best regards,

Felix

Tags: alignment, sequence

Try the solutions in this thread (modify as needed): "Fasta header trimming"

There are multiple other threads on FASTA header manipulation. Please use Google to search Biostars for them.

(comment by genomax, 9 months ago)

Thanks. I am trying to use the "cut" command. However, if I use `cut -d '-' -f1 your_file.fasta > new_file.fasta`, it also cuts at the "-" in my sequences. Is there an option that makes cut apply only to the FASTA headers?

(reply by fec2, 9 months ago)

Apologies. I did not realize that you have "-" elsewhere in your sequences.

(reply by genomax, 9 months ago)
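On the question of restricting `cut` to headers: `cut` cannot select lines by pattern, so the usual workaround is a tool that can, such as awk or sed. Below is a minimal sketch of a corrected version of the awk attempt from the question, assuming every header starts with ">" and everything from the first "-" onward should be dropped. The file names are the placeholders used in the question, and the sample record is copied from it.

```shell
# Build a small test alignment (one record taken from the question).
cat > my_file.fasta <<'EOF'
>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
EOF

# On header lines only, delete everything from the first "-" to the end of
# the line; then print every line, so sequence lines pass through untouched.
awk '/^>/ { sub(/-.*/, "") } { print }' my_file.fasta > new_file.fasta

cat new_file.fasta
```

The `/^>/` pattern is what protects the gap characters: the `sub()` call never runs on sequence lines.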
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (9 months ago):

try

 sed '/^>/ s/-.*//' input.fa

"For the lines starting with '>', substitute everything from the first "-" to the end of the line with an empty string."
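To confirm that only the headers change, the sed approach can be checked on the two records from the question. This is a sketch under stated assumptions: the file names (`input.fa`, `trimmed.fa`, `seq_before.txt`, `seq_after.txt`) are placeholders, and the duplicate-header check is an extra precaution not mentioned in the thread, since trimming can make two headers identical.

```shell
# Sample input: the two records shown in the question.
cat > input.fa <<'EOF'
>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>
GCGAATGAAGTTATTGGCCTAGTAACAGAAAGGGATATAAAAAACGCCCTTCCATCTTCC
CTGCTC------AAA
EOF

# Trim headers only: the /^>/ address limits the substitution to header lines.
sed '/^>/ s/-.*//' input.fa > trimmed.fa

# Check 1: the non-header (sequence) lines should be byte-identical.
grep -v '^>' input.fa   > seq_before.txt
grep -v '^>' trimmed.fa > seq_after.txt
cmp -s seq_before.txt seq_after.txt && echo "sequences unchanged"

# Check 2: list any header that is no longer unique after trimming
# (prints nothing when all trimmed headers are distinct).
grep '^>' trimmed.fa | sort | uniq -d
```

If the second check prints anything, the trimmed names collide and a longer prefix of the original header would need to be kept.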


Worked well! Thanks for that.

(reply by fec2, 9 months ago)
lakhujanivijay (India) wrote (9 months ago):

Using seqkit

seqkit replace -p '(^[^-]+).*' -r '${1}'  <your_fasta_file>

output

>Viridibacillus_arenosi_FSL_R5_0213
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317
GCGAATGAAGTTATTGGCCTAGTAAC