Question: Edit header of multifasta file
19 months ago, fec2 wrote:

Hi, I have a multi-FASTA file and I need to delete part of the header of every record. For example:

>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>
GCGAATGAAGTTATTGGCCTAGTAACAGAAAGGGATATAAAAAACGCCCTTCCATCTTCC
CTGCTC------AAA

I need to delete everything from the first "-" onward in each header, i.e. "-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>" and "-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>".

I tried

cut -d '-' -f 1 your_file.fasta > new_file.fasta

and

awk '{split($0,a,"-"); if(a[1]) print ">"a[1]; else print; }' my_file.fasta > new_file.fasta

but this is an alignment file, and both commands also altered the sequence lines at the "-" gap characters, which of course I don't want.

Thanks for your help!

Best regards,

Felix


Try the solutions in this thread (modify as needed): A: Fasta header trimming

There are several other threads about FASTA header manipulation. Please use Google to search Biostars for them.

written 19 months ago by genomax

Thanks. I am trying to use the "cut" command. However, if I use: cut -d '-' -f1 your_file.fasta > new_file.fasta, it also cuts my sequence lines at the "-". Is there an option to make cut apply only to the FASTA headers?

written 19 months ago by fec2
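
As far as I know, cut has no option to restrict itself to certain lines; it splits every line on the delimiter. One possible workaround, sketched here on the assumption that awk is available, is to apply the substitution only to lines beginning with ">" and print every other line unchanged. The function name trim_headers is made up for this illustration.

```shell
# trim_headers is a hypothetical helper name, not part of any tool.
# Header lines (starting with ">") lose everything from the first "-" onward;
# sequence lines, including alignment gaps, are printed unchanged.
trim_headers() {
  awk '/^>/ { sub(/-.*/, "") } { print }' "$@"
}

# Example on one record from the question:
printf '%s\n' \
  '>Viridibacillus_arvi_DSM16317-AMD00_RS08865-16-acetoin_utilization_protein_AcuB <unknown description>' \
  'CTGCTC------AAA' | trim_headers
```

With a real file this would be called as: trim_headers your_file.fasta > new_file.fasta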

Apologies. Did not realize that you have - elsewhere in your sequences.

written 19 months ago by genomax
19 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

try

 sed '/^>/s/-.*//'  input.fa

"for the lines starting with '>', substitute everything from the first "-" onward with the empty string"
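
The sed command in this answer prints to standard output, so the input file itself is left untouched. A small sketch of how one might try it on a throwaway sample first and save the result to a new file follows. The names sample.fa and trimmed.fa are placeholders, and the hyphen is written without a backslash on the assumption that it is a literal character outside a bracket expression.

```shell
# Work in a scratch directory so no real data is touched.
workdir=$(mktemp -d)

# sample.fa and trimmed.fa are placeholder names.
cat > "$workdir/sample.fa" <<'EOF'
>Viridibacillus_arenosi_FSL_R5_0213-BK137_RS04360-22-CBS_domain-containing_protein <unknown description>
CTGCTC------AAA
EOF

# Only lines matching ^> are edited; the gap characters in the sequence stay.
sed '/^>/s/-.*//' "$workdir/sample.fa" > "$workdir/trimmed.fa"

cat "$workdir/trimmed.fa"
```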


Worked well! Thanks for that.

written 19 months ago by fec2
19 months ago, lakhujanivijay (India) wrote:

Using seqkit

seqkit replace -p '(^[^-]+).*' -r '${1}'  <your_fasta_file>

output

>Viridibacillus_arenosi_FSL_R5_0213
GCTAATGAAGTTATTGGCCTAGTGACAGAAAGGGATATAAAAAACGCGCTTCCTTCTTCC
CTGCTC------AAA
>Viridibacillus_arvi_DSM16317
GCGAATGAAGTTATTGGCCTAGTAAC
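
Whichever tool is used, two quick checks may be worth running afterwards: that the sequence lines are identical before and after, and that trimming did not collapse two records onto the same header. The sketch below assumes both files are plain FASTA and small enough to hold their sequence lines in a shell variable. check_trim and its arguments are hypothetical names.

```shell
# check_trim ORIGINAL TRIMMED  (hypothetical helper)
# Returns 0 only if the non-header lines match and no header is duplicated.
check_trim() {
  orig_file=$1
  trimmed_file=$2

  # Compare everything that is not a header line.
  seq_a=$(grep -v '^>' "$orig_file")
  seq_b=$(grep -v '^>' "$trimmed_file")
  if [ "$seq_a" != "$seq_b" ]; then
    echo "sequence lines differ" >&2
    return 1
  fi

  # List headers that occur more than once after trimming.
  dups=$(grep '^>' "$trimmed_file" | sort | uniq -d)
  if [ -n "$dups" ]; then
    echo "duplicate headers after trimming:" >&2
    echo "$dups" >&2
    return 1
  fi
  return 0
}
```

For example: check_trim my_file.fasta new_file.fasta && echo "headers trimmed, sequences unchanged"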