Editing header by adding pipe in fasta file
6.2 years ago

I want to edit the headers in my FASTA file by adding pipes, but I have been unable to do so. The headers look like this:

>XP_002436309.2 NAC domain-containing protein 69 isoform X1 [Sorghum bicolor]
MPSTSISSASAAGKGGSKAMQPPPQLPAALPVGFRFRPTDEELVRHYLKPKIAGHAHADLLLIPDVDLSACEPWELPAKA

>XP_002436310.1 plastocyanin, chloroplastic [Sorghum bicolor]
MASLSSATITAPSAFAAPAARAVARRSSFTVRASLGKAAGTAAVAVAASALLAGGAMAQEVLLGANGGVLVFEPSEFTVK

to

>sp|XP_002436309.2| NAC domain-containing protein 69 isoform X1 [Sorghum bicolor]
MPSTSISSASAAGKGGSKAMQPPPQLPAALPVGFRFRPTDEELVRHYLKPKIAGHAHADLLLIPDVDLSACEPWELPAKA

>sp|XP_002436310.1| plastocyanin, chloroplastic [Sorghum bicolor]
MASLSSATITAPSAFAAPAARAVARRSSFTVRASLGKAAGTAAVAVAASALLAGGAMAQEVLLGANGGVLVFEPSEFTVK

I am able to add sp| using Notepad++, but I cannot add the pipe after the accession number (KX035646.1).

Thank you for the help!

Header FASTA Grep Editing • 2.4k views

We need a bit more information really.

Is it just one fasta header? Do you need sp in front of all of them? Is the accession number always the same?


Yes, this is just one example header; the whole file has more than 150,000 sequences. Every header should have "sp", then a pipe, then the accession, then another pipe. The accession number is different for every sequence.


See if this does it:

sed 's/^>/\>sp|/g' your_file > new_file

Edit: Looks like you need another | after the accession. You should search biostars for leads. This is one of the most frequently asked questions here.

sed -e 's/^>/\>sp|/g' -e 's/\ Name/\|\ Name/g' your_file > new_file


This didn't work...

I got the same output as input...

>XP_002436309.2 NAC domain-containing protein 69 isoform X1 [Sorghum bicolor]
MPSTSISSASAAGKGGSKAMQPPPQLPAALPVGFRFRPTDEELVRHYLKPKIAGHAHADLLLIPDVDLSACEPWELPAKALIRSGDPEWFFFAPLDRKYPGGHRSNRSTAAGYWKATGKDRLIRSRRAGTLIGVKKTLVFHRGRAPRGHRTAWIMHEYRT
>XP_002436310.1 plastocyanin, chloroplastic [Sorghum bicolor]
MASLSSATITAPSAFAAPAARAVARRSSFTVRASLGKAAGTAAVAVAASALLAGGAMAQEVLLGANGGVLVFEPSEFTVKAGDTITFKNNAGYPHNVVFDEDEVPSGVDATKISQEEYLNAPGETYSVTLTVPGTYGFYCEPHQGAGMVGKVTVN

It did not work because the example header you originally posted contained Name:. If the headers do not all follow one pattern, a representative example should be added to the original post.
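
To illustrate the point about Name: a substitution anchored on the literal text " Name" only changes headers that actually contain it. The sketch below uses a simplified form of the second sed expression from the earlier command. The header >ACC0001.1 and the file names are hypothetical, made up for the demonstration.

```shell
# Hypothetical demo input: the first header contains " Name", the second does not.
printf '%s\n' \
  '>ACC0001.1 Name: example protein' \
  '>XP_002436309.2 NAC domain-containing protein 69 isoform X1 [Sorghum bicolor]' \
  > name_demo.fasta

# Insert a pipe before the literal text " Name".
# Headers without that text pass through unchanged.
sed -e 's/ Name/| Name/' name_demo.fasta > name_demo_out.fasta

cat name_demo_out.fasta
```

Only the first header gains a pipe, so on a file whose headers never contain " Name" this expression changes nothing.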


I tried searching, but my keywords did not turn up a match. Sorry about that. I have now updated the example in the original post.

And this still does not generate the second | in the output:

>sp|XP_002436309.2 NAC domain-containing protein 69 isoform X1 [Sorghum bicolor]
MPSTSISSASAAGKGGSKAMQPPPQLPAALPVGFRFRPTDEELVRHYLKPKIAGHAHADLLLIPDVDLSACEPWELPAKA
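
One way to get both pipes without depending on the description text is to key the substitution on the accession itself, that is, everything between > and the first whitespace character. This is a sketch, not necessarily the approach in any linked answer. It assumes the accession never contains whitespace. The file names demo_in.fasta and demo_out.fasta are placeholders, and the sequences are shortened for the demo.

```shell
# Placeholder file names; substitute your own input and output paths.
in=demo_in.fasta
out=demo_out.fasta

# Small demo input built from the two headers in the question (sequences shortened).
printf '%s\n' \
  '>XP_002436309.2 NAC domain-containing protein 69 isoform X1 [Sorghum bicolor]' \
  'MPSTSISSASAAGKGG' \
  '>XP_002436310.1 plastocyanin, chloroplastic [Sorghum bicolor]' \
  'MASLSSATITAPSAFA' > "$in"

# On lines starting with ">", capture the run of non-whitespace characters
# after ">" (the accession) and rewrite it as ">sp|ACCESSION|".
# Sequence lines do not start with ">" and are left untouched.
sed 's/^>\([^[:space:]]*\)/>sp|\1|/' "$in" > "$out"

# Sanity check: total header count, then count of headers in the new form.
# The two numbers should be equal.
grep -c '^>' "$out"
grep -c '^>sp|[^|[:space:]]*|' "$out"
```

Because the pattern is anchored at the start of the line and matches once per header, no g flag is needed, and the same command should scale to a file with 150,000 sequences.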
6.2 years ago
GenoMax 141k

@Pierre's answer works for this: A: modify header of sequencs in fasta file


I would suggest not closing the post for such reasons. The best course of action is to post the link to the duplicate as an answer and leave it at that. There is little to be gained from closing the post.

6.2 years ago
Hugo ▴ 380

Dear Muhammad, I suggest you try the "Rename header" option of SEDA (http://www.sing-group.org/seda/). Section 3.8.4 "Add prefix/suffix" of the manual explains how to achieve what you want: first add a word (prefix "sp|") before the header id, and then add a word (suffix "|") after the header id. Do not hesitate to contact me if you need some help.

Regards,

Hugo.
