Question

How to limit fasta header to 40 characters?

0

Entering edit mode

7 months ago

fishingline37 • 0

I have FASTA headers with long annotation names, but the program it will be run through for proteomics has a limit of roughly 40 characters or else it will crash.

The file starts off like this:

>TRINITY_DN0_c0_g1_i1.p1 - RecName: Full=E3 ubiquitin-protein ligase CIP8; AltName: Full=COP1-interacting protein 8; AltName: Full=RING-type E3 ubiquitin transferase CIP8
SEQUENCE

>TRINITY_DN10003_c0_g1_i12.p1 - RecName: Full=Polycomb group protein FIE1; AltName: Full=Protein  FERTILIZATION-INDEPENDENT ENDOSPERM 2; Short=OsFIE2; AltName: Full=WD40 repeat-containing protein 153; Short=OsWD40-153
SEQUENCE

Ideally I want the FASTA to look like this:

>DN0_c0_g1_i1.p1 - E3 ubiquitin-protein 
SEQUENCE

>DN10003_c0_g1_i12.p1 - Polycomb group
SEQUENCE

I used sed and seqkit to cut out the repetitive parts

sed 's/>.*Y_/>/' proteome.fasta
seqkit replace -p " RecName: Full=" -r ' ' proteome.fasta > proteome2.fasta

The fasta looks like this now:

>DN0_c0_g1_i1.p1 - E3 ubiquitin-protein ligase CIP8; AltName: Full=COP1-interacting protein 8; AltName: Full=RING-type E3 ubiquitin transferase CIP8
SEQUENCE

>DN10003_c0_g1_i12.p1 - Polycomb group protein FIE1; AltName: Full=Protein  FERTILIZATION-INDEPENDENT ENDOSPERM 2; Short=OsFIE2; AltName: Full=WD40 repeat-containing protein 153; Short=OsWD40-153
SEQUENCE

What can I do to limit the header length? Can I do it with seqkit?

unix seqkit fasta • 1.9k views

ADD COMMENT • link updated 7 months ago by Roop • 0 • written 7 months ago by fishingline37 • 0

3

Entering edit mode

How about this (starting with your last example file)

$ awk '(">" ~ $0); {print substr($0,0,40)}' fasta_file
>DN0_c0_g1_i1.p1 - E3 ubiquitin-protein 
SEQUENCE
SEQUENCE2


>DN10003_c0_g1_i12.p1 - Polycomb group p
SEQUENCE

ADD REPLY • link 7 months ago by GenoMax 141k

0

Entering edit mode

OP also wants TRINITY_ removed so a sed might be required before the awk. However, given that OP has already figured out the sed, maybe use proteome.fasta instead of fasta_file in your code so OP knows not to replace their entire code with your awk.

ADD REPLY • link 7 months ago by Ram 43k

0

Entering edit mode

I used the last example that OP showed above. fasta_file is just a place holder for file name.

ADD REPLY • link 7 months ago by GenoMax 141k

0

Entering edit mode

Only keep the sequence identifiers:

seqkit seq -i proteome.fasta -o proteome2.fasta

Where,

-i, --only-id                   print IDs instead of full headers

Or

$ seqkit replace -p ' - RecName: Full=(.+?);.+' -r ' $1' proteome.fasta  
>TRINITY_DN0_c0_g1_i1.p1 E3 ubiquitin-protein ligase CIP8
SEQUENCE
>TRINITY_DN10003_c0_g1_i12.p1 Polycomb group protein FIE1
SEQUENCE

But the number of the characters may exceed the limit.

ADD REPLY • link 7 months ago by shenwei356 8.4k

0

Entering edit mode

This is a neat use of seqkit, I will definitely keep this in mind for future projects.

ADD REPLY • link 7 months ago by fishingline37 • 0

0

Entering edit mode

Perhaps you should also check that after trimming the names you don't get duplicate IDs. These two commands should give the same output (not checked):

grep '>' before_trimming.fasta | sort | uniq | wc -l
grep '>' after_trimming.fasta | sort | uniq | wc -l

ADD REPLY • link 7 months ago by dariober 14k

0

Entering edit mode

7 months ago

bk11 ★ 2.4k

Please check this, it will work-

sed 's/TRINITY_//g' proteome.fasta |sed 's/RecName: Full=//g' |awk '/^>/{print substr($0, 1, 39)} !/^>/{print}'

ADD COMMENT • link 7 months ago by bk11 ★ 2.4k

score 2 · Accepted Answer · 2023-09-05

2

Entering edit mode

7 months ago

bk11 ★ 2.4k

You can do sth like this-

cat proteome.fasta
>TRINITY_DN0_c0_g1_i1.p1 - RecName: Full=E3 ubiquitin-protein ligase CIP8; AltName: Full=COP1-interacting protein 8; AltName: Full=RING-type E3 ubiquitin transferase CIP8
SEQUENCE

>TRINITY_DN10003_c0_g1_i12.p1 - RecName: Full=Polycomb group protein FIE1; AltName: Full=Protein  FERTILIZATION-INDEPENDENT ENDOSPERM 2; Short=OsFIE2; AltName: Full=WD40 repeat-containing protein 153; Short=OsWD40-153
SEQUENCE

cut -d ' ' -f1-5 proteome.fasta |sed 's/RecName: Full=//g' |sed 's/TRINITY_//g'
>DN0_c0_g1_i1.p1 - E3 ubiquitin-protein
SEQUENCE

>DN10003_c0_g1_i12.p1 - Polycomb group
SEQUENCE

ADD COMMENT • link 7 months ago by bk11 ★ 2.4k

0

Entering edit mode

This won't work if the non-blank-space separated words are too long. Given that the 40 char limit is the primary concern, the cut should be something like -c 1-40 and it should come after the sed operations.

ADD REPLY • link 7 months ago by Ram 43k

3

Entering edit mode

Please check this too if it works.

sed 's/TRINITY_//g' proteome.fasta |sed 's/RecName: Full=//g' |awk '/^>/{print substr($0, 1, 39)} !/^>/{print}'

ADD REPLY • link 7 months ago by bk11 ★ 2.4k

0

Entering edit mode

This is the most complete and concise answer. The two seds instead of using seqkit is great.

The awk for the character limit also works like a charm. If you make this the main post, I will accept it as the answer. Thanks for the nice solution!

ADD REPLY • link 7 months ago by fishingline37 • 0

0

Entering edit mode

Ok. I am making it as the main post. Please go ahead and accept it as answer. Thanks!

ADD REPLY • link 7 months ago by bk11 ★ 2.4k