Question

How to edit the fasta headers in a multiline fasta file?

0

22 months ago

pinn ▴ 210

Hi,

I have thousands of sequences in a FASTA file. I'd like to delete the underscore and number (_1, _2, _34297, ...) at the end of each FASTA header.

Original file

>XP_034398789.1_1
>XP_034398430.1_2
....
....
....
>XP_034381508.1_34297
>XP_034419373.1_34330
>XP_034419129.1_34363
>XP_034385161.1_38667

Expected output

>XP_034398789.1
>XP_034398430.1
....
....
....
>XP_034381508.1
>XP_034419373.1
>XP_034419129.1

I tried cut on sample data, but it deletes the ">XP_" prefix. What would the cut command be for deleting the trailing underscore and number from a header such as >XP_034398789.1_1?

 cut -f2 -d'_' TEXT.fa.fa | sed '15~20s/^/>/'

 034419421.1
 034380977.1
 034381532.1

cut -d_ -f1,2 TEXT.fa.fa

    >XP_034398789.1
    >XP_034398430.1
    ....
    ....
    ....
    >XP_034381508.1
    >XP_034419373.1
    >XP_034419129.1

gene genome protein • 692 views

ADD COMMENT • link updated 22 months ago by cpad0112 21k • written 22 months ago by pinn ▴ 210

2

There are plenty of fasta-header-editing posts on the forum (I'm sure you would have seen a few in the years you've been here), and "delete everything after second underscore" will produce a ton of Google results. Did you try searching anywhere before creating a new post?

ADD REPLY • link 22 months ago by Ram 43k

0

$ awk -F "_" '/^>/{print $1"_"$2};!/>/' test.fa
$ sed -r '/^>/ s/_\w+//2' test.fa
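
Both commands above depend on the header having exactly two underscores. If that might not hold, a variant anchored to the end of the line may be safer. The following is only a minimal sketch, assuming the suffix is always an underscore followed by digits at the very end of the header line (no description after the ID). The file names sample.fa and cleaned.fa and the sequences are made up for illustration.

```shell
# Build a small multi-line FASTA sample (hypothetical data modelled on the question).
cat > sample.fa <<'EOF'
>XP_034398789.1_1
MKTAYIAKQR
QISFVKSHFS
>XP_034381508.1_34297
MSDNELLKQA
EOF

# On header lines only (/^>/), delete a trailing "_<digits>".
# Sequence lines never start with ">", so they pass through unchanged.
sed -E '/^>/ s/_[0-9]+$//' sample.fa > cleaned.fa

cat cleaned.fa
```

Because the pattern is anchored with `$`, it does not matter how many underscores appear earlier in the header; only the final `_<digits>` is removed.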

ADD REPLY • link 22 months ago by cpad0112 21k

score 2 · Answer 1 · 2022-06-22

2

22 months ago

rpolicastro 13k

A seqkit answer for posterity.

seqkit replace -p "_\d+$" file.fasta

ADD COMMENT • link 22 months ago by rpolicastro 13k