Removing everything after the last underscore in the header of a FASTA file
4
1
5.6 years ago
timmers ▴ 30

Aloha,

I am having trouble figuring out how to remove everything after the last '_' in the sequence headers of a fasta file.

I would like the following series of headers

>ART01B_100_M7_ID100005_1
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325_189
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005_46
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT

to look like this:

>ART01B_100_M7_ID100005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT

Because the number of '_' in the sequence headers varies, this command, for example, won't work:

cat Alt_MACSE_Output.fasta | awk -F _ '/^>/ { print $1"_"$2"_"$3"_"$4 } /^[A-Z]/ {print $1}' > Alt.fasta

Can anyone help me please?

sequence • 5.8k views
0

Hello and welcome to Biostars, timmers,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

6
5.6 years ago

You can do this with sed

$ sed 's/_[^_]*$//' input.fa > output.fa

fin swimmer
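
To spell out what the pattern does: `_[^_]*$` matches an underscore followed by any run of non-underscore characters up to the end of the line, so only the last underscore-delimited field is removed. Below is a minimal sketch of the same idea, restricted to header lines with a `/^>/` address. That address and the function name `strip_last_field` are additions for illustration, not part of the command above.

```shell
# strip_last_field is a hypothetical helper name, used for illustration only.
# Reads FASTA on stdin. On header lines (starting with ">") it removes the
# last underscore and everything after it; other lines pass through unchanged.
strip_last_field() {
    sed '/^>/ s/_[^_]*$//'
}

printf '>KIN05B_100_M7_ALT_ID230005_46\nTAAGAGG\n' | strip_last_field
# >KIN05B_100_M7_ALT_ID230005
# TAAGAGG
```

Without the address the substitution runs on every line, which is harmless here because nucleotide lines contain no underscores.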

4
5.6 years ago
5heikki 11k
awk 'BEGIN{OFS=FS="_"}{if(/^>/){NF--}}{print $0}' in.fa > out.fa
0

Nice and simple. Still, I think it would help the OP most if you explained in a sentence or two what the command is doing, so that it can be generalized to upcoming problems.

2

Sure,

BEGIN{OFS=FS="_"} Before anything is read, underscore is set as field separator (FS) and output field separator (OFS). If OFS was omitted it would default to space and we would have >ART01B 100 M7 ID100005 instead of >ART01B_100_M7_ID100005

{if(/^>/){NF--}} If a line begins with ">" (i.e. it's a header line in FASTA format), the number of fields (NF) is reduced by one. NF = number of fields, $NF = the value in the last field. The value in the second-to-last field would be referred to as $(NF-1). Thus NF-- reduces the number of fields by one (it could also be NF-=1). $NF-- would reduce the value in the last field by one, e.g. 3 would become 2. A non-numeric value would become -1, because awk treats a non-numeric string as 0 in arithmetic.

{print $0} Print the whole line. You could replace this with just "1", but IMO it's clearer to always write it like this.

2
5.6 years ago

Another sed solution:

$ sed '/>/ s/\(.*\)_.*$/\1/g' test.fa

>ART01B_100_M7_ID100005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
0
5.6 years ago
Amitm ★ 2.2k

Hi, this should work:

awk -F '_' '{if (/^>/) print $1"_"$2"_"$3"_"$4; else print $0;}' test.fasta
0

Sorry, I think this does not remove everything from the last '_' onwards; it just prints the first four '_'-separated fields.
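
The difference shows up on the one header from the question that has an extra field. A minimal sketch, with hypothetical helper names; the `sub()` call is just one way to strip the last field in awk and is swapped in here for comparison, it is not the command from the answer above.

```shell
# Hypothetical helper names, for illustration only.

# Fixed field count: always prints the first four "_"-separated fields.
first_four_fields() {
    awk -F '_' '{print $1"_"$2"_"$3"_"$4}'
}

# Removes only the last "_"-separated field, whatever the field count.
drop_after_last_underscore() {
    awk '{sub(/_[^_]*$/, ""); print}'
}

header='>KIN05B_100_M7_ALT_ID230005_46'
echo "$header" | first_four_fields           # >KIN05B_100_M7_ALT
echo "$header" | drop_after_last_underscore  # >KIN05B_100_M7_ALT_ID230005
```

With the fixed field count the ID230005 part is lost along with the trailing number, which is not what the question asks for.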

