Question: Removing everything after the last underscore in the header of a FASTA file
1
gravatar for timmers
7 months ago by
timmers10
timmers10 wrote:

Aloha,

I am having trouble figuring out how to remove everything after the last '_' in the sequence headers of a fasta file.

I would like the following series of headers

>ART01B_100_M7_ID100005_1
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325_189
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005_46
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT

to look like this:

>ART01B_100_M7_ID100005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT

Because the number of '_' characters varies between sequence headers, a command like this, for example, won't work:

cat Alt_MACSE_Output.fasta | awk -F _ '/^>/ { print $1"_"$2"_"$3"_"$4 } /^[A-Z]/ {print $1}' > Alt.fasta

Can anyone help me please?


Hello and welcome to Biostars, timmers.

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

Reply written 7 months ago by finswimmer
finswimmer wrote (7 months ago):

You can do this with sed:

$ sed 's/_[^_]*$//' input.fa > output.fa

fin swimmer
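
To see why this works: `_[^_]*$` matches an underscore followed by any run of non-underscore characters up to the end of the line, so only the last underscore-delimited chunk is removed. Sequence lines contain no underscore, so they pass through unchanged. Below is a minimal sketch on two headers from the question, with the sequences shortened for brevity. The `/^>/` address is an addition that is not in the command above; it is only a precaution that restricts the substitution to header lines.

```shell
# Sample input: two records from the question (sequences shortened).
fasta='>ART01B_100_M7_ID100005_1
TAAGAGGAGG
>KIN05B_100_M7_ALT_ID230005_46
TTAGGGGCTG'

# /^>/ limits the substitution to header lines (an added precaution);
# _[^_]*$ matches the last underscore and everything after it.
trimmed=$(printf '%s\n' "$fasta" | sed '/^>/ s/_[^_]*$//')
printf '%s\n' "$trimmed"
```

Both headers lose only their final suffix, regardless of how many underscores come before it.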

5heikki wrote (7 months ago):

awk 'BEGIN{OFS=FS="_"}{if(/^>/){NF--}}{print $0}' in.fa > out.fa

Nice and simple. Still, I think it would help the OP most if you explained in a sentence or two what the command is doing, so that it can be generalized to upcoming problems.

Reply written 7 months ago by ATpoint

Sure,

BEGIN{OFS=FS="_"} Before any input is read, underscore is set as both the field separator (FS) and the output field separator (OFS). If OFS were omitted, it would default to a space and we would get >ART01B 100 M7 ID100005 instead of >ART01B_100_M7_ID100005.

{if(/^>/){NF--}} If a line begins with ">" (i.e. it is a header line in FASTA format), the number of fields (NF) is reduced by one, which drops the last field. NF is the number of fields and $NF is the value of the last field; the value of the second-to-last field would be $(NF-1). Thus NF-- reduces the number of fields by one (it could also be written NF-=1). By contrast, $NF-- would reduce the value in the last field by one, e.g. 3 would become 2. A non-numeric value would become -1, because awk treats a non-numeric string as 0 in arithmetic.

{print $0} Print the whole line. You could replace this block with just "1", but IMO it is clearer to always write it out like this.

Reply written 7 months ago by 5heikki

cpad0112 wrote (7 months ago):

Another sed solution:

$ sed '/^>/ s/\(.*\)_.*$/\1/' test.fa

>ART01B_100_M7_ID100005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
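
This substitution relies on `.*` being greedy: the captured group takes as much of the line as it can, so the `_` that follows it lines up with the last underscore. A small sketch of that behaviour follows; the second command is a contrast case added for illustration and is not part of the answer.

```shell
header='>KIN05B_100_M7_ALT_ID230005_46'

# Greedy \(.*\) gives back only as much as needed, so the "_" in the
# pattern matches the LAST underscore in the line.
greedy=$(printf '%s\n' "$header" | sed 's/\(.*\)_.*$/\1/')

# Contrast: \([^_]*\) cannot cross an underscore, so the "_" in the
# pattern matches the FIRST underscore instead.
first_only=$(printf '%s\n' "$header" | sed 's/^\([^_]*\)_.*$/\1/')

printf '%s\n' "$greedy" "$first_only"
```

The greedy version keeps everything up to ID230005, while the contrast version keeps only >KIN05B.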

Amitm wrote (7 months ago):

Hi, this should work:

awk -F '_' '{if (/^>/) print $1"_"$2"_"$3"_"$4; else print $0;}' test.fasta

Sorry, I think this does not remove everything from the last '_' onwards; it just prints the first four '_'-separated values.
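
The difference shows up on headers that carry an extra field, such as the KIN05B one from the question. A quick sketch, using two headers only and an illustrative variable name:

```shell
# Two headers from the question: the first has five fields, the second six.
result=$(printf '%s\n' '>ART01B_100_M7_ID100005_1' '>KIN05B_100_M7_ALT_ID230005_46' |
  awk -F '_' '{if (/^>/) print $1"_"$2"_"$3"_"$4; else print $0}')
printf '%s\n' "$result"
```

The first header comes out as wanted, but the second is cut to >KIN05B_100_M7_ALT and loses its ID field, which is why the commands that strip from the last underscore are the better fit here.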

Reply written 7 months ago by Amitm