Question: Removing everything after the last underscore in the header of a fastafile
timmers wrote (16 months ago):

Aloha,

I am having trouble figuring out how to remove everything after the last '_' in the sequence headers of a fasta file.

I would like this following series of headers

>ART01B_100_M7_ID100005_1
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325_189
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005_46
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT

to look like this:

>ART01B_100_M7_ID100005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT

Because the number of '_' in the sequence headers varies, a command that prints a fixed number of fields, such as this one, won't work:

cat Alt_MACSE_Output.fasta | awk -F _ '/^>/ { print $1"_"$2"_"$3"_"$4 } /^[A-Z]/ {print $1}' > Alt.fasta

Can anyone help me please?


Hello timmers, and welcome to Biostars,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

(comment by finswimmer, 16 months ago)

finswimmer wrote (16 months ago):

You can do this with sed:

$ sed 's/_[^_]*$//' input.fa > output.fa

fin swimmer
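
The pattern `_[^_]*$` matches an underscore followed by any run of non-underscore characters up to the end of the line, so only the last underscore-delimited field is removed. Sequence lines contain no underscore and pass through unchanged. As a minimal sketch (the function name `strip_last_field` and the shortened sample records are made up for illustration), the same substitution can be limited to header lines with a `/^>/` address, which is a slightly more defensive variant of the command above:

```shell
# Hypothetical helper: trim the last "_"-delimited field from FASTA headers.
# Reads FASTA on stdin and writes the result to stdout.
strip_last_field() {
    # /^>/      apply the substitution only to header lines
    # _[^_]*$   the last underscore plus everything after it
    sed '/^>/ s/_[^_]*$//'
}

# Shortened two-record example based on the headers in the question.
printf '>ART01B_100_M7_ID100005_1\nTAAGAGG\n>KIN05B_100_M7_ALT_ID230005_46\nTTAGGGG\n' | strip_last_field
```

For a real file this would be called as `strip_last_field < input.fa > output.fa`.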

5heikki wrote (16 months ago):
awk 'BEGIN{OFS=FS="_"}{if(/^>/){NF--}}{print $0}' in.fa > out.fa

Nice and simple. Still, I think it would help OP most if you explained in a sentence or two what the command is doing, so that it can be generalized to upcoming problems.

(comment by ATpoint, 16 months ago)

Sure,

BEGIN{OFS=FS="_"} Before any input is read, the underscore is set as both the field separator (FS) and the output field separator (OFS). If OFS were omitted it would default to a space, and we would get >ART01B 100 M7 ID100005 instead of >ART01B_100_M7_ID100005.

{if(/^>/){NF--}} If a line begins with ">" (i.e. it is a header line in fasta format), the number of fields (NF) is reduced by one, which drops the last field. NF is the number of fields and $NF is the value of the last field; the value of the second-to-last field is $(NF-1). NF-- could also be written NF-=1. By contrast, $NF-- would reduce the value in the last field by one, e.g. 3 would become 2, and a non-numeric value would become -1 because awk treats a non-numeric string as 0 in arithmetic.

{print $0} Print the whole line. You could replace this with just "1", but IMO it is clearer to always write it out like this.

(reply by 5heikki, 16 months ago)
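
One caveat on the NF-- approach: the gawk manual notes that decrementing NF rebuilds $0 in gawk, but cautions that some other awk versions do not rebuild it. A sketch that sidesteps this by editing the line with sub() instead of changing NF might look like the following (the function name `trim_header_awk` and the shortened sample record are assumptions for illustration, not part of the original answer):

```shell
# Hypothetical helper: same result as the NF-- one-liner, without touching NF.
# Reads FASTA on stdin and writes the result to stdout.
trim_header_awk() {
    awk '
        /^>/ { sub(/_[^_]*$/, "") }   # header line: drop the last "_" and what follows
        { print }                     # print every line, modified or not
    '
}

# Shortened example record based on the question.
printf '>PAG05A_100_M7_ID102325_189\nTAAGAGG\n' | trim_header_awk
```

With this variant a header that contains no underscore at all is left unchanged, whereas NF-- in gawk would reduce such a header to an empty line.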
cpad0112 wrote (16 months ago):

Another sed solution:

$ sed '/>/ s/\(.*\)_.*$/\1/g' test.fa

>ART01B_100_M7_ID100005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>PAG05A_100_M7_ID102325
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
>KIN05B_100_M7_ALT_ID230005
TAAGAGGAGGAATTTTTCATAGAGGATTGTCTGTAGACTTAGTAATTTTTTCTCTTCATTTAGCTGGAATTTCTTCTCTT
TTAGGGGCTGTAAATTTTATTACTACAATTCTTAATTGTCGATCTTTAGGGGTTTGGTGAGATGAATTGCCCTTATTTGT
Amitm wrote (16 months ago):

Hi, this should work:

awk -F '_' '{if (/^>/) print $1"_"$2"_"$3"_"$4; else print $0;}' test.fasta

Sorry, I think this does not remove everything from the last '_' onwards; it just prints the first four '_'-separated fields.

(reply by Amitm, 16 months ago)
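
To make the difference concrete, here is a small sketch that runs both approaches on the one header from the question that has an extra `_ALT` field. The variable names are made up for illustration:

```shell
# Header from the question with six "_"-separated fields instead of five.
header='>KIN05B_100_M7_ALT_ID230005_46'

# Fixed first four fields: the ID is lost because "_ALT" shifts the fields.
first_four=$(printf '%s\n' "$header" | awk -F '_' '{ print $1"_"$2"_"$3"_"$4 }')

# Strip from the last underscore: independent of the number of fields.
last_removed=$(printf '%s\n' "$header" | sed 's/_[^_]*$//')

printf 'first four fields:  %s\n' "$first_four"
printf 'last field removed: %s\n' "$last_removed"
```

The first result is `>KIN05B_100_M7_ALT`, which is missing the ID, while the second is `>KIN05B_100_M7_ALT_ID230005`, the header the question asks for.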