Question

Remove text and keep '> + ID' in fasta file

1

7.4 years ago

Harumi ▴ 20

Hello,

I have multiple FASTA sequences with headers like this:

>2p__scaffold_2__5799__6580__-__778568__0.00__0.00
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>2p__scaffold_2__5799__6580__+__778569__0.00__0.00
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
>1p__scaffold_2__11235__11438__-__830827__0.00__0.00
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
>1p__scaffold_2__33129__34129__+__811706__0.00__0.00
GCTGGCGACGGATCTA

And I want to keep just the ">" plus the ID (the number after __+__ or __-__ and before __0.00__0.00).

So I expect an output like this:

>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC

I searched for it and tried this:

sed 's@.*__-__@@' input.fa > output.fa

That removed __-__ and everything before it, including the ">" that I wanted to keep.

I also tried this to remove everything between ">" and __-__

sed -e 's/\>//' -e 's/\__-__.*//' input.fa > output.fa

But this removed everything after __-__

And this, which removed only the trailing __0.00__0.00:

sed 's/__0.00.*$//' input.fa > output.fa

Thank you for your help.

fasta sed • 3.0k views

ADD COMMENT • link updated 7.4 years ago by Alex Reynolds 36k • written 7.4 years ago by Harumi ▴ 20

1

Now THIS is how you write a "please help me with fasta headers" question!

ADD REPLY • link 7.4 years ago by Joe 22k

2

6.8 years ago

cpad0112 21k

a little late to the party:

$ sed '/>/ s/^.*__\(\w\+\)__.*/>\1/g' file.fa

or

$ sed '/>/ s/^\(\W\).*__\(\w\+\)__.*/\1\2/g' file.fa

Both commands give:

>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
>830827
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
>811706
GCTGGCGACGGATCTA
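
The `\w`, `\W` and `\+` used above are GNU sed extensions. A minimal sketch of the same idea with extended regular expressions, assuming a sed that accepts `-E` (GNU and BSD sed both do) and anchoring on the strand sign rather than on any word between double underscores; `sample.fa` and `renamed.fa` are placeholder file names:

```shell
# Sample input: two headers copied from the question, with shortened sequences.
cat > sample.fa <<'EOF'
>2p__scaffold_2__5799__6580__-__778568__0.00__0.00
GCTGGCGACGGATCTAGG
>2p__scaffold_2__5799__6580__+__778569__0.00__0.00
GCTGGCGACGG
EOF

# On header lines only, keep the digits that sit between "__+__" or "__-__"
# and the next "__", and put ">" back in front of them.
sed -E '/^>/ s/^.*__[+-]__([0-9]+)__.*$/>\1/' sample.fa > renamed.fa

cat renamed.fa
```

Headers that do not contain the `__+__` or `__-__` marker are left unchanged, which makes unexpected header formats easy to spot in the output.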

ADD COMMENT • link 6.8 years ago by cpad0112 21k

0

"a little late to the party"

Always very welcome, though.

ADD REPLY • link 6.8 years ago by Kevin Blighe 89k

score 4 · Accepted Answer · 2018-03-08

4

7.4 years ago

Alex Reynolds 36k

Try the following awk command. The three-argument form of match() used here requires GNU awk (gawk):

$ awk '{ if ($0 ~ /^>/) { match($1, /[+-]__[0-9]+__/, m); print ">"substr(m[0], 4, length(m[0]) - 5); } else { print $0; } }' input.fa > output.fa

Then:

$ less output.fa
>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
>830827
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
>811706
GCTGGCGACGGATCTA
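
For awk implementations other than GNU awk, the same extraction can be sketched with the POSIX two-argument match(), which sets RSTART and RLENGTH instead of filling an array. This is an illustration under the assumption that every ID is purely numeric; `sample.fa` and `renamed_awk.fa` are placeholder file names:

```shell
# Sample input: two headers copied from the question, with shortened sequences.
cat > sample.fa <<'EOF'
>2p__scaffold_2__5799__6580__-__778568__0.00__0.00
GCTGGCGACGGATCTAGG
>1p__scaffold_2__33129__34129__+__811706__0.00__0.00
GCTGGCGACGG
EOF

awk '
/^>/ {
    # The match covers "__", the strand sign, "__", the ID and a trailing "__".
    if (match($0, /__[+-]__[0-9]+__/)) {
        # Skip the 5 leading characters and drop the 2 trailing underscores.
        print ">" substr($0, RSTART + 5, RLENGTH - 7)
    } else {
        # Headers that do not fit the pattern are printed unchanged.
        print
    }
    next
}
{ print }   # sequence lines pass through untouched
' sample.fa > renamed_awk.fa

cat renamed_awk.fa
```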

ADD COMMENT • link 7.4 years ago by Alex Reynolds 36k

0

It worked! Thank you for your help!

ADD REPLY • link 7.4 years ago by Harumi ▴ 20

1

You're quite welcome!

ADD REPLY • link 7.4 years ago by Alex Reynolds 36k

score 3 · Accepted Answer · 2018-03-08

3

7.4 years ago

cschu181 ★ 2.8k

This should work if your headers follow the pattern that you specified:

sed 's/[>_]\+/_/g' yourfile.fasta | cut -f 8 -d _ | sed 's/^\([0-9]\)/>\1/'
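
Picking field 8 works because the scaffold name in these headers contains exactly one underscore; a name with more or fewer underscores would shift the field numbers. A sketch that counts fields from the end instead, assuming every header ends in `__ID__0.00__0.00` as in the question. The third header below is a made-up example, and `sample.fa` and `renamed_fields.fa` are placeholder file names:

```shell
# Sample input: the first two headers are from the question; the third is a
# hypothetical header whose scaffold name has a different number of underscores.
cat > sample.fa <<'EOF'
>2p__scaffold_2__5799__6580__-__778568__0.00__0.00
GCTGGCGACGGATCTAGG
>2p__scaffold_2__5799__6580__+__778569__0.00__0.00
GCTGGCGACGG
>1p__contig_A_7__10__20__+__123456__0.00__0.00
GCTGG
EOF

# Split header lines on the double underscore; the ID is the third field
# counted from the end. Sequence lines are printed as they are.
awk -F'__' '/^>/ { print ">" $(NF - 2); next } { print }' sample.fa > renamed_fields.fa

cat renamed_fields.fa
```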

ADD COMMENT • link 7.4 years ago by cschu181 ★ 2.8k

0

It worked! Thank you for your help!

ADD REPLY • link 7.4 years ago by Harumi ▴ 20

score 3 · Accepted Answer · 2018-03-08

3

7.4 years ago

swbarnes2 15k

test=">2p__scaffold_2__5799__6580__-__778568__0.00__0.00"
echo "$test" | sed 's@.*__[\+-]__@>@' | sed 's@__.*@@'

Some might find:

 sed 's/.*__[\+-]__/>/' | sed 's/__.*//'

to be a bit more readable.
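
The two substitutions can also go into a single sed call. A minimal sketch, restricted to header lines so that sequence lines are never touched; `sample.fa` and `renamed_onecall.fa` are placeholder file names:

```shell
# Sample input: two headers copied from the question, with shortened sequences.
cat > sample.fa <<'EOF'
>2p__scaffold_2__5799__6580__-__778568__0.00__0.00
GCTGGCGACGGATCTAGG
>2p__scaffold_2__5799__6580__+__778569__0.00__0.00
GCTGGCGACGG
EOF

# First expression: replace everything up to and including the strand marker with ">".
# Second expression: drop everything from the next "__" to the end of the line.
sed -e '/^>/ s/.*__[+-]__/>/' -e '/^>/ s/__.*//' sample.fa > renamed_onecall.fa

cat renamed_onecall.fa
```

Inside a bracket expression `+` needs no backslash, so `[+-]` is enough to match either strand sign.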

ADD COMMENT • link 7.4 years ago by swbarnes2 15k

1

you will miss out on the __+__ cases with this one ;)

ADD REPLY • link 7.4 years ago by lieven.sterck 15k

2

Ah, didn't see that at first, edited my answer to fit that requirement

ADD REPLY • link 7.4 years ago by swbarnes2 15k

0

Thank you for your help!

I tried:

 sed 's/.*__[\+-]__/>/' | sed 's/__.*//' input.fa > output.fa

But it took a long time processing so I canceled.

When I tried:

sed 's/.*__[\+-]__/>/' input.fa > output.fa | sed 's/__.*//' output.fa > output2.fa

The output was empty.

Why does this happen?

Thank you again!!

ADD REPLY • link 7.4 years ago by Harumi ▴ 20

1

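You need to give the input file to the first sed, before the '|' symbol, like this:

sed 's/.*__[\+-]__/>/' input.fa | sed 's/__.*//' > output.fa

Otherwise the first sed has no file to read and just waits for input from the terminal, which is why it seemed to take so long.

The second command cannot work as written. The '> output.fa' redirect sends the first sed's output into the file, so nothing is left to travel through the pipe. Both sides of a pipe also start at the same time, so the second sed most likely reads output.fa before anything has been written to it, which gives the empty file you mention.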
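
The two working ways of chaining the commands can be sketched as follows: either one pipeline with a single redirect at the end, or two separate steps joined with `&&` so the second only starts once the first has finished. All file names here are placeholders:

```shell
# Sample input: two headers copied from the question, with shortened sequences.
cat > sample.fa <<'EOF'
>2p__scaffold_2__5799__6580__-__778568__0.00__0.00
GCTGGCGACGGATCTAGG
>2p__scaffold_2__5799__6580__+__778569__0.00__0.00
GCTGGCGACGG
EOF

# Option 1: one pipeline. The first sed reads the file, the second reads the
# pipe, and only the last command redirects to a file.
sed 's/.*__[+-]__/>/' sample.fa | sed 's/__.*//' > piped.fa

# Option 2: two steps with an intermediate file. "&&" runs the second sed
# only after the first one has exited successfully.
sed 's/.*__[+-]__/>/' sample.fa > step1.fa && sed 's/__.*//' step1.fa > twostep.fa

# Both routes should produce the same result.
cmp piped.fa twostep.fa && echo "identical"
```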

ADD REPLY • link 7.4 years ago by lieven.sterck 15k

0

It worked! Thank you very much for your helpful explanation! I am still a beginner in bioinformatics.

ADD REPLY • link 7.4 years ago by Harumi ▴ 20