Question: Remove text and keep '> + ID' in fasta file
Harumi (Brazil) wrote, 2.9 years ago:

Hello,

I have multiple fasta sequences that are like this:

 >2p__scaffold_2__5799__6580__-__778568__0.00__0.00
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
 >2p__scaffold_2__5799__6580__+__778569__0.00__0.00
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
 >1p__scaffold_2__11235__11438__-__830827__0.00__0.00
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
 GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
 ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
 >1p__scaffold_2__33129__34129__+__811706__0.00__0.00
 GCTGGCGACGGATCTA

And I want to keep just ">" plus the ID (the number after __+__ or __-__ and before __0.00__0.00).

So I expect an output like this:

>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC

I searched for it and tried this:

sed 's@.*__-__@@' input.fa > output.fa

That removed __-__ and everything before it, including the ">" that I wanted to keep.

I also tried this to remove everything between ">" and __-__

sed -e 's/\>//' -e 's/\__-__.*//' input.fa > output.fa

But this removed everything after __-__

And this, which removed only __0.00__0.00:

sed 's/__0.00.*$//' input.fa > output.fa

Thank you for your help.

Tags: sed, fasta

Now THIS is how you write a "please help me with fasta headers" question!

(Reply by Joe, 2.9 years ago)

Alex Reynolds (Seattle, WA USA) wrote, 2.9 years ago:

Try the following awk command (the three-argument form of match() requires GNU awk):

$ awk '{ if ($0 ~ /^>/) { match($1, /[+|-]__[0-9]+__/, m); print ">"substr(m[0], 4, length(m[0]) - 5); } else { print $0; } }' input.fa > output.fa

Then:

$ less output.fa
>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
>830827
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
>811706
GCTGGCGACGGATCTA
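
The array argument to match() in the command above is a GNU awk extension, so it fails under mawk or BSD awk. Also, inside a bracket expression the "|" is a literal character, so "[+|-]" matches "|" as well as "+" and "-" (harmless with these headers). The following is a minimal sketch of a variant that should run under any POSIX awk by using RSTART and RLENGTH instead. The scratch directory and the shortened sample records are assumptions made for the demo.

```shell
# Build a two-record sample in a scratch directory (headers from the question).
tmpdir=$(mktemp -d)
cat > "$tmpdir/input.fa" <<'EOF'
>2p__scaffold_2__5799__6580__-__778568__0.00__0.00
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>1p__scaffold_2__33129__34129__+__811706__0.00__0.00
GCTGGCGACGGATCTA
EOF

# POSIX awk: match() sets RSTART and RLENGTH, no array argument needed.
# The matched text looks like "-__778568__": skip 3 leading characters
# (strand sign plus "__") and drop the 2 trailing underscores.
awk '/^>/ {
    if (match($0, /[+-]__[0-9]+__/)) {
        print ">" substr($0, RSTART + 3, RLENGTH - 5)
    } else {
        print            # header without the expected pattern: leave as is
    }
    next
}
{ print }' "$tmpdir/input.fa" > "$tmpdir/output.fa"

cat "$tmpdir/output.fa"
```

Headers that do not contain the strand and ID pattern are passed through unchanged rather than being replaced by a bare ">".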

It worked! Thank you for your help!

(Reply by Harumi, 2.9 years ago)

You're quite welcome!

(Reply by Alex Reynolds, 2.9 years ago)

cschu181 wrote, 2.9 years ago:

This should work if your headers follow the pattern that you specified:

sed 's/[>_]\+/_/g' yourfile.fasta | cut -f 8 -d _ | sed 's/^\([0-9]\)/>\1/'
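
To see why field 8 is the ID, here is a stage-by-stage trace of that pipeline on one header from the question. This is only an illustration: the variable names are made up for the demo, and it assumes GNU sed, since "\+" in a basic regular expression is a GNU extension.

```shell
header='>2p__scaffold_2__5799__6580__-__778568__0.00__0.00'

# Stage 1: collapse every run of ">" and "_" into a single "_".
stage1=$(printf '%s\n' "$header" | sed 's/[>_]\+/_/g')
echo "$stage1"   # _2p_scaffold_2_5799_6580_-_778568_0.00_0.00

# Stage 2: split on "_" and keep field 8 (field 1 is the empty string
# before the leading "_"). Lines without "_" (the sequences) pass through
# cut unchanged, because -s is not given.
stage2=$(printf '%s\n' "$stage1" | cut -f 8 -d _)
echo "$stage2"   # 778568

# Stage 3: put ">" back in front of lines that start with a digit.
stage3=$(printf '%s\n' "$stage2" | sed 's/^\([0-9]\)/>\1/')
echo "$stage3"   # >778568
```

Because the ID is picked by position, a scaffold name with a different number of underscores would shift it to another field, which is why the headers have to follow the pattern exactly.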

It worked! Thank you for your help!

(Reply by Harumi, 2.9 years ago)

swbarnes2 (United States) wrote, 2.9 years ago:

test=">2p__scaffold_2__5799__6580__-__778568__0.00__0.00"
echo "$test" | sed 's@.*__[\+-]__@>@' | sed 's@__.*@@'

Some might find:

 sed 's/.*__[\+-]__/>/' | sed 's/__.*//'

to be a bit more readable.
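
The two substitutions can also be folded into one sed call with a capture group, applied to header lines only. This is a sketch rather than a replacement for the answer above. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed both do), and the variable names are made up for the demo. Inside a bracket expression "+" needs no backslash.

```shell
header='>2p__scaffold_2__5799__6580__+__778569__0.00__0.00'

# On header lines only, capture the digits that follow "__+__" or "__-__"
# and replace the whole line with ">" plus that capture.
result=$(printf '%s\n' "$header" | sed -E '/^>/ s/^>.*__[+-]__([0-9]+)__.*/>\1/')
echo "$result"   # >778569
```

On a file, the same expression would be used as sed -E '...' input.fa > output.fa. Sequence lines are untouched because they do not start with ">".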


you will miss out on the __+__ cases with this one ;)

(Reply by lieven.sterck, 2.9 years ago)

Ah, didn't see that at first, edited my answer to fit that requirement

(Reply by swbarnes2, 2.9 years ago)

Thank you for your help!

I tried:

 sed 's/.*__[\+-]__/>/' | sed 's/__.*//' input.fa > output.fa

But it took a long time processing so I canceled.

When I tried:

sed 's/.*__[\+-]__/>/' input.fa > output.fa | sed 's/__.*//' output.fa > output2.fa

The output was empty.

Why does this happen?

Thank you again!!

(Reply by Harumi, 2.9 years ago)

You need to put the input file before the '|' symbol, so like this:

sed 's/.*__[\+-]__/>/' input.fa | sed 's/__.*//' > output.fa

Otherwise the first sed has no file to read and just waits for input from the keyboard, which is why it seemed to take so long.

The second command is also wrong and will indeed never work as intended. The data stream stops at '> output.fa', so nothing is passed through the pipe behind it, and you end up with an empty file, as you mention.
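
A likely reason for the empty file is that both commands in a pipeline start at the same time, so the second sed probably opened output.fa before the first had written anything to it. As a hedged sketch, here are two ways to chain the steps that avoid this. The scratch directory, the file names inside it and the one-record sample are assumptions made for the demo.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' '>2p__scaffold_2__5799__6580__-__778568__0.00__0.00' 'GCTGGCGACGG' > "$tmpdir/input.fa"

# Option 1: a pipe. Only the first sed names the input file; the second
# reads the first one's output from standard input.
sed 's/.*__[+-]__/>/' "$tmpdir/input.fa" | sed 's/__.*//' > "$tmpdir/piped.fa"

# Option 2: an intermediate file. "&&" starts the second sed only after
# the first has finished writing step1.fa.
sed 's/.*__[+-]__/>/' "$tmpdir/input.fa" > "$tmpdir/step1.fa" &&
    sed 's/__.*//' "$tmpdir/step1.fa" > "$tmpdir/sequential.fa"

# Both routes should give the same result.
cmp "$tmpdir/piped.fa" "$tmpdir/sequential.fa" && cat "$tmpdir/piped.fa"
```

The pipe is usually the better choice, since it needs no intermediate file.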

(Reply by lieven.sterck, 2.9 years ago)

It worked! Thank you very much for your helpful explanation! I am still a beginner in bioinformatics

(Reply by Harumi, 2.9 years ago)

cpad0112 (Hyderabad India) wrote, 2.3 years ago:

a little late to the party:

$ sed '/>/ s/^.*__\(\w\+\)__.*/>\1/g' file.fa

or

$ sed '/>/ s/^\(\W\).*__\(\w\+\)__.*/\1\2/g' file.fa 


>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
>830827
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
>811706
GCTGGCGACGGATCTA

"a little late to the party"

Always very welcome, though.

(Reply by Kevin Blighe, 2.3 years ago)