Question: Remove text and keep '> + ID' in fasta file
1
gravatar for Harumi
9 months ago by
Harumi10
Brazil
Harumi10 wrote:

Hello,

I have multiple fasta sequences that are like this:

 >2p__scaffold_2__5799__6580__-__778568__0.00__0.00
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
 >2p__scaffold_2__5799__6580__+__778569__0.00__0.00
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
 >1p__scaffold_2__11235__11438__-__830827__0.00__0.00
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
 GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
 ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
 >1p__scaffold_2__33129__34129__+__811706__0.00__0.00
 GCTGGCGACGGATCTA

And I want to keep just the "> + ID" (numbers after __+/-__ and before __0.00_0.00)

So I expect an output like this:

>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC

I searched for it and tried this:

sed 's@.*__-__@@' input.fa > output.fa

That removed __-__ and everything before it, including the ">" that I wanted to keep.

I also tried this to remove everything between ">" and __-__

sed -e 's/\>//' -e 's/\__-__.*//' input.fa > output.fa

But this removed everything after __-__

And this, that removed __0.00_0.00

sed 's/__0.00.*$//' input.fa > output.fa

Thank you for your help.

sed fasta • 490 views
ADD COMMENTlink modified 9 months ago by Alex Reynolds26k • written 9 months ago by Harumi10
1

Now THIS is how you write a "please help me with fasta headers" question!

ADD REPLYlink written 9 months ago by jrj.healey9.1k
4
gravatar for Alex Reynolds
9 months ago by
Alex Reynolds26k
Seattle, WA USA
Alex Reynolds26k wrote:

Try the following regular expression:

$ awk '{ if ($0 ~ /^>/) { match($1, /[+|-]__[0-9]+__/, m); print ">"substr(m[0], 4, length(m[0]) - 5); } else { print $0; } }' input.fa > output.fa

Then:

$ less output.fa
>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
>830827
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
>811706
GCTGGCGACGGATCTA
ADD COMMENTlink written 9 months ago by Alex Reynolds26k

It worked! Thank you for your help!

ADD REPLYlink written 9 months ago by Harumi10
1

You're quite welcome!

ADD REPLYlink written 9 months ago by Alex Reynolds26k
3
gravatar for cschu181
9 months ago by
cschu1811.4k
cschu1811.4k wrote:

This should work if your headers follow the pattern that you specified:

sed 's/[>_]\+/_/g' yourfile.fasta | cut -f 8 -d _ | sed 's/^\([0-9]\)/>\1/'
ADD COMMENTlink written 9 months ago by cschu1811.4k

It worked! Thank you for your help!

ADD REPLYlink written 9 months ago by Harumi10
3
gravatar for swbarnes2
9 months ago by
swbarnes24.5k
United States
swbarnes24.5k wrote:
test=">2p__scaffold_2__5799__6580__-__778568__0.00__0.00"
echo $test | sed 's@.*__[\+-]__@>@' | sed 's@__.*@@'

Some might find:

 sed 's/.*__[\+-]__/>/' | sed 's/__.*//'

to be a bit more readable.

ADD COMMENTlink modified 9 months ago • written 9 months ago by swbarnes24.5k
1

you will miss out on the __+__ cases with this one ;)

ADD REPLYlink written 9 months ago by lieven.sterck3.3k
2

Ah, didn't see that at first, edited my answer to fit that requirement

ADD REPLYlink modified 9 months ago • written 9 months ago by swbarnes24.5k

Thank you for your help!

I tried:

 sed 's/.*__[\+-]__/>/' | sed 's/__.*//' input.fa > output.fa

But it took a long time processing so I canceled.

When I tried:

sed 's/.*__[\+-]__/>/' input.fa > output.fa | sed 's/__.*//' output.fa > output2.fa

The output was empty.

Why does this happen?

Thank you again!!

ADD REPLYlink modified 9 months ago • written 9 months ago by Harumi10
1

You need to put the input file before the '|' symbol, so like this:

sed 's/.*__[\+-]__/>/' input.fa | sed 's/__.*//' > output.fa

otherwise it is just waiting for input (== why it is taking so long)

the second is a wrong syntax and will indeed never work. The data stream stopped at ' > output.fa' so any pipe or such behind it will not do anything (and create empty file as you mention)

ADD REPLYlink written 9 months ago by lieven.sterck3.3k

It worked! Thank you very much for your helpful explanation! I am still a beginner in bioinformatics

ADD REPLYlink written 9 months ago by Harumi10
1
gravatar for cpad0112
11 weeks ago by
cpad011210k
India
cpad011210k wrote:

a little late to the party:

$ sed '/>/ s/^.*__\(\w\+\)__.*/>\1/g' file.fa

or

$ sed '/>/ s/^\(\W\).*__\(\w\+\)__.*/\1\2/g' file.fa 


>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
>830827
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
>811706
GCTGGCGACGGATCTA
ADD COMMENTlink modified 11 weeks ago • written 11 weeks ago by cpad011210k

a little late to the party:

Always very welcome, though.

ADD REPLYlink written 11 weeks ago by Kevin Blighe33k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2096 users visited in the last hour