Question: Remove text and keep '> + ID' in fasta file
Harumi (Brazil) wrote 18 months ago:

Hello,

I have multiple fasta sequences that are like this:

 >2p__scaffold_2__5799__6580__-__778568__0.00__0.00
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
 >2p__scaffold_2__5799__6580__+__778569__0.00__0.00
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
 >1p__scaffold_2__11235__11438__-__830827__0.00__0.00
 GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
 GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
 ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
 >1p__scaffold_2__33129__34129__+__811706__0.00__0.00
 GCTGGCGACGGATCTA

And I want to keep just the ">" plus the ID (the number after __+__ or __-__ and before __0.00__0.00).

So I expect an output like this:

>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC

I searched for it and tried this:

sed 's@.*__-__@@' input.fa > output.fa

That removed __-__ and everything before it, including the ">" that I wanted to keep.

I also tried this to remove everything between ">" and __-__

sed -e 's/\>//' -e 's/\__-__.*//' input.fa > output.fa

But this removed everything after __-__.

And this, which removed __0.00__0.00:

sed 's/__0.00.*$//' input.fa > output.fa

Thank you for your help.


Now THIS is how you write a "please help me with fasta headers" question!

Reply by Joe, 18 months ago.

Alex Reynolds (Seattle, WA USA) wrote 18 months ago:

Try the following awk command (the three-argument form of match() is a GNU awk extension, so this needs gawk):

$ awk '{ if ($0 ~ /^>/) { match($1, /[+-]__[0-9]+__/, m); print ">"substr(m[0], 4, length(m[0]) - 5); } else { print $0; } }' input.fa > output.fa

Then:

$ less output.fa
>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
>830827
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
>811706
GCTGGCGACGGATCTA
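
The three-argument match() used above is specific to GNU awk. As a rough sketch for other awk implementations, the header can instead be split on the double underscore. This assumes the ID is always the sixth "__"-separated field, as in the headers shown in the question; the function name shorten_headers is made up for this illustration.

```shell
# Hypothetical helper: reads FASTA on stdin, writes FASTA with ">ID" headers.
# Assumes every header has the ID as its 6th "__"-separated field.
shorten_headers() {
    awk -F'__' '
        /^>/ { print ">" $6; next }   # header line: keep only the ID
             { print }                # sequence line: pass through unchanged
    '
}

# Small demo using one record from the question.
printf '%s\n' \
    '>2p__scaffold_2__5799__6580__-__778568__0.00__0.00' \
    'GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA' | shorten_headers
```

On a real file this would be called as shorten_headers < input.fa > output.fa. If some headers have a different number of fields, the pattern-based commands in this thread are the safer choice.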

It worked! Thank you for your help!

Reply by Harumi, 18 months ago.

You're quite welcome!

Reply by Alex Reynolds, 18 months ago.

cschu181 wrote 18 months ago:

This should work if your headers follow the pattern that you specified:

sed 's/[>_]\+/_/g' yourfile.fasta | cut -f 8 -d _ | sed 's/^\([0-9]\)/>\1/'
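
To see why field 8 is the ID, here is a sketch that runs the three stages one at a time on a single header from the question. It uses sed -E with + in place of the \+ above, which is a substitution made here for portability, not part of the original answer. It relies on the scaffold name containing exactly one single underscore (scaffold_2), since that affects the field count.

```shell
header='>2p__scaffold_2__5799__6580__-__778568__0.00__0.00'

# Stage 1: collapse every run of ">" or "_" into a single "_".
stage1=$(printf '%s\n' "$header" | sed -E 's/[>_]+/_/g')

# Stage 2: take the 8th "_"-separated field. Field 1 is empty because
# the line now starts with "_". Lines without "_" (sequences) pass through.
stage2=$(printf '%s\n' "$stage1" | cut -f 8 -d _)

# Stage 3: put ">" back in front of any line that starts with a digit.
stage3=$(printf '%s\n' "$stage2" | sed 's/^\([0-9]\)/>\1/')

printf '%s\n' "$stage1" "$stage2" "$stage3"
```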

It worked! Thank you for your help!

Reply by Harumi, 18 months ago.

swbarnes2 (United States) wrote 18 months ago:

test=">2p__scaffold_2__5799__6580__-__778568__0.00__0.00"
echo "$test" | sed 's@.*__[\+-]__@>@' | sed 's@__.*@@'

Some might find:

 sed 's/.*__[\+-]__/>/' | sed 's/__.*//'

to be a bit more readable.


you will miss out on the __+__ cases with this one ;)

Reply by lieven.sterck, 18 months ago.

Ah, didn't see that at first, edited my answer to fit that requirement

Reply by swbarnes2, 18 months ago.

Thank you for your help!

I tried:

 sed 's/.*__[\+-]__/>/' | sed 's/__.*//' input.fa > output.fa

But it took a long time processing so I canceled.

When I tried:

sed 's/.*__[\+-]__/>/' input.fa > output.fa | sed 's/__.*//' output.fa > output2.fa

The output was empty.

Why does this happen?

Thank you again!!

Reply by Harumi, 18 months ago.

You need to put the input file before the '|' symbol, like this:

sed 's/.*__[\+-]__/>/' input.fa | sed 's/__.*//' > output.fa

Otherwise the first sed has no input file and just waits for input on standard input, which is why it seemed to take so long.

The second command is not valid usage and will never work. The '> output.fa' sends the first sed's output into the file, so nothing goes through the pipe. The second sed starts at the same time and reads output.fa while it is still empty or only partly written, which is why output2.fa came out empty.
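
As a minimal sketch of the working form: the file name goes to the first sed, the second sed reads from the pipe, and the result is redirected only once, at the end. The function name shorten_with_sed and the throwaway demo file are made up for this illustration.

```shell
# Hypothetical helper: $1 is a FASTA file with headers like those in the question.
shorten_with_sed() {
    # The first sed reads the file; its output flows through the pipe
    # into the second sed, which reads standard input.
    sed 's/.*__[+-]__/>/' "$1" | sed 's/__.*//'
}

# Demo on a throwaway file.
demo=$(mktemp)
printf '%s\n' \
    '>2p__scaffold_2__5799__6580__-__778568__0.00__0.00' \
    'GCTGGCGACGGATCTAGG' > "$demo"
shorten_with_sed "$demo"
rm -f "$demo"
```

In real use this would be shorten_with_sed input.fa > output.fa. The pipe can also be avoided by giving one sed both expressions: sed -e 's/.*__[+-]__/>/' -e 's/__.*//' input.fa > output.fa.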

Reply by lieven.sterck, 18 months ago.

It worked! Thank you very much for your helpful explanation! I am still a beginner in bioinformatics.

Reply by Harumi, 18 months ago.

cpad0112 (India) wrote 12 months ago:

A little late to the party (both commands rely on GNU sed extensions such as \w and \+):

$ sed '/>/ s/^.*__\(\w\+\)__.*/>\1/g' file.fa

or

$ sed '/>/ s/^\(\W\).*__\(\w\+\)__.*/\1\2/g' file.fa 


>778568
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAA
>778569
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGC
>830827
GCTGGCGACGGATCTAGGCTCAGCGCAGAAGCAACTGAGAGTCGGCGATGAGCAGCCGGA
GCTTCAATCCAGGGGATCGAGGAGATCCAAAGCAGCAGAAGCGGCTCGACGATGGTGAGG
ATTCGGGATCGGATTCAGCGCTCGTCGGGACTGG
>811706
GCTGGCGACGGATCTA

"a little late to the party:"

Always very welcome, though.

Reply by Kevin Blighe, 12 months ago.