Question

Rename fasta headers

2

7.9 years ago

davkam30 ▴ 20

Hi All,

Can someone help me to rename the following fasta headers

>M42.PS.NODE_36229_length_665_cov_0.0565371:i:367638.gene_3804
>M43.PS.NODE_26456_length_662_cov_0.0603908:i:416949.gene_16048
>M52.PS.NODE_39075_length_545_cov_0.186099:i:724154.gene_14948
>M53.PS.NODE_205_length_10936_cov_0.152441:i:729940.gene_3089

into this format:

>M42_gene_3804
>M43_gene_16048
>M52_gene_14948
>M53_gene_3089

The fasta file has 80,0000 such headers! I am totally new :-) so I would really appreciate a way of doing this using "awk" or even "sed".

Thanks in advance.

Daudi

fasta • 5.0k views

ADD COMMENT • link updated 7.9 years ago by cpad0112 21k • written 7.9 years ago by davkam30 ▴ 20

2

You may be totally new, but I am sure you are not totally lazy. What have you tried? Any efforts so far?

You may do this with sed and capture groups, for example, to capture the first group:

sed "s|\(^>.*\)\.PS\.NODE_.*|\1_|" in.fas > out.fas

Explanation: s| starts the substitution (with | as the delimiter), \( \) defines a capture group, ^ anchors the search at the beginning of the line (unnecessary here, but doesn't hurt either), and \1 outputs the captured group (you may use \2, \3, etc., for additional groups).

You can work out the rest, with a little help from Google.
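
For anyone who wants to compare against their own attempt, here is one possible completion using two capture groups. It is only a sketch, assuming every header ends in gene_<digits> as in the question; the function name rename_headers is made up for illustration.

```shell
# Hypothetical helper: reads FASTA from the given files, or from stdin if none are given.
rename_headers() {
    # Group 1: ">" plus everything up to the first "."
    # Group 2: the trailing "gene_<digits>" after the last "."
    # Lines that do not match (sequence lines, odd headers) pass through unchanged.
    sed 's|^\(>[^.]*\)\..*\.\(gene_[0-9]*\)$|\1_\2|' "$@"
}

printf '%s\n' '>M42.PS.NODE_36229_length_665_cov_0.0565371:i:367638.gene_3804' 'ATG' | rename_headers
# prints:
# >M42_gene_3804
# ATG
```

On a real file this would be called as rename_headers in.fas > out.fas. Because the pattern is anchored at both ends, a header in a different format is left as it is rather than being half-rewritten.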

ADD REPLY • link 7.9 years ago by h.mon 35k

0

Thanks h.mon.

I had already tried a few different approaches with "sed" but failed miserably. Next time I will post what I have tried. I appreciate the help.

ADD REPLY • link 7.9 years ago by davkam30 ▴ 20

0

Well I've learned something new! I didn't know sed delimiting could be done with |, always thought it had to be a /!

ADD REPLY • link 7.9 years ago by Joe 22k

1

It can be (almost) anything. Read about the Leaning toothpick syndrome. Using other characters as delimiters aims at mitigating this syndrome.

ADD REPLY • link 7.9 years ago by h.mon 35k

score 3 · Answer 1 · 2017-08-28

3

7.9 years ago

Joe 22k

$ cat headers.txt
>M42.PS.NODE_36229_length_665_cov_0.0565371:i:367638.gene_3804
>M43.PS.NODE_26456_length_662_cov_0.0603908:i:416949.gene_16048
>M52.PS.NODE_39075_length_545_cov_0.186099:i:724154.gene_14948
>M53.PS.NODE_205_length_10936_cov_0.152441:i:729940.gene_3089

$ cat headers.txt | sed 's/\..*\./_/'
>M42_gene_3804
>M43_gene_16048
>M52_gene_14948
>M53_gene_3089

Using sed, this says: starting from a literal "." (\.), match any character (.) any number of times (*) until you reach another literal "." (\.), and replace the whole match with an underscore (_).

This works because the * quantifier in sed is "greedy" by default: it always matches the longest possible string, so here the match runs from the first period, straight past the periods in the middle of the header, to the last one.

Pretty confusing in this example because it needs literal periods and the special period character!

EDIT: I'll add the caveat that this assumes ALL of your headers follow exactly the same format; otherwise it'll break somewhere.
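
One way to reduce that risk is to apply the substitution only to header lines, using the /^>/ address. This is a sketch rather than a guarantee: it still assumes the header layout shown above, and the function name rename_headers_only and the dotted sequence line are made up for illustration.

```shell
# Hypothetical helper: same substitution as above, restricted to lines starting with ">".
rename_headers_only() {
    sed '/^>/ s/\..*\./_/' "$@"
}

# The second line stands in for a sequence line that happens to contain periods.
printf '%s\n' '>M42.PS.NODE_36229_length_665_cov_0.0565371:i:367638.gene_3804' 'AT.G.C' | rename_headers_only
# prints:
# >M42_gene_3804
# AT.G.C
```

Without the address, the greedy match would also rewrite the second line to AT_C.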

ADD COMMENT • link 7.9 years ago by Joe 22k

1

This one is elegant! No need for cat, though.

ADD REPLY • link 7.9 years ago by h.mon 35k

0

Yeah I know ;) just a lazy copy and paste from my terminal to demonstrate the output.

@davkam30, as h.mon was getting at, you can run this directly on your actual file, like so:

sed -i 's/\..*\./_/' myfile.fasta
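
Since in-place editing overwrites the file, it may be worth keeping a backup. The sketch below uses the -i.bak form, which GNU sed and BSD/macOS sed both accept (the bare -i form is GNU syntax). It runs in a temporary directory, and the workdir variable is just for the demonstration.

```shell
# Work on a throwaway copy so no real file is touched.
workdir=$(mktemp -d)
printf '%s\n' '>M52.PS.NODE_39075_length_545_cov_0.186099:i:724154.gene_14948' 'CAT' > "$workdir/myfile.fasta"

# -i.bak edits myfile.fasta in place and keeps the original as myfile.fasta.bak
sed -i.bak 's/\..*\./_/' "$workdir/myfile.fasta"

head -n 1 "$workdir/myfile.fasta"      # renamed header
head -n 1 "$workdir/myfile.fasta.bak"  # original header, untouched
```

Once the result looks right, the .bak file can be deleted.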

ADD REPLY • link 7.9 years ago by Joe 22k

0

Thanks jrj.healey! This one is straightforward and well explained. I am sure it will be useful to others.

ADD REPLY • link 7.9 years ago by davkam30 ▴ 20

0

Go ahead and accept (green check mark) this answer to provide closure for this thread.

ADD REPLY • link 7.9 years ago by GenoMax 152k

score 1 · Answer 2 · 2017-08-28

Assuming that the sequences in the fasta file are linearized (each sequence on a single line), try the following code:

grep \> -A 1 headers.fa | cut -d. -f1,5 --output-delimiter _

Example fasta with example headers from above:

$ cat headers.fa
>M42.PS.NODE_36229_length_665_cov_0.0565371:i:367638.gene_3804
ATG
>M43.PS.NODE_26456_length_662_cov_0.0603908:i:416949.gene_16048
GAT
>M52.PS.NODE_39075_length_545_cov_0.186099:i:724154.gene_14948
CAT
>M53.PS.NODE_205_length_10936_cov_0.152441:i:729940.gene_3089
TGC

output:

$ grep \> -A 1 headers.fa | cut -d. -f1,5 --output-delimiter _
>M42_gene_3804
ATG
>M43_gene_16048
GAT
>M52_gene_14948
CAT
>M53_gene_3089
TGC

Since the sequences are linearized, cut -d. -f1,5 --output-delimiter _ input.fa should work directly, without grep, provided that the sequence lines contain no full stops.
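
Since the question also asked about awk, here is a rough equivalent that touches only header lines, so it should not matter whether sequences are wrapped or contain full stops. It assumes the gene ID is always the last "."-separated field of the header; the function name rename_with_awk is made up for illustration.

```shell
# Hypothetical helper: reads FASTA from the given files, or from stdin if none are given.
rename_with_awk() {
    # -F. splits each line on "."; on header lines print the first and last
    # fields joined by "_", and print every other line unchanged.
    awk -F. '/^>/ { print $1 "_" $NF; next } { print }' "$@"
}

printf '%s\n' '>M53.PS.NODE_205_length_10936_cov_0.152441:i:729940.gene_3089' 'TGC' 'ATG' | rename_with_awk
# prints:
# >M53_gene_3089
# TGC
# ATG
```

On a real file this would be called as rename_with_awk input.fa > output.fa.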