Question: Rename fasta headers
davkam30 wrote:

Hi All,

Can someone help me to rename the following fasta headers

>M42.PS.NODE_36229_length_665_cov_0.0565371:i:367638.gene_3804
>M43.PS.NODE_26456_length_662_cov_0.0603908:i:416949.gene_16048
>M52.PS.NODE_39075_length_545_cov_0.186099:i:724154.gene_14948
>M53.PS.NODE_205_length_10936_cov_0.152441:i:729940.gene_3089

into this form:

>M42_gene_3804
>M43_gene_16048
>M52_gene_14948
>M53_gene_3089

The fasta file has 80,0000 such headers! I am totally new :-) so I would really appreciate a way of doing this using "awk" or even "sed".

Thanks in advance.

Daudi

written 2.8 years ago by davkam30

You may be totally new, but I am sure you are not totally lazy. What have you tried? Any efforts so far?

You may do this with sed and capture groups, for example, to capture the first group:

sed "s|\(^>.*\)\.PS\.NODE_.*|\1_|" in.fas > out.fas

Explanation: s| starts the substitution (with | as the delimiter), \( \) defines a capture group, ^ anchors the match at the beginning of the line (unnecessary here, but it doesn't hurt either), and \1 outputs the captured group (you may use \2, \3, etc. for additional groups).

You can work out the rest with a little help from Google.
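For later readers, here is a minimal sketch of how the rest might look with a second capture group. The function name rename_headers is made up for illustration, and the pattern assumes every header ends in ".gene_<number>" with no trailing whitespace.

```shell
# Hypothetical helper: reads FASTA on stdin, writes renamed FASTA to stdout.
rename_headers() {
    # Group 1: ">" plus the sample ID in front of ".PS.NODE_"
    # Group 2: the trailing "gene_<number>" after the last period
    # Lines that do not match (the sequences) pass through unchanged.
    sed 's|^\(>[^.]*\)\.PS\.NODE_.*\.\(gene_[0-9]*\)$|\1_\2|'
}

printf '%s\n' '>M42.PS.NODE_36229_length_665_cov_0.0565371:i:367638.gene_3804' 'ATG' | rename_headers
```

On a real file this would be used as rename_headers < in.fas > out.fas.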

written 2.8 years ago by h.mon

Thanks h.mon.

I had already tried out differently with "sed" but failed miserably. Next time I will post what I have tried. I appreciate the help.

written 2.8 years ago by davkam30

Well I've learned something new! I didn't know sed delimiting could be done with |, always thought it had to be a /!

written 2.8 years ago by Joe

It can be (almost) anything. Read about the Leaning toothpick syndrome. Using other characters as delimiters aims at mitigating this syndrome.
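A small illustration of the difference, using a made-up path replacement (the path and variable names are only for demonstration). Both commands should do the same thing; the second is just easier to read.

```shell
# With "/" as the delimiter, every slash in the path has to be escaped:
with_slash=$(printf '%s\n' 'PATH=/usr/local/bin' | sed 's/\/usr\/local\/bin/\/opt\/bin/')

# With "|" as the delimiter, the slashes can be written as they are:
with_pipe=$(printf '%s\n' 'PATH=/usr/local/bin' | sed 's|/usr/local/bin|/opt/bin|')

printf '%s\n' "$with_slash" "$with_pipe"
```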

written 2.8 years ago by h.mon

Joe wrote:
$ cat headers.txt
>M42.PS.NODE_36229_length_665_cov_0.0565371:i:367638.gene_3804
>M43.PS.NODE_26456_length_662_cov_0.0603908:i:416949.gene_16048
>M52.PS.NODE_39075_length_545_cov_0.186099:i:724154.gene_14948
>M53.PS.NODE_205_length_10936_cov_0.152441:i:729940.gene_3089

$ cat headers.txt | sed 's/\..*\./_/'
>M42_gene_3804
>M43_gene_16048
>M52_gene_14948
>M53_gene_3089

Using sed, this says from a literal "." (\.) match any character (.), any number of times (*) until you meet another literal "." (\.), and substitute it with an underscore (/_/).

This works because sed's regular expressions are "greedy": they always try to match the longest possible string, so in this case the match runs right over the periods in the middle of the header and ends at the last one.

Pretty confusing in this example because it needs literal periods and the special period character!
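To see the greedy behaviour in isolation, here is a quick comparison on one of the headers above. The variable names are arbitrary; the second pattern uses [^.]* (any character except a period), which cannot run past the next period.

```shell
header='>M42.PS.NODE_36229_length_665_cov_0.0565371:i:367638.gene_3804'

# Greedy: ".*" stretches from the first period to the last one.
greedy=$(printf '%s\n' "$header" | sed 's/\..*\./_/')

# Restricted: "[^.]*" stops at the next period, so only ".PS." is replaced.
restricted=$(printf '%s\n' "$header" | sed 's/\.[^.]*\./_/')

printf '%s\n' "$greedy" "$restricted"
```

The first line printed is the short header; the second still carries most of the original text, which is why the greedy version is the one wanted here.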

EDIT: I'll add the caveat that this assumes ALL of your headers have exactly the same format, else this'll break somewhere.
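One possible safeguard is to put an address in front of the substitution so that it only runs on header lines. This is a sketch rather than a tested pipeline; the function name rename_headers_only is made up, and the sequence line containing periods is invented to show the difference.

```shell
# Hypothetical helper: reads FASTA on stdin, writes renamed FASTA to stdout.
rename_headers_only() {
    # "/^>/" is an address: the s command is applied only to lines starting with ">".
    sed '/^>/ s/\..*\./_/'
}

# The made-up sequence line "AT.G.C" would become "AT_C" without the address.
printf '%s\n' '>M52.PS.NODE_39075_length_545_cov_0.186099:i:724154.gene_14948' 'AT.G.C' | rename_headers_only
```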

written 2.8 years ago by Joe

This one is elegant! No need for cat, though.

written 2.8 years ago by h.mon

Yeah I know ;) just a lazy copy and paste from my terminal to demonstrate the output.

@davkam30, as h.mon was getting at you can invoke this to edit your actual file like so:

sed -i 's/\..*\./_/' myfile.fasta
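Since -i overwrites the file, it may be worth keeping a backup. A sketch, assuming a sed that accepts a suffix attached to -i (to my knowledge both GNU and BSD sed do); the temporary directory is only there so the example is safe to run.

```shell
# Work on a throwaway copy rather than a real file.
tmpdir=$(mktemp -d)
printf '%s\n' '>M53.PS.NODE_205_length_10936_cov_0.152441:i:729940.gene_3089' 'TGC' > "$tmpdir/myfile.fasta"

# -i.bak edits the file in place and keeps the original as myfile.fasta.bak
sed -i.bak 's/\..*\./_/' "$tmpdir/myfile.fasta"

cat "$tmpdir/myfile.fasta"
```

If the result looks wrong, the untouched original is still in myfile.fasta.bak.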

written 2.8 years ago by Joe

Thanks jrj.healey! This one is straightforward and well explained. I am sure it will be useful to others.

written 2.8 years ago by davkam30

Go ahead and accept (green check mark) this answer to provide closure for this thread.

written 2.8 years ago by genomax

cpad0112 wrote:

Assuming that the sequences in the fasta file are linearized (one line per sequence), try the following code:

grep \> -A 1 headers.fa | cut -d. -f1,5 --output-delimiter _

Example fasta with example headers from above:

 $ cat headers.fa 
>M42.PS.NODE_36229_length_665_cov_0.0565371:i:367638.gene_3804
ATG
>M43.PS.NODE_26456_length_662_cov_0.0603908:i:416949.gene_16048
GAT
>M52.PS.NODE_39075_length_545_cov_0.186099:i:724154.gene_14948
CAT
>M53.PS.NODE_205_length_10936_cov_0.152441:i:729940.gene_3089
TGC

output:

 $  grep \> -A 1 headers.fa | cut -d. -f1,5 --output-delimiter _
>M42_gene_3804
ATG
>M43_gene_16048
GAT
>M52_gene_14948
CAT
>M53_gene_3089
TGC

In fact, cut -d. -f1,5 --output-delimiter _ input.fa should work directly without grep: cut passes lines that contain no delimiter through unchanged, so the sequence lines are left alone as long as they contain no full stops.
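Since the question also mentioned awk, here is a rough awk equivalent. It takes the first and the last period-separated field instead of fields 1 and 5, so it should also cope with a header whose coverage value has no decimal point. The function name rename_with_awk and the integer-coverage header in the example are made up for illustration.

```shell
# Hypothetical helper: reads FASTA on stdin, writes renamed FASTA to stdout.
rename_with_awk() {
    # -F. splits each line on periods; $1 is ">M42", $NF is the last field ("gene_3804").
    # Header lines are rewritten; every other line is printed unchanged.
    awk -F. '/^>/ { print $1 "_" $NF; next } { print }'
}

printf '%s\n' \
    '>M42.PS.NODE_36229_length_665_cov_0.0565371:i:367638.gene_3804' 'ATG' \
    '>M99.PS.NODE_1_length_10_cov_3:i:1.gene_7' 'GAT' | rename_with_awk
```

Unlike the grep -A 1 approach, this does not depend on the sequences being linearized.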

written 2.8 years ago by cpad0112