You may be totally new, but I am sure you are not totally lazy. What have you tried? Any efforts so far?
You may do this with sed and capture groups, for example, to capture the first group:
sed "s|\(^>.*\)\.PS\.NODE_.*|\1_|" in.fas > out.fas
Explanation: s| starts the substitution, \( \) defines the capture group, ^ anchors the search at the beginning of the line (unnecessary here, but doesn't hurt either), and \1 outputs the captured group (you may use \2, \3, etc. for additional groups).
You can work out the rest with a little help from Google.
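For readers who want to see a second capture group in action, here is a minimal sketch. The header string is hypothetical (the real headers in the question may differ), and the assumption is that each header ends in a field of the form gene_<number>.

```shell
# Hypothetical header; adjust the pattern if your headers differ.
header='>M42.PS.NODE_1_length_2000_cov_12.5.gene_3804'

# Group 1: everything from ">" up to ".PS.NODE_" (the sample ID).
# Group 2: the trailing "gene_<number>" field.
renamed=$(printf '%s\n' "$header" |
  sed 's|^\(>.*\)\.PS\.NODE_.*\.\(gene_[0-9]*\)$|\1_\2|')

printf '%s\n' "$renamed"
```

Sequence lines do not match the pattern, so sed passes them through unchanged.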
Using sed, a substitution along the lines of s/\..*\./_/ says: from a literal "." (\.), match any character (.), any number of times (*), until you meet another literal "." (\.), and substitute the whole match with an underscore (/_/).
This works because sed's * is "greedy": it always tries to match the longest possible string, so in this case it skips right over the periods it encounters in the middle of the header and stops at the last one.
Pretty confusing in this example because it needs literal periods and the special period character!
EDIT: I'll add the caveat that ALL of your headers must have exactly the same format, else this will break somewhere.
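The greedy behaviour described above can be checked with a short sketch. The header is a made-up example and the sed expression is reconstructed from the explanation, so treat both as assumptions. The second command uses [^.]* instead of .* to show what happens when the match is forced to stop at the first period.

```shell
# Hypothetical header with several periods in it.
header='>M42.PS.NODE_1_length_2000_cov_12.5.gene_3804'

# Greedy: ".*" runs from the first period to the LAST period.
greedy=$(printf '%s\n' "$header" | sed 's/\..*\./_/')

# Restricted: "[^.]*" cannot cross a period, so only ".PS." is replaced.
restricted=$(printf '%s\n' "$header" | sed 's/\.[^.]*\./_/')

printf '%s\n%s\n' "$greedy" "$restricted"
```

Only the greedy form collapses everything between the first and last period into a single underscore.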
$ grep \> -A 1 headers.fa | cut -d. -f1,5 --output-delimiter _
>M42_gene_3804
ATG
>M43_gene_16048
GAT
>M52_gene_14948
CAT
>M53_gene_3089
TGC
Since the sequences are linearized, cut -d. -f1,5 --output-delimiter _ input.fa should work straight away without grep, assuming there are no full stops in the sequence lines (cut prints lines that contain no delimiter unchanged).
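A small self-contained sketch of the cut-only variant follows. The two records are invented for illustration, and --output-delimiter is a GNU cut option, so this assumes GNU coreutils (it may not be available in BSD/macOS cut).

```shell
# Hypothetical linearized FASTA: one header line, one sequence line per record.
fasta='>M42.PS.NODE_1_length_2000_cov_12.5.gene_3804
ATG
>M43.PS.NODE_7_length_900_cov_3.1.gene_16048
GAT'

# Keep dot-separated fields 1 and 5 and join them with "_".
# Lines without a "." (the sequences) are printed as they are.
cut_out=$(printf '%s\n' "$fasta" | cut -d. -f1,5 --output-delimiter=_)

printf '%s\n' "$cut_out"
```

Note that this relies on the field count: a header with a different number of periods would yield the wrong fields.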
Thanks h.mon.
I had already tried out differently with "sed" but failed miserably. Next time I will post what I have tried. I appreciate the help.
Well I've learned something new! I didn't know sed delimiting could be done with |, always thought it had to be a /!
It can be (almost) anything. Read about the Leaning toothpick syndrome. Using other characters as delimiters aims at mitigating this syndrome.
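As a quick illustration of the point about delimiters, here is a sketch with an arbitrary example path. Both commands perform the same substitution; only the delimiter differs.

```shell
# Arbitrary example path, chosen only because it contains slashes.
path='/usr/local/bin'

# With "/" as delimiter every slash in the pattern must be escaped.
with_slashes=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')

# With "|" as delimiter the slashes need no escaping.
with_pipes=$(printf '%s\n' "$path" | sed 's|/usr/local|/opt|')

printf '%s\n%s\n' "$with_slashes" "$with_pipes"
```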