Question: How do I change the fasta format from CLC Workbench for MG-RAST

9.2 years ago

djuna.gulliver • 0

I'm using a trial of CLC Workbench for assemblies. I would like to enter my assembled fa files into MG-RAST. However, CLC Workbench gives files in the form of:

>sequence_1 Average coverage: 5.6
ACCAGCGTTCTCTACACA
>sequence_2 Average coverage: 6.4
GTTATACAGGATAAGAATC

And so forth (of course, my contigs are much longer). MG-RAST requests a format such as:

>sequence_1_[cov=5.6]
ACCAGCGTTCTCTACACA
>sequence_2_[cov=6.4]
GTTATACAGGATAAGAATC

It is easy enough to get halfway there. The command below (where BG1.fa is my input file and BG1con.fa is the new output file):

<BG1.fa sed 's/ Average coverage: /_[cov=/g' >BG1con.fa

Gets me to the following fa format:

>sequence_1_[cov=5.6
ACCAGCGTTCTCTACACA
>sequence_2_[cov=6.4
GTTATACAGGATAAGAATC

But I just cannot get that last little bracket at the end. I've tried a couple of things, but it always puts the bracket on a new line such as:

>sequence_1_[cov=5.6
]
ACCAGCGTTCTCTACACA
>sequence_2_[cov=6.4
]
GTTATACAGGATAAGAATC

I must apologize, for I am brand new to the sed language, and it still is pretty confusing for me.

Any idea how to elegantly (or not) get the last bracket up onto the header line?

Assembly • 2.7k views


Ram · Answer 1 · 2015-02-17

9.2 years ago

Devon Ryan 104k

...input commands... | awk '{if($0 ~ /^>/) {$1=$1"]"} print $0}' > output.fa
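For readers who prefer to do the whole conversion in awk, here is a minimal sketch. It is not the command posted above but a variant under stated assumptions: the header format is exactly the one shown in the question, and the sample file names are hypothetical. It also trims trailing blanks or a carriage return before appending the bracket.

```shell
# Hypothetical one-record sample; note the trailing space on the header line.
printf '>sequence_1 Average coverage: 5.6 \nACCAGCGTTCTCTACACA\n' > sample_awk_in.fa

# Act only on header lines where the coverage label was found and replaced.
awk '/^>/ && sub(/ Average coverage: /, "_[cov=") {
    sub(/[ \t\r]+$/, "")   # drop trailing blanks or a carriage return
    $0 = $0 "]"            # close the bracket
}
{ print }' sample_awk_in.fa > sample_awk_out.fa

cat sample_awk_out.fa
```

Header lines without a coverage label and all sequence lines pass through unchanged.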

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.2 years ago by Devon Ryan 104k

You can very likely do this directly with sed too (I expect someone else will post that method).

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.2 years ago by Devon Ryan 104k

It's getting much closer. Both the awk language suggestion and the sed language suggestion resulted in the "]" being placed in the correct spot, but the first line of sequence is then tagged on the end of the header. So it looks like:

>sequence_1_[cov=5.6]ACCAGCGTTCTCTACACA
ATTACACGGCACCCAC
>sequence_2_[cov=6.4 ]GTTATACAGGATAAGAATC
GGCCCACTATTATATCA

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.2 years ago by djuna.gulliver • 0

Ram · Answer 2 · 2015-02-17

9.2 years ago

Neilfws 49k

Awk works as Devon illustrated; the sed solution is:

sed -E 's/ Average coverage: (.+)/_[cov=\1]/' BG1.fa > BG1con.fa

The -E switch enables extended regular expressions; the \1 refers to everything that was captured following "Average coverage: ", so this assumes that no header line contains anything after the coverage value.
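If header lines might carry something after the coverage value, one possible variant is to capture only the digits and the decimal point and discard the rest of the line. This is a sketch, not the command from the answer; the sample file names are hypothetical and the header layout is assumed to match the question.

```shell
# Hypothetical two-record sample in the CLC-style format from the question.
printf '>sequence_1 Average coverage: 5.6\nACCAGCGTTCTCTACACA\n>sequence_2 Average coverage: 6.4\nGTTATACAGGATAAGAATC\n' > sample_clc.fa

# /^>/ limits the substitution to header lines; ([0-9.]+) captures only the
# coverage number, and .*$ discards anything that follows it on the line.
sed -E '/^>/ s/ Average coverage: ([0-9.]+).*$/_[cov=\1]/' sample_clc.fa > sample_mgrast.fa

cat sample_mgrast.fa
```

Restricting the capture this way means stray trailing characters on the header line do not end up inside the brackets.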

Solution was found here.

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.2 years ago by Neilfws 49k

It's getting much closer. Both the awk language suggestion and the sed language suggestion resulted in the "]" being placed in the correct spot, but the first line of sequence is then tagged on the end of the header. So it looks like:

>sequence_1_[cov=5.6]ACCAGCGTTCTCTACACA
ATTACACGGCACCCAC
>sequence_2_[cov=6.4 ]GTTATACAGGATAAGAATC
GGCCCACTATTATATCA

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.2 years ago by djuna.gulliver • 0

Not on my (Ubuntu Linux) machine. Could be a line endings issue with your OS.
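If line endings are the cause, a quick check is to count the lines containing a carriage return and strip those characters before running sed. The sketch below assumes Windows-style CRLF endings, which the thread does not confirm; the sample file names are hypothetical.

```shell
# Hypothetical sample with CRLF line endings.
printf '>sequence_1 Average coverage: 5.6\r\nACCAGCGTTCTCTACACA\r\n' > sample_crlf.fa

# Count lines that contain a carriage return; a non-zero count suggests CRLF endings.
cr=$(printf '\r')
grep -c "$cr" sample_crlf.fa

# Remove the carriage returns first, then rewrite the header with the sed command from the answer.
tr -d '\r' < sample_crlf.fa | sed -E 's/ Average coverage: (.+)/_[cov=\1]/' > sample_fixed.fa

cat sample_fixed.fa
```

Without the `tr` step, the carriage return would be captured by `(.+)` and end up just before the closing bracket.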

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.2 years ago by Neilfws 49k

Ram · Answer 3 · 2015-02-24

9.2 years ago

djuna.gulliver • 0

Got it!

sed 's/ Average coverage: /_[cov=/g' BG.fa | sed 's/[0-9].[0-9][0-9]*/&]/g' >BGcon.fa

That took way longer for me to figure out than I'll ever admit to my supervisor.

Well, my day can only go downhill from here, I should just call it a day.
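A note on the command above: in a sed replacement, `&` stands for whatever the pattern matched, so `&]` re-emits the number and appends the bracket. The unescaped `.` matches any character, though, so a sequence name containing digit, character, digit (such as the hypothetical `contig_2_7` below) would also gain a bracket. A slightly stricter sketch, assuming the coverage value always contains a decimal point, anchors the match to `cov=` and escapes the dot; the file names are hypothetical.

```shell
# Hypothetical sample; the second name contains digits separated by an underscore.
printf '>sequence_1 Average coverage: 5.6\nACCAGCGTTCTCTACACA\n>contig_2_7 Average coverage: 12.25\nGTTATACAGGATAAGAATC\n' > sample_amp_in.fa

# Step 1: rename the coverage label. Step 2: on header lines only, append "]"
# right after the number that follows "cov=" (& is the matched text).
sed 's/ Average coverage: /_[cov=/' sample_amp_in.fa |
    sed '/^>/ s/cov=[0-9][0-9]*\.[0-9][0-9]*/&]/' > sample_amp_out.fa

cat sample_amp_out.fa
```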

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.2 years ago by djuna.gulliver • 0