Question: How do I change the FASTA format from CLC Workbench for MG-RAST?
djuna.gulliver (United States), 5.7 years ago, wrote:

I'm using a trial of CLC Workbench for assemblies. I would like to upload my assembled FASTA (.fa) files to MG-RAST. However, CLC Workbench writes files in this form:

>sequence_1 Average coverage: 5.6
ACCAGCGTTCTCTACACA
>sequence_2 Average coverage: 6.4
GTTATACAGGATAAGAATC

And so forth (of course, my contigs are much longer). MG-RAST requests a format such as:

>sequence_1_[cov=5.6]
ACCAGCGTTCTCTACACA
>sequence_2_[cov=6.4]
GTTATACAGGATAAGAATC

It is easy enough to get halfway there. The command below (where BG1.fa is my input file and BG1con.fa is the new output file):

<BG1.fa sed 's/ Average coverage: /_[cov=/g' >BG1con.fa

Gets me to the following fa format:

>sequence_1_[cov=5.6
ACCAGCGTTCTCTACACA
>sequence_2_[cov=6.4
GTTATACAGGATAAGAATC

But I just cannot get that last little bracket at the end.  I've tried a couple of things, but it always puts the bracket on a new line such as:

>sequence_1_[cov=5.6
]
ACCAGCGTTCTCTACACA
>sequence_2_[cov=6.4
]
GTTATACAGGATAAGAATC

I must apologize, for I am brand new to the sed language, and it still is pretty confusing for me.

Any idea how to elegantly (or not) get the last bracket up?

 

 

Devon Ryan (Freiburg, Germany), 5.7 years ago, wrote:
...input commands... | awk '{if($0 ~ /^>/) {$1=$1"]"} print $0}' > output.fa
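
For reference, a minimal end-to-end sketch of this approach. It assumes the "input commands" are the sed substitution from the question and that the file has Unix line endings; the two-record sample file and the temporary directory are made up for illustration.

```shell
# Hypothetical two-record sample in the CLC Workbench header style.
workdir=$(mktemp -d)
printf '>sequence_1 Average coverage: 5.6\nACCAGCGTTCTCTACACA\n>sequence_2 Average coverage: 6.4\nGTTATACAGGATAAGAATC\n' > "$workdir/BG1.fa"

# Step 1 (sed, from the question): replace the coverage label with "_[cov=".
# Step 2 (awk): on header lines only, append "]" to the first field,
# which by now is the whole header because it no longer contains spaces.
sed 's/ Average coverage: /_[cov=/' "$workdir/BG1.fa" |
  awk '{ if ($0 ~ /^>/) { $1 = $1 "]" } print $0 }' > "$workdir/BG1con.fa"

cat "$workdir/BG1con.fa"
```

Sequence lines do not start with ">", so awk prints them unchanged.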

You can very likely do this directly with sed too (I expect someone else will post that method).

(reply written 5.7 years ago by Devon Ryan)

Neilfws (Sydney, Australia), 5.7 years ago, wrote:

Awk works as Devon illustrated; the sed solution is:

sed -E 's/ Average coverage: (.+)/_[cov=\1]/' BG1.fa > BG1con.fa

The -E switch enables extended regular expressions. The \1 refers to everything captured after "Average coverage: " up to the end of the line, so this assumes that no header line contains anything after the coverage value.

Solution was found here.
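
To see the capture group at work, here is a small sketch that runs the same substitution on a made-up sample file (the file contents and temporary directory are assumptions for illustration). The last command feeds in a hypothetical header with trailing text to show the caveat above: whatever follows the coverage value ends up inside the brackets.

```shell
# Hypothetical two-record sample with Unix line endings.
workdir=$(mktemp -d)
printf '>sequence_1 Average coverage: 5.6\nACCAGCGTTCTCTACACA\n>sequence_2 Average coverage: 6.4\nGTTATACAGGATAAGAATC\n' > "$workdir/BG1.fa"

# (.+) captures the rest of the line after "Average coverage: ";
# \1 re-inserts that text between "_[cov=" and "]".
sed -E 's/ Average coverage: (.+)/_[cov=\1]/' "$workdir/BG1.fa" > "$workdir/BG1con.fa"
cat "$workdir/BG1con.fa"

# Caveat: trailing header text is captured too (made-up header).
trailing=$(printf '>sequence_3 Average coverage: 7.1 length: 20\n' |
  sed -E 's/ Average coverage: (.+)/_[cov=\1]/')
echo "$trailing"
```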


It's getting much closer. Both the awk suggestion and the sed suggestion put the "]" in the correct spot, but the first line of sequence is then tacked onto the end of the header. So it looks like:

>sequence_1_[cov=5.6]ACCAGCGTTCTCTACACA
ATTACACGGCACCCAC
>sequence_2_[cov=6.4 ]GTTATACAGGATAAGAATC
GGCCCACTATTATATCA
(reply written 5.7 years ago by djuna.gulliver)

Not on my (Ubuntu Linux) machine. Could be a line endings issue with your OS.

(reply written 5.7 years ago by Neilfws)
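
One way to test the line-endings idea, sketched on a made-up file with Windows-style (CRLF) endings. With CRLF, a carriage return sits between the coverage value and the end of the line, so anything appended at the end of the line lands after it. The assumption here is that the stray characters are carriage returns; the sample file and variable names are illustrative only.

```shell
# Hypothetical sample with CRLF line endings, as a Windows tool might write.
workdir=$(mktemp -d)
printf '>sequence_1 Average coverage: 5.6\r\nACCAGCGTTCTCTACACA\r\n' > "$workdir/BG1.fa"

# Count carriage-return bytes; anything above zero points to CRLF endings.
cr_count=$(tr -cd '\r' < "$workdir/BG1.fa" | wc -c | tr -d ' ')
echo "carriage returns found: $cr_count"

# Strip the carriage returns first, then apply the sed command from the answer.
tr -d '\r' < "$workdir/BG1.fa" |
  sed -E 's/ Average coverage: (.+)/_[cov=\1]/' > "$workdir/BG1con.fa"

cat "$workdir/BG1con.fa"
```

If the count is zero, the cause is something other than line endings.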
djuna.gulliver (United States), 5.7 years ago, wrote:

Got it!

sed 's/ Average coverage: /_[cov=/g' BG.fa | sed 's/[0-9]\.[0-9][0-9]*/&]/g' >BGcon.fa

That took way longer for me to figure out than I'll ever admit to my supervisor.

Well, my day can only go downhill from here, I should just call it a day.
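
A variant sketch of the same idea in a single sed call, for anyone with contig names that contain digits. It ties the bracket to the "cov=" prefix and only touches header lines, and it does not depend on where the line ends. The sample file, the contig name sequence_100, and the temporary directory are made up for illustration.

```shell
# Hypothetical sample whose contig name contains several digits.
workdir=$(mktemp -d)
printf '>sequence_100 Average coverage: 15.6\nACCAGCGTTCTCTACACA\n' > "$workdir/BG.fa"

# Two expressions, both limited to header lines (/^>/):
#   1. swap the coverage label for "_[cov="
#   2. append "]" right after the run of digits and dots that follows "cov="
sed -e '/^>/ s/ Average coverage: /_[cov=/' \
    -e '/^>/ s/cov=[0-9.]*/&]/' "$workdir/BG.fa" > "$workdir/BGcon.fa"

cat "$workdir/BGcon.fa"
```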
