Question

Alternative problem with editing fasta file headers - keeping Taxon name in brackets

8.2 years ago

Micro_Warwick • 0

Hi

I have headers like the following in my FASTA files downloaded from IMG/JGI:

2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]

I would like this:

Microbacterium sp. GCS4 : Ga0098755_14

The strings differ from header to header. I found this command to try:

sed 's/.*\[\([^]]*\)\].*/\1/g'

It works, but I need to keep the '>' at the start of each header, since it marks the beginning of each sequence in the FASTA file. Is there a pair of parentheses I can add to keep this character with my current command?

Cheers in advance!

sequencing blast sequence fasta • 1.8k views

ADD COMMENT • link updated 8.2 years ago by Pierre Lindenbaum 161k • written 8.2 years ago by Micro_Warwick • 0

Yes, you can use another pair of parentheses to capture the '>' at the beginning of the line. On a Debian system I also have to use the '-r' option (extended regular expressions) so that the unescaped parentheses are treated as capture groups.

echo  '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]'  \
| sed -r 's/^(>)[^]]*\[([^]]*)\].*/\1\2/g'
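
To run the same substitution over a whole file rather than a single echoed line, something like the following sketch may help. It assumes a sed that accepts '-E' (the more portable spelling of '-r'; both GNU and BSD sed take it), and the temporary file simply stands in for the real download. The '/^>/' address restricts the substitution to header lines, so sequence lines are passed through unchanged.

```shell
# Build a small demo file; in practice this would be the IMG/JGI FASTA download.
fa=$(mktemp)
printf '%s\n' \
  '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]' \
  'ATACGACGATCGT' \
  '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]' \
  'ATACGACGATCGT' > "$fa"

# Only lines starting with ">" are rewritten: keep ">" plus the text inside [...].
out=$(sed -E '/^>/ s/^>[^]]*\[([^]]*)\].*/>\1/' "$fa")
printf '%s\n' "$out"

rm -f "$fa"
```

Redirecting the sed output to a new file, rather than back onto the input, keeps the original download intact.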

ADD REPLY • link 8.2 years ago by piet ★ 1.8k

score 0 · Answer 1 · 2016-03-07

8.2 years ago

Pierre Lindenbaum 161k

Using awk?

 awk  '/^>/ {i=index($0,"[");j=index($0,"]");print ">" substr($0,i+1,(j-i)-1); next;} {print;}' in.fa > out.fa
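
One thing to be aware of: if a header has no '[' or ']', index() returns 0 and the one-liner prints a bare '>'. A hedged variant that leaves such headers untouched could look like the sketch below. The header '>contig_without_taxon' is a made-up example, not something from the IMG/JGI file.

```shell
out=$(printf '%s\n' \
  '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]' \
  'ATACGACGATCGT' \
  '>contig_without_taxon' \
  'GGCCTTAA' |
awk '
  /^>/ {
    i = index($0, "[")    # position of the first "[" (0 if absent)
    j = index($0, "]")    # position of the first "]" (0 if absent)
    if (i > 0 && j > i)
      print ">" substr($0, i + 1, j - i - 1)
    else
      print               # no usable [...] pair: keep the header as it is
    next
  }
  { print }               # sequence lines pass through unchanged
')
printf '%s\n' "$out"
```

Whether to keep, flag, or drop headers without a bracketed taxon depends on the downstream tool; keeping them unchanged is just one possible choice.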

ADD COMMENT • link 8.2 years ago by Pierre Lindenbaum 161k

Sorry, that just ended up producing a blank file. The file contains 20,000 sequences, not just one, in case that means something else needs to be added.

ADD REPLY • link 8.2 years ago by Micro_Warwick • 0

There is something wrong in the way you are running my awk script:

$ echo -e '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]\nATACGACGATCGT\n>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]\nATACGACGATCGT' | awk '/^>/ {i=index($0,"[");j=index($0,"]");print ">" substr($0,i+1,(j-i)-1); next;} {print;}'

>Microbacterium sp. GCS4 : Ga0098755_14
ATACGACGATCGT
>Microbacterium sp. GCS4 : Ga0098755_14
ATACGACGATCGT
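
One common way to end up with an empty file, which may or may not be what happened here, is redirecting the output onto the input file: the shell truncates the target of '>' before awk starts reading it. A minimal sketch of a safer run, using temporary files as stand-ins for in.fa and out.fa, with a quick header count as a sanity check:

```shell
in_fa=$(mktemp)     # stands in for the real input file
out_fa=$(mktemp)    # a different file from the input

printf '%s\n' \
  '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]' \
  'ATACGACGATCGT' \
  '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]' \
  'ATACGACGATCGT' > "$in_fa"

# Read from one file and write to another, never "in.fa > in.fa".
awk '/^>/ {i=index($0,"[");j=index($0,"]");print ">" substr($0,i+1,(j-i)-1); next;} {print;}' "$in_fa" > "$out_fa"

# The number of headers should be the same before and after.
n_in=$(grep -c '^>' "$in_fa")
n_out=$(grep -c '^>' "$out_fa")
first_header=$(head -n 1 "$out_fa")
echo "headers in: $n_in, headers out: $n_out"
echo "first header: $first_header"

rm -f "$in_fa" "$out_fa"
```

If the two counts differ, or the output count is 0, the problem is in how the command was invoked rather than in the awk program itself.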

ADD REPLY • link 8.2 years ago by Pierre Lindenbaum 161k