Alternative problem with editing fasta file headers - keeping Taxon name in brackets
8.2 years ago

Hi

I have the following headers for my fasta files downloaded from IMG/JGI

2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]

I would like this:

Microbacterium sp. GCS4 : Ga0098755_14

The strings/characters are all different for each header. I found this to try:

sed 's/.*\[\([^]]*\)\].*/\1/g'

It works, but I need to keep the '>' at the start of each header, since that character marks each sequence in the fasta file. Is there an extra pair of parentheses I can add to keep this character alongside my current command?

Cheers in advance!


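Yes, you can use another pair of parentheses to capture the '>' at the beginning of the line. On a Debian system I also have to pass the '-r' option, which switches sed to extended regular expressions so that unescaped parentheses act as capture groups that can be referenced in the replacement.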

echo  '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]'  \
| sed -r 's/^(>)[^]]*\[([^]]*)\].*/\1\2/g'
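To run the same idea over a whole file rather than a single echoed header, something like the sketch below might help. It assumes the taxon is the last bracketed field on each header line; `rename_headers` is a made-up helper name, and the substitution is restricted to lines starting with '>' so sequence lines pass through untouched.

```shell
# rename_headers is a hypothetical helper name, not part of sed.
# It reads FASTA on stdin and writes the rewritten FASTA to stdout.
rename_headers() {
  # /^>/ limits the substitution to header lines.
  # -E selects extended regular expressions (GNU sed also accepts -r).
  # The greedy '.*' means the last [...] group on the line is the one kept.
  sed -E '/^>/ s/^>.*\[([^]]*)\].*$/>\1/'
}

# Example with a single record on stdin:
printf '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]\nATACGACGATCGT\n' | rename_headers
```

On a real file this would be called as `rename_headers < in.fa > out.fa`, with the output going to a different file from the input. Headers without any brackets are left unchanged.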

Using awk?

 awk  '/^>/ {i=index($0,"[");j=index($0,"]");print ">" substr($0,i+1,(j-i)-1); next;} {print;}' in.fa > out.fa

Sorry, that just ended up with a blank file. The file contains 20,000 sequences, not just one, in case that means something else needs to be added.


There is something wrong in the way you are running my awk script; it works on a test with two records:

$ echo -e '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]\nATACGACGATCGT\n>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]\nATACGACGATCGT' | awk '/^>/ {i=index($0,"[");j=index($0,"]");print ">" substr($0,i+1,(j-i)-1); next;} {print;}'

>Microbacterium sp. GCS4 : Ga0098755_14
ATACGACGATCGT
>Microbacterium sp. GCS4 : Ga0098755_14
ATACGACGATCGT
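The thread does not say how the command was actually run, so the following is only a guess at one common cause of an empty output: redirecting to the same file that is being read (`awk ... in.fa > in.fa`) truncates the input before awk sees it. A minimal sketch that avoids this by writing to a temporary file first, and only replaces the original if the number of headers is unchanged, could look like this. `rewrite_fasta_headers` is a hypothetical helper name.

```shell
# rewrite_fasta_headers is a hypothetical helper; its argument is the FASTA path.
rewrite_fasta_headers() {
  in=$1
  tmp=$(mktemp) || return 1
  # Write to a temporary file: redirecting straight back onto "$in" would
  # empty the input before awk reads it.
  awk '/^>/ {i=index($0,"["); j=index($0,"]"); print ">" substr($0,i+1,(j-i)-1); next} {print}' \
    "$in" > "$tmp" || { rm -f "$tmp"; return 1; }
  # Replace the original only if the header count is unchanged.
  if [ "$(grep -c '^>' "$in")" -eq "$(grep -c '^>' "$tmp")" ]; then
    mv "$tmp" "$in"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

It would be called as `rewrite_fasta_headers in.fa`. Comparing `grep -c '^>'` on the input and output is also a quick way to check that all 20,000 records survived.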