Question: Alternative problem with editing fasta file headers - keeping Taxon name in brackets
3.4 years ago, Micro_Warwick wrote:

Hi

I have the following headers for my fasta files downloaded from IMG/JGI

2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]

I would like this:

Microbacterium sp. GCS4 : Ga0098755_14

The strings/characters are all different for each header. I found this to try:

sed 's/.*\[\([^]]*\)\].*/\1/g'

It works, but it also strips the '>' at the start of each header, which I need to keep to mark each sequence in the fasta file. Is there another pair of parentheses I can add to keep this character with my current command?

Cheers in advance!


Yes, you can use another pair of parentheses to capture the '>' at the beginning of the line. On a Debian system I also have to use the '-r' option (extended regular expressions) so that the unescaped parentheses are treated as capture groups.

echo  '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]'  \
| sed -r 's/^(>)[^]]*\[([^]]*)\].*/\1\2/g'
written 3.4 years ago by piet
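
To apply the same idea to a whole file rather than to one echoed line, the substitution can be restricted to header lines. The following is a minimal sketch, not a tested recipe for IMG/JGI exports: the file names in.fa and out.fa are placeholders, the sample record is the header from the question, and it assumes each header contains at least one [...] pair (if there are several, the last one is kept). It uses -E, which GNU sed accepts as a synonym for -r.

```shell
# Placeholder input file containing the sample header from the question.
printf '%s\n' \
  '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]' \
  'ATACGACGATCGT' > in.fa

# Address /^>/ limits the substitution to header lines.
# The leading '>' is written back literally; \1 is the text inside the brackets.
sed -E '/^>/ s/^>.*\[([^]]*)\].*/>\1/' in.fa > out.fa

cat out.fa
```

Sequence lines never match /^>/, so they pass through unchanged.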
3.4 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Using awk?

 awk  '/^>/ {i=index($0,"[");j=index($0,"]");print ">" substr($0,i+1,(j-i)-1); next;} {print;}' in.fa > out.fa

Sorry, that just ended up with a blank file. The file contains 20,000 sequences, not just one, in case that means something else needs to be added.

written 3.4 years ago by Micro_Warwick

There is something wrong in the way you are running my awk script. It works with more than one sequence:

$ echo -e '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]\nATACGACGATCGT\n>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]\nATACGACGATCGT' | awk '/^>/ {i=index($0,"[");j=index($0,"]");print ">" substr($0,i+1,(j-i)-1); next;} {print;}'

>Microbacterium sp. GCS4 : Ga0098755_14
ATACGACGATCGT
>Microbacterium sp. GCS4 : Ga0098755_14
ATACGACGATCGT
written 3.4 years ago by Pierre Lindenbaum
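
The thread does not say which command produced the blank file, so the following is only a guess at one common cause: if the output redirection points at the same file that awk reads (for example `awk '...' in.fa > in.fa`), the shell truncates the file before awk starts and the result is empty. A minimal sketch that writes to a separate file and then checks that the number of headers is unchanged; seqs.fa and seqs.renamed.fa are placeholder names, and the records are the two test records used above.

```shell
# Placeholder input: the two records from the test above.
printf '%s\n' \
  '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]' \
  'ATACGACGATCGT' \
  '>2648318750 Ga0098755_14192 DNA gyrase subunit B [Microbacterium sp. GCS4 : Ga0098755_14]' \
  'ATACGACGATCGT' > seqs.fa

# Same awk program as in the answer; the output goes to a DIFFERENT file than the input.
awk '/^>/ {i=index($0,"["); j=index($0,"]"); print ">" substr($0,i+1,(j-i)-1); next} {print}' \
  seqs.fa > seqs.renamed.fa

# Sanity check: the header count should be the same before and after.
before=$(grep -c '^>' seqs.fa)
after=$(grep -c '^>' seqs.renamed.fa)
echo "headers before: $before, after: $after"
```

Once the counts match, the new file can replace the original, for example with `mv seqs.renamed.fa seqs.fa`.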