Question: edit headers of fasta files
3 months ago, erick_rc93 wrote:

I have a directory of fasta files whose headers look like this:

> ID:WP_070393975.1 | [Moorea producens PAL-8-15-08-1] | PAL-8-15-08-1 | hypothetical protein | 351 | NZ_CP017599(9673108):5662931-5663281:-1 ^^ Moorea producens PAL-8-15-08-1 chromosome, complete genome.

First I wanted only the ID (WP_070393975.1, for example) from each file, which I extracted with the following command:

for file in *.fna; do cut -d '|' -f1 "$file" | grep ">" | sed 's/ID/ /g' | sed 's/[:>]//g' > "${file/.fna/_ids.txt}"; done

This gives me a list like the following. I would like to replace the last digit before ".1" with "[0-9]":

WP_012167065.1 
 WP_015214247.1 
 WP_015083735.1 
 WP_035159822.1 
 WP_096595623.1 
 WP_096613742.1 
 WP_096613838.1 
 WP_096694933.1 
 WP_015201116.1 
 WP_015173923.1 
 ADB95635.1

The desired output is the following list_ids.txt:

 WP_01216706[0-9].1 
 WP_01521424[0-9].1 
 WP_01508373[0-9].1 
 WP_03515982[0-9].1 
 WP_09659562[0-9].1 
 WP_09661374[0-9].1

Then I want to run grep with the following command:

for file in *.gbk; do  cat list_ids.txt | while read line; do grep -B 2  "$line" "$file"; done ; done

I hope you can help me.
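
For that last step, a minimal sketch of an alternative to the nested loop is to hand the whole pattern list to a single grep call with -f. It assumes a grep that supports -B (GNU and BSD grep do) and that mktemp is available; the function name grep_ids is made up for illustration. The list is cleaned first because the lines shown above carry leading and trailing spaces, which "read" strips but "grep -f" would treat as part of the pattern.

```shell
#!/bin/sh
# Search one or more files for any pattern from a list file and print
# each hit together with the 2 lines before it (grep -B 2).
# Usage: grep_ids PATTERN_FILE FILE...
grep_ids() {
    patterns=$1
    shift
    cleaned=$(mktemp) || return 1
    # Strip surrounding blanks and drop empty lines: an empty pattern
    # would match every line of the input.
    sed 's/^[[:space:]]*//; s/[[:space:]]*$//; /^$/d' "$patterns" > "$cleaned"
    grep -B 2 -f "$cleaned" "$@"
    status=$?
    rm -f "$cleaned"
    return "$status"
}

# Example call (run in the directory holding the .gbk files):
# grep_ids list_ids.txt *.gbk
```

Note that the unescaped "." in each pattern still matches any character, exactly as in the original loop; with several input files grep also prefixes each output line with the file name.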


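Just add another sed command to your first long pipe to do something like s/[0-9]\./[0-9]./?

The period in the pattern needs a backslash escape, because an unescaped . matches any character. The square brackets only have a special meaning to sed on the pattern side; in the replacement they are taken literally.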
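
A minimal sketch of such an extra sed step, assuming every ID ends in a period followed by a version number (such as .1) and contains no other period. The period in the pattern is backslash-escaped so that it matches only a literal period. The function name wildcard_last_digit is made up for illustration.

```shell
#!/bin/sh
# Replace the last digit before the version suffix (e.g. ".1") with the
# literal text "[0-9]". Reads IDs on stdin, one per line.
wildcard_last_digit() {
    # Pattern side:     [0-9]\.  = one digit followed by a literal period.
    # Replacement side: [0-9].   = plain text, inserted as written.
    sed 's/[0-9]\./[0-9]./'
}

# Try it on two IDs from the question.
printf '%s\n' 'WP_012167065.1' 'ADB95635.1' | wildcard_last_digit
```

This prints WP_01216706[0-9].1 and ADB9563[0-9].1. In the original loop the function body would simply become one more stage of the pipe, before the redirection to the _ids.txt file.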

written 3 months ago by jrj.healey

output:

$ sed '/>/ s/\..\s|\s.*//1' test.fa
> ID:WP_070393975
atgc
> ID:WP_070393975
tagc

input:

$ cat test.fa
> ID:WP_070393975.1 | [Moorea producens PAL-8-15-08-1] | PAL-8-15-08-1 | hypothetical protein | 351 | NZ_CP017599(9673108):5662931-5663281:-1 ^^ Moorea producens PAL-8-15-08-1 chromosome, complete genome.
atgc
> ID:WP_070393975.1 | [Moorea producens PAL-8-15-08-1] | PAL-8-15-08-1 | hypothetical protein | 351 | NZ_CP017599(9673108):5662931-5663281:-1 ^^ Moorea producens PAL-8-15-08-1 chromosome, complete genome.
tagc
written 3 months ago by cpad0112

That output is not what the OP is looking for, cpad. The digit just before the period needs to become the string '[0-9]', that is all.

written 3 months ago by jrj.healey

jrj.healey, you are right. Amended code below:

$ sed -n '/>/ s/>\s//g;s/.\(.\{2\}\)\s| .*/[0-9]\1/1p' test.fa 

ID:WP_07039397[0-9].1
ID:WP_07039397[0-9].1

input remains the same as OP above.
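
For comparison, here is a sketch of the same idea using only POSIX basic regular expressions (no \s and no numeric flag, which are GNU extensions). It assumes every header starts with ">", optional spaces and "ID:", as in the example header; the function name headers_to_patterns is made up for illustration.

```shell
#!/bin/sh
# Turn FASTA headers such as "> ID:WP_070393975.1 | ..." into grep
# patterns such as "WP_07039397[0-9].1", one per header line.
# Reads the named files, or stdin when no file is given.
headers_to_patterns() {
    # \([^ |]*\)         ID up to, but not including, its last digit
    # [0-9]              the digit to be replaced
    # \(\.[0-9][0-9]*\)  the version suffix, e.g. ".1"
    # -n with the p flag prints only lines where the substitution
    # happened, so sequence lines are dropped.
    sed -n 's/^> *ID:\([^ |]*\)[0-9]\(\.[0-9][0-9]*\).*/\1[0-9]\2/p' "$@"
}

# Try it on a shortened version of the header from the question.
headers_to_patterns <<'EOF'
> ID:WP_070393975.1 | [Moorea producens PAL-8-15-08-1] | hypothetical protein
atgc
EOF
```

This prints WP_07039397[0-9].1. Unlike the command above, it also drops the "ID:" prefix, so the result can go straight into list_ids.txt.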

written 3 months ago by cpad0112
3 months ago, sacha (France) wrote:

I use seqkit for fasta manipulation.

Try selecting and rewriting the fasta headers with seqkit: use its grep and replace commands with a regular expression and capture groups. Something like this:

 seqkit grep -nr -p  "WP_\d+\.\d" test.fa|seqkit replace -p ".+(WP_\d+)\d\.(\d).+" -r '$1[0-9].$2'

output:

 >WP_07039397[0-9].1
 ACGTAA
  • seqkit grep -nr -p "WP_\d+\.\d" test.fa => keep only the records whose header matches WP_xxxxx.x
  • seqkit replace -p ".+(WP_\d+)\d\.(\d).+" -r '$1[0-9].$2' => capture (WP_xxxx) without its last digit, capture the version (x), and replace the header with $1[0-9].$2

sacha: Is this post incomplete?

written 3 months ago by genomax

It was a mistake. I fixed it. Sorry.

written 3 months ago by sacha