Question: Rename the fasta entries in Unix or R
horsedog wrote:

I'd like to change the entries of each fasta file

from:

gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome

to:

Escherichia_coli_str._K-12_substr._MG1655

In other words, I want to remove the accession number and keep only the species name, with every space replaced by an underscore. Either R or Unix is fine.

Thank you very much.


Always mention what you've tried. Your question suggests that you just want an answer and are not interested in learning how to get there, which is not how anyone should approach this.

Reply written 9 weeks ago by Ram (12k)

Macspider wrote:

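I would strongly suggest using bioawk for these operations; it is really handy. With -c fastx, $name holds the header only up to the first whitespace and the remainder goes into $comment, so the two are rejoined before splitting on "|":

bioawk -c fastx '{h = $name " " $comment; split(h, a, "|"); s = a[5]; sub(/,.*/, "", s); sub(/ +$/, "", s); gsub(/ /, "_", s); print ">" s "\n" $seq}' file.fa

This should do it. Have a look at "install bioawk in unix system" for installation.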

Sej Modha wrote:

Simple bash solution:

cat file.fa | awk -F'[|,]' '{print $1$5}' | sed -e 's/ /_/g' -e 's/^>gi/>/'

awk -F '[>|,]' 'NF>1 {gsub(" ","_",$6); print ">"$6; next} NF' test1.fa

input:

$ cat test1.fa 
>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
>gi|556503835|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
Reply written 9 weeks ago by cpad0112 (3.1k)

Pierre Lindenbaum wrote:
awk -F '|' '/^>/ {s=$5; gsub(/,.*/,"",s);gsub(/ /,"_",s); printf(">%s\n",s);next;} {print;}' input.fa

Example:

~$ echo -e '>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome\nATGC' | awk -F '|' '/^>/ {s=$5; gsub(/,.*/,"",s);gsub(/ /,"_",s); printf(">%s\n",s);next;} {print;}'
>Escherichia_coli_str._K-12_substr._MG1655
ATGC
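
The one-liner above assumes every header has the five pipe-delimited fields of the gi|...|ref|...| format. A minimal sketch of a guarded variant follows, assuming a POSIX shell and awk; the function name rename_gi_headers is hypothetical. Headers with fewer than five fields are passed through unchanged instead of being reduced to a bare ">".

```shell
# Hypothetical helper: rename only headers that look like gi|...|ref|...|description.
rename_gi_headers() {
  awk -F '|' '
    /^>/ && NF >= 5 {
      s = $5                # fifth field: the description
      sub(/,.*/, "", s)     # drop everything from the first comma on
      gsub(/ /, "_", s)     # spaces to underscores
      print ">" s
      next
    }
    { print }               # sequence lines and other headers: unchanged
  ' "$1"
}
```

Usage would be along the lines of: rename_gi_headers input.fa > renamed.fa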

Jacob Warner wrote:

Adding an R solution for people who hate the speed of awk!

library(Biostrings)
library(dplyr)

fasta <- readDNAStringSet(filepath = 'test.fa', format="fasta")
names(fasta)
##[1] "gi|556503834|ref|NC_000913.3|Escherichia coli str. blah blah FIRST SEQ" 
##[2] "gi|556503834|ref|NC_000913.3|Escherichia coli str. blah blah SECOND SEQ"
##[3] "gi|556503834|ref|NC_000913.3|Escherichia coli str. blah blah THIRD SEQ"

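names(fasta) <- 
  names(fasta) %>%
  strsplit(., split="|",fixed=TRUE) %>%   # split each header on the literal "|"
  sapply(., '[', 5) %>%                   # keep the fifth field (the description)
  sub(",.*", "", .) %>%                   # drop any trailing ", complete genome"
  gsub(" ", "_",.)                        # replace spaces with underscores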

names(fasta)
##[1] "Escherichia_coli_str._blah_blah_FIRST_SEQ" 
##[2] "Escherichia_coli_str._blah_blah_SECOND_SEQ"
##[3] "Escherichia_coli_str._blah_blah_THIRD_SEQ" 

writeXStringSet(fasta, filepath = 'test_EDITED.fa',format="fasta")

"for people who hate the speed of awk"

dat sarcasm tho :D

Reply written 9 weeks ago by Macspider (1.6k)

Another R solution for test.fa:

test.fa (the header is repeated to show that the script is general and works on FASTA files with multiple sequences):

$ cat test.fa 
>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
>gi|556503835|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
GTCTGG

R code:

library(Biostrings)
library(stringr)
fasta <- readDNAStringSet(filepath = 'test.fa', format="fasta")
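desc <- str_split_fixed(names(fasta), "\\|", 5)[, 5]   # text after the fourth "|"
desc <- str_split_fixed(desc, ",", 2)[, 1]             # keep the part before the first comma
names(fasta) <- gsub(" ", "_", desc)                   # spaces to underscores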
writeXStringSet(fasta, filepath = 'test_edited.fa',format="fasta")
Reply written 9 weeks ago by cpad0112 (3.1k)

jrj.healey wrote:

My brain isn't functioning well enough to make one regex out of this, but it's basically just two string removals and a transliteration (whitespace to underscore):

$ echo "gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome" | sed -e 's/.*|//' -e 's/,.*//' | tr ' ' '_'

Yields

Escherichia_coli_str._K-12_substr._MG1655

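Obviously, just change echo to cat if you're dealing with a file, but note that on a real FASTA header the first substitution also strips the leading ">", so use 's/.*|/>/' there instead.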
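
A minimal sketch of that file-based variant, assuming a POSIX shell and sed; the function name rename_fasta_headers is hypothetical. Each substitution is limited to lines starting with ">", so sequence lines pass through unchanged.

```shell
# Hypothetical helper: apply the two removals and the space-to-underscore step
# to header lines only, and print the result to stdout.
rename_fasta_headers() {
  sed -e '/^>/ s/.*|/>/' \
      -e '/^>/ s/,.*//' \
      -e '/^>/ s/ /_/g' "$1"
}
```

For several files, something like this loop should work: for f in *.fa; do rename_fasta_headers "$f" > "${f%.fa}.renamed.fa"; done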

cpad0112 wrote:
$ cat test.fa 
>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
>gi|556503835|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT

code and output:

$ sed -E -e '/^>/ s/.*\|(.*),.*/>\1/' -e 's/ /_/g' test.fa 
>Escherichia_coli_str._K-12_substr._MG1655
ATCGT
>Escherichia_coli_str._K-12_substr._MG1655
ATCGT

To show that the script works on FASTA files with one or more sequences, I copied the same sequence twice.
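
Both records in the output above end up with the same header. Some downstream tools expect unique sequence names, so it may be worth checking the renamed file for repeats. A minimal sketch, assuming a POSIX shell with grep, sort and uniq; the function name duplicate_headers is hypothetical.

```shell
# Hypothetical helper: print each header line that occurs more than once.
duplicate_headers() {
  grep '^>' "$1" | sort | uniq -d
}
```

An empty result means every header in the file is unique.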


Close, but you're missing the transliteration from space to underscore the OP wants ;)

Reply written 9 weeks ago by jrj.healey (2.9k)

Thanks, I've updated the code.

Reply written 9 weeks ago by cpad0112 (3.1k)