Question: Rename the fasta entries in Unix or R
0
9 months ago by
horsedog30
horsedog30 wrote:

I'd like to change the header of each entry in my FASTA files

from:

gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome

to:

Escherichia_coli_str._K-12_substr._MG1655

That is, I want to remove the accession number and the trailing ", complete genome", keeping only the species name, with every space replaced by an underscore. Either R or Unix is fine.

Thank you very much.

R genome • 496 views
ADD COMMENTlink modified 9 months ago by cpad01126.3k • written 9 months ago by horsedog30
2

Always mention what you've tried. Your question suggests that you just want an answer and are not interested in learning how to get there, which is not how anyone should approach this.

ADD REPLYlink written 9 months ago by Ram15k
1
9 months ago by
Macspider2.4k
Vienna - BOKU
Macspider2.4k wrote:

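I would strongly suggest using bioawk for these operations. It is really handy.

bioawk -c fastx '{h=$name" "$comment; sub(/^.*\|/,"",h); sub(/,.*/,"",h); gsub(/ /,"_",h); print ">"h"\n"$seq}' file.fa

This should do it. bioawk splits the header at the first whitespace into $name and $comment, so the two are rejoined before the accession prefix and the ", complete genome" suffix are trimmed. Have a look at "install bioawk in unix system" for setup.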

ADD COMMENTlink modified 9 months ago • written 9 months ago by Macspider2.4k
1
9 months ago by
Sej Modha2.9k
Glasgow, UK
Sej Modha2.9k wrote:

Simple bash solution:

awk -F'[|,]' '/^>/ {print ">" $5; next} {print}' file.fa | sed '/^>/ s/ /_/g'
ADD COMMENTlink written 9 months ago by Sej Modha2.9k
awk  -F '[/^>|,]' 'NF>1{gsub(" ","_",$6);print ">"$6} {print $1}'  test1.fa | awk NF

input:

$ cat test1.fa 
>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
>gi|556503835|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
ADD REPLYlink modified 9 months ago • written 9 months ago by cpad01126.3k
1
9 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum108k wrote:
awk -F '|' '/^>/ {s=$5; gsub(/,.*/,"",s);gsub(/ /,"_",s); printf(">%s\n",s);next;} {print;}' input.fa

ex:

~$ echo -e '>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome\nATGC' | awk -F '|' '/^>/ {s=$5; gsub(/,.*/,"",s);gsub(/ /,"_",s); printf(">%s\n",s);next;} {print;}'
>Escherichia_coli_str._K-12_substr._MG1655
ATGC
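
Since the question mentions several fasta files, here is a minimal sketch of how the awk one-liner above could be wrapped and run over a whole directory. The function names rename_headers and rename_all and the *.fa extension are my own assumptions, not something from the thread, and the awk program is the same logic as above, only spread over several lines.

```shell
#!/bin/sh
# Hypothetical helper: same awk logic as the one-liner, applied to one file.
rename_headers() {
    awk -F '|' '
        /^>/ {
            s = $5                 # fifth pipe-delimited field holds the description
            sub(/,.*/, "", s)      # drop everything from the first comma onwards
            gsub(/ /, "_", s)      # spaces to underscores
            print ">" s
            next
        }
        { print }                  # sequence lines pass through unchanged
    ' "$1"
}

# Hypothetical batch wrapper: assumes inputs are named *.fa inside $1,
# and writes renamed copies with the same file names into $2.
rename_all() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for f in "$in_dir"/*.fa; do
        [ -e "$f" ] || continue    # skip when the glob matched nothing
        rename_headers "$f" > "$out_dir/$(basename "$f")"
    done
}

# Usage sketch (directory names are placeholders):
# rename_all ./genomes ./renamed
```

Writing to a separate output directory keeps the original files untouched, so the result can be checked before anything is overwritten.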
ADD COMMENTlink modified 9 months ago • written 9 months ago by Pierre Lindenbaum108k
1
9 months ago by
Jacob Warner600
Jacob Warner600 wrote:

Adding an R solution for people who hate the speed of awk!

library(Biostrings)
library(dplyr)

fasta <- readDNAStringSet(filepath = 'test.fa', format="fasta")
names(fasta)
##[1] "gi|556503834|ref|NC_000913.3|Escherichia coli str. blah blah FIRST SEQ" 
##[2] "gi|556503834|ref|NC_000913.3|Escherichia coli str. blah blah SECOND SEQ"
##[3] "gi|556503834|ref|NC_000913.3|Escherichia coli str. blah blah THIRD SEQ"

names(fasta) <- 
  names(fasta) %>%
  strsplit(., split="|",fixed=TRUE) %>%
  sapply(., '[', 5) %>%
  gsub(" ", "_",.)

names(fasta)
##[1] "Escherichia_coli_str._blah_blah_FIRST_SEQ" 
##[2] "Escherichia_coli_str._blah_blah_SECOND_SEQ"
##[3] "Escherichia_coli_str._blah_blah_THIRD_SEQ" 

writeXStringSet(fasta, filepath = 'test_EDITED.fa',format="fasta")
ADD COMMENTlink written 9 months ago by Jacob Warner600
2

for people who hate the speed of awk

dat sarcasm tho :D

ADD REPLYlink written 9 months ago by Macspider2.4k

Another R solution for test.fa:

test.fa contains two entries with the same description, to show that the script is general and works on a fasta file with multiple sequences:

$ cat test.fa 
>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
>gi|556503835|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
GTCTGG

R code:

library(Biostrings)
library(stringr)
fasta <- readDNAStringSet(filepath = 'test.fa', format="fasta")
names(fasta)=gsub(" ","_",str_split_fixed(str_split_fixed(names(fasta),"\\|",5)[,5],",",2)[,1])
writeXStringSet(fasta, filepath = 'test_edited.fa',format="fasta")
ADD REPLYlink modified 9 months ago • written 9 months ago by cpad01126.3k
0
9 months ago by
jrj.healey4.6k
United Kingdom
jrj.healey4.6k wrote:

Brain isn't functioning well enough to make one regex out of this, but it's basically just 2 string removals and a transliteration (whitespace to underscore).

$ echo "gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome" | sed -e 's/.*|//' -e 's/,.*//' | tr ' ' '_'

Yields

Escherichia_coli_str._K-12_substr._MG1655

For a file, swap echo for cat, but note that the first substitution also strips the leading ">" from header lines; use 's/.*|/>/' in its place to keep it.
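
For what it's worth, the same three steps can be folded into a single sed call that only touches header lines. This is a rough sketch rather than a tested-everywhere recipe: clean_headers is a made-up function name, and it assumes the description is whatever follows the last "|" on the header line.

```shell
#!/bin/sh
# Hypothetical helper: the two removals plus the space-to-underscore step,
# restricted to header lines so sequence lines are never modified.
clean_headers() {
    sed -e '/^>/!b' \
        -e 's/^>.*|/>/' \
        -e 's/,.*//' \
        -e 's/ /_/g' "$1"
}

# Usage sketch (file names are placeholders):
# clean_headers test.fa > test_renamed.fa
```

The first expression skips the remaining commands for any line that does not start with ">", which is what keeps the sequence lines safe and lets the header keep its ">".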

ADD COMMENTlink modified 9 months ago • written 9 months ago by jrj.healey4.6k
0
9 months ago by
cpad01126.3k
cpad01126.3k wrote:
$ cat test1.fa
>gi|556503834|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT
>gi|556503835|ref|NC_000913.3|Escherichia coli str. K-12 substr. MG1655, complete genome
ATCGT

code and output:

$ sed -re '/>/ s/.*\|(.*),.*/>\1/' -e 's/ /_/g' test1.fa 
>Escherichia_coli_str._K-12_substr._MG1655
ATCGT
>Escherichia_coli_str._K-12_substr._MG1655
ATCGT

To show that the script is general and works on a fasta file with one or more sequences, I copy/pasted the same sequence twice.
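
One side effect visible in the output above: once the accession is dropped, both entries end up with the same header. Some downstream tools may not accept repeated names, so it could be worth checking for them after renaming. A small sketch, where duplicate_headers is a hypothetical helper name and not part of the answer:

```shell
#!/bin/sh
# Hypothetical helper: print every header that occurs more than once in a FASTA file.
duplicate_headers() {
    grep '^>' "$1" | sort | uniq -d
}

# Usage sketch (renamed.fa is a placeholder file name):
# duplicate_headers renamed.fa
```

An empty result means every header in the file is unique.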

ADD COMMENTlink modified 9 months ago • written 9 months ago by cpad01126.3k

Close, but you're missing the transliteration from space to underscore the OP wants ;)

ADD REPLYlink written 9 months ago by jrj.healey4.6k

Thanks, I've updated the code.

ADD REPLYlink written 9 months ago by cpad01126.3k