Question: Replace names in FASTA file with a known character string from a text file
gravatar for beausoleilmo
2.2 years ago by
McGill University
beausoleilmo230 wrote:

I have a code that is creating me a script where I can change the name of the titles of the sequences in a FASTA file.

This is the text file I'm using:

#Assembly   Genome Center name  RefSeq Accession.version    GenBank Accession.version   NCBI name
GeoFor_1.0  scaffold40  NW_005054297    JH739887    GPS_002009865
GeoFor_1.0  scaffold112 NW_005054298    JH739888    GPS_002009866
GeoFor_1.0  scaffold41  NW_005054299    JH739889    GPS_002009867
GeoFor_1.0  scaffold130 NW_005054300    JH739890    GPS_002009868
GeoFor_1.0  scaffold54  NW_005054301    JH739891    GPS_002009869
GeoFor_1.0  scaffold16  NW_005054302    JH739892    GPS_002009870

This is the FASTA file that I'm using and that I want to change the names. AS you can see, I want to find the scaffold names that match the different JH######.

>Scaffold410    275
>Scaffold430    374
>Scaffold1010   597

Thesis the command I use to create a script to change the names

tail -n +2 scaffold_names_2.txt | while read assemb gcenter refseq genbank ncbi; do echo -ne "sed 's/[[:<:]]$gcenter[[:>:]]/$genbank/g' | " >>; done

The problem is that I'm not able to save the fasta file with the new names.

This is the last line of my script:

... sed 's/[[:<:]]scaffold4469[[:>:]]/JH767125/g'  name.fasta.fa

The script is running without error, but it's doing nothing.

Do you know why? How can I change all the titles and save it as a new fasta file with another name ?

fasta • 1.5k views
ADD COMMENTlink modified 2.2 years ago by shenwei3564.4k • written 2.2 years ago by beausoleilmo230
gravatar for shenwei356
2.2 years ago by
shenwei3564.4k wrote:

It's very simple with SeqKit, a cross-platform and ultrafast toolkit for FASTA/Q file manipulation.

Just one command:

cat scaffolds.fasta | seqkit replace --ignore-case --kv-file <(cut -f 2,4 table.tsv) --pattern "^(\w+)" --replacement "{kv}"


SeqKit subcommand replace can find substrings with regular expression and replace them with strings or corresponding values of found substrings provided by the tab-delimited key-value file.

Therefore, the key-value data are firstly extracted by cut -f 2,4 table.tsv

$ cut -f 2,4 table.tsv
Genome  name
scaffold410     JH739887
scaffold430     JH739888

Then seqkit replace command is used to replace FASTA headers. --pattern "^(\w+)" refers to the to be replaced substrings (e.g., Scaffold410), and the special placeholder symbols {kv} represents the corresponding value of the key.

BTW, if you need similar manipulations of CSV/TSV files, csvtk provides similar functions and features

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by shenwei3564.4k

The only problem here is that you are not getting the same format as the original fasta file.

cat input.fa | seqkit replace --ignore-case --kv-file  <(cut -f 2,4 scaffold_names.txt) --pattern "^(\w+)" --replacement "{kv}" | cat -n

will give you something like this (see the line number):

[INFO] read key-value file: /dev/fd/63
[INFO] 27240 pairs of key-value loaded
     1  >JH761601   275
     7  >JH758574   374

Whereas the original file is organized like this:

 1  >Scaffold410    275
ADD REPLYlink written 2.2 years ago by beausoleilmo230

The solution to this is that:

cat input.fa | seqkit replace --ignore-case --keep-key --line-width 0 --kv-file \
 <(cut -f 2,4 text_repalce.tsv) --pattern "^(\w+)" --replacement "{kv}"

--line-width 0 will not wrap the text!

--keep-key will keep a string that wasn't matched

See seqkit replace -h

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by beausoleilmo230

glad you found it!!!

ADD REPLYlink written 2.2 years ago by shenwei3564.4k

Hi all! I'm having sort of the same problem, but in my case I have keys that must only partially match the name in the fasta file, since the fasta include also numbers, and I need to match by the species code. Example fasta


Example key

ENSETEP    Echinops_telfairi

I tried several things, from including special characters (* and .*) in the key to playing around with the "^(\w+)" expression, in which I suspect the problem lies, to no avail. I'm sure the solution it's close, but I haven't got it yet. Any clues? P.S. It correctly loads the kv, yet it cannot match anything in the fasta file.

Thank you!

ADD REPLYlink modified 13 months ago • written 13 months ago by Solowars40

Also, the info lines

[INFO] read key-value file: /dev/fd/63
[INFO] 27240 pairs of key-value loaded

won't be printed if you decide to save the files! Super handy trick!

ADD REPLYlink written 2.2 years ago by beausoleilmo230
gravatar for Pierre Lindenbaum
2.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum116k wrote:

a one-liner, assuming all delimiters are tabulations:

join  -t $'\t' -1 2 -2 1 \
<(cat table.txt | sort -t $'\t' -k2,2) \
<(cat input.fasta | paste - - | cut -c 2- | sed 's/Scaffold/scaffold/' |  sort -t $'\t' -k1,1) | \
awk -F '\t' '{printf(">%s\t%s\n%s\n",$4,$6,$7);}'
ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Pierre Lindenbaum116k

Can you explain me what is you command doing?

You are basically matching field 2 from file 1 and field 1 from file 2. Inputing the text from a text file sorted by field 2 to 2! Pasting 2 columns - -

Join needs -?

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by beausoleilmo230
(cat table.txt | sort -t $'\t' -k2,2)

sort table on column 2 (scaffold)

cat input.fasta | paste - - | cut -c 2- | sed 's/Scaffold/scaffold/' |  sort -t $'\t' -k1,1

linearize fasta (two lines) ; remove first char ('>') | convert 'Scaffold' to lower case; sort on scaffold

join  -t $'\t' -1 2 -2 1 \

join the processes above using column 2 and 1 and tab as delimiter

  awk -F '\t' '{printf(">%s\t%s\n%s\n",$4,$6,$7);}'

convert back to fasta

ADD REPLYlink written 2.2 years ago by Pierre Lindenbaum116k


I cross validated with a grep by matching the sequence of a scaffold. It worked! scaffold1000 => JH755378


Just for the record!

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by beausoleilmo230

I don't know how we could do that but I lost the information or the options regarding the scaffolds.

head Geospiza_fortis.scaf.noBacterial.fa.fai # (before)
# Scaffold410       275     17      275     276
# Scaffold430       374     310     374     375
# Scaffold1010      597     703     597     598
head Geospiza_fortis.scaf.noBacterial.fa.fai # (after your code...)
# JH739910  0   8342783     -1  -1
# JH739916  0   15380221    -1  -1
# JH739940  0   20636077    -1  -1
# JH755378  0   20636606    -1  -1

This is creating the error: [mplp_func] Skipping because 16632861 is outside of 1 [ref:0] when using mpileup.

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by beausoleilmo230

There is an error with the awk command:

awk -F '\t' '{printf(">%s\t%s\n%s\n",$4,$6,$7);}'

is printing something like this:


Where it should print something like this:

>JH750237   468
>JH754437   654

So it's not taking the right column. There is a problem in the split of the scaffold and the number after it...

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by beausoleilmo230
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 815 users visited in the last hour