Replace names in FASTA file with a known character string from a text file
2
3
Entering edit mode
7.4 years ago
beausoleilmo ▴ 580

I have a code that is creating me a script where I can change the name of the titles of the sequences in a FASTA file.

This is the text file I'm using:

#Assembly   Genome Center name  RefSeq Accession.version    GenBank Accession.version   NCBI name
GeoFor_1.0  scaffold40  NW_005054297    JH739887    GPS_002009865
GeoFor_1.0  scaffold112 NW_005054298    JH739888    GPS_002009866
GeoFor_1.0  scaffold41  NW_005054299    JH739889    GPS_002009867
GeoFor_1.0  scaffold130 NW_005054300    JH739890    GPS_002009868
GeoFor_1.0  scaffold54  NW_005054301    JH739891    GPS_002009869
GeoFor_1.0  scaffold16  NW_005054302    JH739892    GPS_002009870

This is the FASTA file that I'm using and that I want to change the names. AS you can see, I want to find the scaffold names that match the different JH######.

>Scaffold410    275
TGCATTAATATGAGTGTGTGCTGCAAAAGTTCAGGTCATGGTCCGATCATACTTCACATTTTGGTAGCACTTTAAGCAGAGATCGGTTATCCCATTCTGTGGAAGACTCAACACTATCATAAGGTCCCACAGTTTTATTATCCCTCTGCCTCCCGGAATGCCCCCGGCAGTGAGGGGTACCATCTTCTCAGCAGTAAGGATATTCTTCAGGAGTTCCGTGTGAGCTTTCCCGGATTTAGTTCCATTTTTTAAATACTTCCCAATTCTTTGCTTTG
>Scaffold430    374
CTTTGTTAACTGAAAGAGCCTCTAAGTAGATGACCAGTGCTCAGTTAGTACAGTATGAATTTTGTTTAATGGAACAGGAAGATTTAGTATTGAGAAGCGGTTAAGGGTTTAACCCAGCCTCCTGTCTGAATGGACCTGAAGAGGGGGGCCGGGAAGAAACCCATGACTGCATTAAAGTGATAGATCTCCAGACATGGGCTAGGGAAGATTTACAAGACACTCCCTGGCCTGAGGGAGAAAATATGTTTATTGATGAGTCTTCAAGGGTGGCAGAAGGGAAGCGATTTACAGGATACACAATCATTAATGGAAGGAAATTAAAGGAAGGGGGGAGATTGTCACCCACCTGGTCAGTTCAGACAGCAGAGCTGTAT
>Scaffold1010   597
GGAACACACCTGGGCACACCTGGATGGAGCAGGAACACACCTGGATGGGGTTAGGACACATCTGGATGGCGTTGGGACACACCTGGATGCGCTCAGGGTACACCTG...

Thesis the command I use to create a script to change the names

tail -n +2 scaffold_names_2.txt | while read assemb gcenter refseq genbank ncbi; do echo -ne "sed 's/[[:<:]]$gcenter[[:>:]]/$genbank/g' | " >>script.sh; done

The problem is that I'm not able to save the fasta file with the new names.

This is the last line of my script:

... sed 's/[[:<:]]scaffold4469[[:>:]]/JH767125/g'  name.fasta.fa

The script is running without error, but it's doing nothing.

Do you know why? How can I change all the titles and save it as a new fasta file with another name ?

FASTA • 9.7k views
ADD COMMENT
5
Entering edit mode
7.4 years ago

It's very simple with SeqKit, a cross-platform and ultrafast toolkit for FASTA/Q file manipulation.

Just one command:

cat scaffolds.fasta | seqkit replace --ignore-case --kv-file <(cut -f 2,4 table.tsv) --pattern "^(\w+)" --replacement "{kv}"

Explanation:

SeqKit subcommand replace can find substrings with regular expression and replace them with strings or corresponding values of found substrings provided by the tab-delimited key-value file.

Therefore, the key-value data are firstly extracted by cut -f 2,4 table.tsv

$ cut -f 2,4 table.tsv
Genome  name
scaffold410     JH739887
scaffold430     JH739888

Then seqkit replace command is used to replace FASTA headers. --pattern "^(\w+)" refers to the to be replaced substrings (e.g., Scaffold410), and the special placeholder symbols {kv} represents the corresponding value of the key.

BTW, if you need similar manipulations of CSV/TSV files, csvtk provides similar functions and features

ADD COMMENT
0
Entering edit mode

The only problem here is that you are not getting the same format as the original fasta file.

cat input.fa | seqkit replace --ignore-case --kv-file  <(cut -f 2,4 scaffold_names.txt) --pattern "^(\w+)" --replacement "{kv}" | cat -n

will give you something like this (see the line number):

[INFO] read key-value file: /dev/fd/63
[INFO] 27240 pairs of key-value loaded
     1  >JH761601   275
     2  TGCATTAATATGAGTGTGTGCTGCAAAAGTTCAGGTCATGGTCCGATCATACTTCACATT
     3  TTGGTAGCACTTTAAGCAGAGATCGGTTATCCCATTCTGTGGAAGACTCAACACTATCAT
     4  AAGGTCCCACAGTTTTATTATCCCTCTGCCTCCCGGAATGCCCCCGGCAGTGAGGGGTAC
     5  CATCTTCTCAGCAGTAAGGATATTCTTCAGGAGTTCCGTGTGAGCTTTCCCGGATTTAGT
     6  TCCATTTTTTAAATACTTCCCAATTCTTTGCTTTG
     7  >JH758574   374
     8  CTTTGTTAACTGAAAGAGCCTCTAAGTAGATGACCAGTGCTCAGTTAGTACAGTATGAAT
     9  TTTGTTTAATGGAACAGGAAGATTTAGTATTGAGAAGCGGTTAAGGGTTTAACCCAGCCT

Whereas the original file is organized like this:

 1  >Scaffold410    275
 2  TGCATTAATATGAGTGTGTGCTGCAAAAGTTCAGGTCATGGTCCGATCATACTTCACATTTTGGTAGCACTTTAAGCAGAGATCGGTTATCCCATTCTGTGGAAGACTCAACACTATCATAAGGTCCCACAGTTTTATTATCCCTCTGCCTCCCGGAATGCCCCCGGCAGTGAGGGGTACCATCTTCTCAGCAGTAAGGATATTCTTCAGGAGTTCCGTGTGAGCTTTCCCGGATTTAGTTCCATTTTTTAAATACTTCCCAATTCTTTGCTTTG
ADD REPLY
3
Entering edit mode

The solution to this is that:

cat input.fa | seqkit replace --ignore-case --keep-key --line-width 0 --kv-file \
 <(cut -f 2,4 text_repalce.tsv) --pattern "^(\w+)" --replacement "{kv}"

--line-width 0 will not wrap the text!

--keep-key will keep a string that wasn't matched

See seqkit replace -h

ADD REPLY
0
Entering edit mode

glad you found it!!!

ADD REPLY
0
Entering edit mode

Hi all! I'm having sort of the same problem, but in my case I have keys that must only partially match the name in the fasta file, since the fasta include also numbers, and I need to match by the species code. Example fasta

>ENSETEP00040504
ATGGTCAGCTAGCACGAGGCAGGAGCGAGAGCAGAG...
...

Example key

ENSETEP    Echinops_telfairi
...

I tried several things, from including special characters (* and .*) in the key to playing around with the "^(\w+)" expression, in which I suspect the problem lies, to no avail. I'm sure the solution it's close, but I haven't got it yet. Any clues? P.S. It correctly loads the kv, yet it cannot match anything in the fasta file.

Thank you!

ADD REPLY
1
Entering edit mode

Also, the info lines

[INFO] read key-value file: /dev/fd/63
[INFO] 27240 pairs of key-value loaded

won't be printed if you decide to save the files! Super handy trick!

ADD REPLY
3
Entering edit mode
7.4 years ago

a one-liner, assuming all delimiters are tabulations:

join  -t $'\t' -1 2 -2 1 \
<(cat table.txt | sort -t $'\t' -k2,2) \
<(cat input.fasta | paste - - | cut -c 2- | sed 's/Scaffold/scaffold/' |  sort -t $'\t' -k1,1) | \
awk -F '\t' '{printf(">%s\t%s\n%s\n",$4,$6,$7);}'
ADD COMMENT
0
Entering edit mode

Can you explain me what is you command doing?

You are basically matching field 2 from file 1 and field 1 from file 2. Inputing the text from a text file sorted by field 2 to 2! Pasting 2 columns - -

Join needs -?

ADD REPLY
1
Entering edit mode
(cat table.txt | sort -t $'\t' -k2,2)

sort table on column 2 (scaffold)

cat input.fasta | paste - - | cut -c 2- | sed 's/Scaffold/scaffold/' |  sort -t $'\t' -k1,1

linearize fasta (two lines) ; remove first char ('>') | convert 'Scaffold' to lower case; sort on scaffold

join  -t $'\t' -1 2 -2 1 \

join the processes above using column 2 and 1 and tab as delimiter

  awk -F '\t' '{printf(">%s\t%s\n%s\n",$4,$6,$7);}'

convert back to fasta

ADD REPLY
0
Entering edit mode

Thanks!!!

I cross validated with a grep by matching the sequence of a scaffold. It worked! scaffold1000 => JH755378

grep -A 10 -B 10 AAGCTTTGCAGAAATAAAAGAAATTTGTTAAGCGGCAAGATCGTATGGAAGAGAGTCAGCATATTTCAGAGATCTATTACAAGCCACCTTTTCAGCCAATGTATTCACCCCTCATGACCTAAGGCAACTTTGTACTACATCTTTGCTGTCTCCGACAGGGTATTCTCTGTGGGAGCTGGCATGGAAGCGTTTATTAAATGACCTTTTAAGAGGTTATGCCAGAACTCCAGCCACTGCAAATATAACTCAAGTTCAGCTTGCTGGGGAAGGAGCCTACATAGATCCACAAAGACAGGCAAGGGCTCCTAATAGACGAATCCTTCAGGATATTAAGGAAGCAGCAAGGACTGCCCTGCTCCAGGTGCCTGATGGGAATAAACCAAAATTAGACTTCACAGAAATTAAACAAGGGATGGATGAGCCATACTCTGTGACGGTGGTCACAGGGGTCTTAGGATGAGGGAAGAGACGTAGATCTGACTGCATGTTACAGAAGGCTTGATTTATCATTTTATGA Geospiza_fortis.scaf.noBacterial.fa

Just for the record!

ADD REPLY
0
Entering edit mode

I don't know how we could do that but I lost the information or the options regarding the scaffolds.

head Geospiza_fortis.scaf.noBacterial.fa.fai # (before)
# Scaffold410       275     17      275     276
# Scaffold430       374     310     374     375
# Scaffold1010      597     703     597     598
head Geospiza_fortis.scaf.noBacterial.fa.fai # (after your code...)
# JH739910  0   8342783     -1  -1
# JH739916  0   15380221    -1  -1
# JH739940  0   20636077    -1  -1
# JH755378  0   20636606    -1  -1

This is creating the error: [mplp_func] Skipping because 16632861 is outside of 1 [ref:0] when using mpileup.

ADD REPLY
0
Entering edit mode

There is an error with the awk command:

awk -F '\t' '{printf(">%s\t%s\n%s\n",$4,$6,$7);}'

is printing something like this:

>JH750237   GGCTTCCCTGTCCCCACTTTGGGGCTCACCTGAACCCACCTGTACTCACCTGGGGCTTCCCTGAGCCCATCTGAGCTCCCCTTTCTCCCCTTTGGGGTTCCCTGAACCCACCTGGGCTCACCTGAGCCCACCTGGGGGTTCCCTGAGCTCACCTGAGCGTTACCTGTCCCCCCATTGGGGTTCTCTGAGCTCACCTGAACCCACCTGAGCTCACCTGGGGCTTCCCTGTCCCCACTTTGGGGCTCTCTGAACCCACCTGAGCTCACCTGTGCTCACCTGGGGGTTCCCTGAGCTCACCGGAACCCACCTGTGCTCACCTGAGGGCTCACCTGAACCCACCTGAGCTCCCCTTCCTGCCCCTTGGGGTTCCCTGAACCCACCTGAGCTCACCTGGGGCTTCCCTGAGCCCACTGACGTCACCTGAACCCACCTGGGGCTTCCCTGTCCCCACTTTGGGGTTCCCTGAGCTCACCTGAGCCCACCTGTGCCCACTTTGGGGTTCTCTGAACCCACCTGAGTTCACCTGGGGGTTCCCCAAGCTCACCTGAGCCCACCTGAGCTCACCTGAGCGTTACCTGTCTCCACTTTGGGGTTCCCTGAGCTCACCTGAACCCACCTGTGCCCACTTTGGGGCTCTCTGAACCCACCCGAGCTCACCTGAGCCCACCTGGGGGTTCCCCGGGGCTCTCACCTGCGCCAGGTGCGCGCCCTGCGGCGGCAGCAGCGCATGATCAAGAACCGCGAGTCGGCCAGCGCGTCGCGGCGCCGGCGCAAGGAATACCTGGAGCACCTGGAGAGGAGCCTGAGCCTGGCGCTGGCC

Where it should print something like this:

>JH750237   468
GGCTTCCCTGTCCCCACTTTGGGGCTCACCTGAACCCACCTGTACTCACCTGGGGCTTCCCTGAGCCCATCTGAGCTCCCCTTTCTCCCCTTTGGGGTTCCCTGAACCCACCTGGGCTCACCTGAGCCCACCTGGGGGTTCCCTGAGCTCACCTGAGCGTTACCTGTCCCCCCATTGGGGTTCTCTGAGCTCACCTGAACCCACCTGAGCTCACCTGGGGCTTCCCTGTCCCCACTTTGGGGCTCTCTGAACCCACCTGAGCTCACCTGTGCTCACCTGGGGGTTCCCTGAGCTCACCGGAACCCACCTGTGCTCACCTGAGGGCTCACCTGAACCCACCTGAGCTCCCCTTCCTGCCCCTTGGGGTTCCCTGAACCCACCTGAGCTCACCTGGGGCTTCCCTGAGCCCACTGACGTCACCTGAACCCACCTGGGGCTTCCCTGTCCCCACTTTGGGGTTCCCTGAGCTCACCTGAGCCCACCTGTGCCCACTTTGGGGTTCTCTGAACCCACCTGAGTTCACCTGGGGGTTCCCCAAGCTCACCTGAGCCCACCTGAGCTCACCTGAGCGTTACCTGTCTCCACTTTGGGGTTCCCTGAGCTCACCTGAACCCACCTGTGCCCACTTTGGGGCTCTCTGAACCCACCCGAGCTCACCTGAGCCCACCTGGGGGTTCCCCGGGGCTCTCACCTGCGCCAGGTGCGCGCCCTGCGGCGGCAGCAGCGCATGATCAAGAACCGCGAGTCGGCCAGCGCGTCGCGGCGCCGGCGCAAGGAATACCTGGAGCACCTGGAGAGGAGCCTGAGCCTGGCGCTGGCC
>JH754437   654
CCTGAAAGCTGGTTTCACCATCAAGAAAAGCAAAGTTAATAAATCAGCTTGAGAGATCCAGTTCCTGGGAGTAGAGTGGCAAGATGGACAGTGTCAGATTCCCACCGATGTCATCAACAAGATCACAGCAATGTCTCCACCAACCAACAAGAAGGAAACAGAAGCTTTCCTTGGTGCCATAGGCTTTTGGAGAATGCACATCCCTGAGTACAGCCAGATTGTGAACTATTGGGATTACTACCAATATAGCTGGATGGGACGTGCCTGGACTTGTCTGGCCCTTGCTGCTTGACCAAAGGCAGCAACTGTGGTGATTTCACCTCAGAGACGCCTTTGGCTATCTGGATGTTGCAACTGGGTGCTTGGAGGCAAGTCCAGACATGCAGGAGCTCAGGTTTACCTCTGGTGGCGAAGCTTTAAGGCGAGGATGAAGGAAAGGAGTCGGTTCTGCTGGAGAGTTTCGATCCAGAGCTTTATTTCTGACCCACAGGCCTCTGAATCCATTAATTGCTCTAACAGAACTTCCCGAACCGCGTGGTTGCTGTTCCTTTTAA

So it's not taking the right column. There is a problem in the split of the scaffold and the number after it...

ADD REPLY

Login before adding your answer.

Traffic: 1700 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6