Question: Replace names in FASTA file with a known character string from a text file
3
gravatar for beausoleilmo
24 months ago by
beausoleilmo190
McGill University
beausoleilmo190 wrote:

I have a code that is creating me a script where I can change the name of the titles of the sequences in a FASTA file.

This is the text file I'm using:

#Assembly   Genome Center name  RefSeq Accession.version    GenBank Accession.version   NCBI name
GeoFor_1.0  scaffold40  NW_005054297    JH739887    GPS_002009865
GeoFor_1.0  scaffold112 NW_005054298    JH739888    GPS_002009866
GeoFor_1.0  scaffold41  NW_005054299    JH739889    GPS_002009867
GeoFor_1.0  scaffold130 NW_005054300    JH739890    GPS_002009868
GeoFor_1.0  scaffold54  NW_005054301    JH739891    GPS_002009869
GeoFor_1.0  scaffold16  NW_005054302    JH739892    GPS_002009870

This is the FASTA file that I'm using and that I want to change the names. AS you can see, I want to find the scaffold names that match the different JH######.

>Scaffold410    275
TGCATTAATATGAGTGTGTGCTGCAAAAGTTCAGGTCATGGTCCGATCATACTTCACATTTTGGTAGCACTTTAAGCAGAGATCGGTTATCCCATTCTGTGGAAGACTCAACACTATCATAAGGTCCCACAGTTTTATTATCCCTCTGCCTCCCGGAATGCCCCCGGCAGTGAGGGGTACCATCTTCTCAGCAGTAAGGATATTCTTCAGGAGTTCCGTGTGAGCTTTCCCGGATTTAGTTCCATTTTTTAAATACTTCCCAATTCTTTGCTTTG
>Scaffold430    374
CTTTGTTAACTGAAAGAGCCTCTAAGTAGATGACCAGTGCTCAGTTAGTACAGTATGAATTTTGTTTAATGGAACAGGAAGATTTAGTATTGAGAAGCGGTTAAGGGTTTAACCCAGCCTCCTGTCTGAATGGACCTGAAGAGGGGGGCCGGGAAGAAACCCATGACTGCATTAAAGTGATAGATCTCCAGACATGGGCTAGGGAAGATTTACAAGACACTCCCTGGCCTGAGGGAGAAAATATGTTTATTGATGAGTCTTCAAGGGTGGCAGAAGGGAAGCGATTTACAGGATACACAATCATTAATGGAAGGAAATTAAAGGAAGGGGGGAGATTGTCACCCACCTGGTCAGTTCAGACAGCAGAGCTGTAT
>Scaffold1010   597
GGAACACACCTGGGCACACCTGGATGGAGCAGGAACACACCTGGATGGGGTTAGGACACATCTGGATGGCGTTGGGACACACCTGGATGCGCTCAGGGTACACCTG...

Thesis the command I use to create a script to change the names

tail -n +2 scaffold_names_2.txt | while read assemb gcenter refseq genbank ncbi; do echo -ne "sed 's/[[:<:]]$gcenter[[:>:]]/$genbank/g' | " >>script.sh; done

The problem is that I'm not able to save the fasta file with the new names.

This is the last line of my script:

... sed 's/[[:<:]]scaffold4469[[:>:]]/JH767125/g'  name.fasta.fa

The script is running without error, but it's doing nothing.

Do you know why? How can I change all the titles and save it as a new fasta file with another name ?

fasta • 1.4k views
ADD COMMENTlink modified 24 months ago by shenwei3564.2k • written 24 months ago by beausoleilmo190
5
gravatar for shenwei356
24 months ago by
shenwei3564.2k
China
shenwei3564.2k wrote:

It's very simple with SeqKit, a cross-platform and ultrafast toolkit for FASTA/Q file manipulation.

Just one command:

cat scaffolds.fasta | seqkit replace --ignore-case --kv-file <(cut -f 2,4 table.tsv) --pattern "^(\w+)" --replacement "{kv}"

Explanation:

SeqKit subcommand replace can find substrings with regular expression and replace them with strings or corresponding values of found substrings provided by the tab-delimited key-value file.

Therefore, the key-value data are firstly extracted by cut -f 2,4 table.tsv

$ cut -f 2,4 table.tsv
Genome  name
scaffold410     JH739887
scaffold430     JH739888

Then seqkit replace command is used to replace FASTA headers. --pattern "^(\w+)" refers to the to be replaced substrings (e.g., Scaffold410), and the special placeholder symbols {kv} represents the corresponding value of the key.

BTW, if you need similar manipulations of CSV/TSV files, csvtk provides similar functions and features

ADD COMMENTlink modified 24 months ago • written 24 months ago by shenwei3564.2k

The only problem here is that you are not getting the same format as the original fasta file.

cat input.fa | seqkit replace --ignore-case --kv-file  <(cut -f 2,4 scaffold_names.txt) --pattern "^(\w+)" --replacement "{kv}" | cat -n

will give you something like this (see the line number):

[INFO] read key-value file: /dev/fd/63
[INFO] 27240 pairs of key-value loaded
     1  >JH761601   275
     2  TGCATTAATATGAGTGTGTGCTGCAAAAGTTCAGGTCATGGTCCGATCATACTTCACATT
     3  TTGGTAGCACTTTAAGCAGAGATCGGTTATCCCATTCTGTGGAAGACTCAACACTATCAT
     4  AAGGTCCCACAGTTTTATTATCCCTCTGCCTCCCGGAATGCCCCCGGCAGTGAGGGGTAC
     5  CATCTTCTCAGCAGTAAGGATATTCTTCAGGAGTTCCGTGTGAGCTTTCCCGGATTTAGT
     6  TCCATTTTTTAAATACTTCCCAATTCTTTGCTTTG
     7  >JH758574   374
     8  CTTTGTTAACTGAAAGAGCCTCTAAGTAGATGACCAGTGCTCAGTTAGTACAGTATGAAT
     9  TTTGTTTAATGGAACAGGAAGATTTAGTATTGAGAAGCGGTTAAGGGTTTAACCCAGCCT

Whereas the original file is organized like this:

 1  >Scaffold410    275
 2  TGCATTAATATGAGTGTGTGCTGCAAAAGTTCAGGTCATGGTCCGATCATACTTCACATTTTGGTAGCACTTTAAGCAGAGATCGGTTATCCCATTCTGTGGAAGACTCAACACTATCATAAGGTCCCACAGTTTTATTATCCCTCTGCCTCCCGGAATGCCCCCGGCAGTGAGGGGTACCATCTTCTCAGCAGTAAGGATATTCTTCAGGAGTTCCGTGTGAGCTTTCCCGGATTTAGTTCCATTTTTTAAATACTTCCCAATTCTTTGCTTTG
ADD REPLYlink written 24 months ago by beausoleilmo190
1

The solution to this is that:

cat input.fa | seqkit replace --ignore-case --keep-key --line-width 0 --kv-file \
 <(cut -f 2,4 text_repalce.tsv) --pattern "^(\w+)" --replacement "{kv}"

--line-width 0 will not wrap the text!

--keep-key will keep a string that wasn't matched

See seqkit replace -h

ADD REPLYlink modified 24 months ago • written 24 months ago by beausoleilmo190

glad you found it!!!

ADD REPLYlink written 24 months ago by shenwei3564.2k

Hi all! I'm having sort of the same problem, but in my case I have keys that must only partially match the name in the fasta file, since the fasta include also numbers, and I need to match by the species code. Example fasta

>ENSETEP00040504
ATGGTCAGCTAGCACGAGGCAGGAGCGAGAGCAGAG...
...

Example key

ENSETEP    Echinops_telfairi
...

I tried several things, from including special characters (* and .*) in the key to playing around with the "^(\w+)" expression, in which I suspect the problem lies, to no avail. I'm sure the solution it's close, but I haven't got it yet. Any clues? P.S. It correctly loads the kv, yet it cannot match anything in the fasta file.

Thank you!

ADD REPLYlink modified 10 months ago • written 10 months ago by Solowars40
1

Also, the info lines

[INFO] read key-value file: /dev/fd/63
[INFO] 27240 pairs of key-value loaded

won't be printed if you decide to save the files! Super handy trick!

ADD REPLYlink written 24 months ago by beausoleilmo190
3
gravatar for Pierre Lindenbaum
24 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum114k wrote:

a one-liner, assuming all delimiters are tabulations:

join  -t $'\t' -1 2 -2 1 \
<(cat table.txt | sort -t $'\t' -k2,2) \
<(cat input.fasta | paste - - | cut -c 2- | sed 's/Scaffold/scaffold/' |  sort -t $'\t' -k1,1) | \
awk -F '\t' '{printf(">%s\t%s\n%s\n",$4,$6,$7);}'
ADD COMMENTlink modified 24 months ago • written 24 months ago by Pierre Lindenbaum114k

Can you explain me what is you command doing?

You are basically matching field 2 from file 1 and field 1 from file 2. Inputing the text from a text file sorted by field 2 to 2! Pasting 2 columns - -

Join needs -?

ADD REPLYlink modified 24 months ago • written 24 months ago by beausoleilmo190
1
(cat table.txt | sort -t $'\t' -k2,2)

sort table on column 2 (scaffold)

cat input.fasta | paste - - | cut -c 2- | sed 's/Scaffold/scaffold/' |  sort -t $'\t' -k1,1

linearize fasta (two lines) ; remove first char ('>') | convert 'Scaffold' to lower case; sort on scaffold

join  -t $'\t' -1 2 -2 1 \

join the processes above using column 2 and 1 and tab as delimiter

  awk -F '\t' '{printf(">%s\t%s\n%s\n",$4,$6,$7);}'

convert back to fasta

ADD REPLYlink written 24 months ago by Pierre Lindenbaum114k

Thanks!!!

I cross validated with a grep by matching the sequence of a scaffold. It worked! scaffold1000 => JH755378

grep -A 10 -B 10 AAGCTTTGCAGAAATAAAAGAAATTTGTTAAGCGGCAAGATCGTATGGAAGAGAGTCAGCATATTTCAGAGATCTATTACAAGCCACCTTTTCAGCCAATGTATTCACCCCTCATGACCTAAGGCAACTTTGTACTACATCTTTGCTGTCTCCGACAGGGTATTCTCTGTGGGAGCTGGCATGGAAGCGTTTATTAAATGACCTTTTAAGAGGTTATGCCAGAACTCCAGCCACTGCAAATATAACTCAAGTTCAGCTTGCTGGGGAAGGAGCCTACATAGATCCACAAAGACAGGCAAGGGCTCCTAATAGACGAATCCTTCAGGATATTAAGGAAGCAGCAAGGACTGCCCTGCTCCAGGTGCCTGATGGGAATAAACCAAAATTAGACTTCACAGAAATTAAACAAGGGATGGATGAGCCATACTCTGTGACGGTGGTCACAGGGGTCTTAGGATGAGGGAAGAGACGTAGATCTGACTGCATGTTACAGAAGGCTTGATTTATCATTTTATGA Geospiza_fortis.scaf.noBacterial.fa

Just for the record!

ADD REPLYlink modified 24 months ago • written 24 months ago by beausoleilmo190

I don't know how we could do that but I lost the information or the options regarding the scaffolds.

head Geospiza_fortis.scaf.noBacterial.fa.fai # (before)
# Scaffold410       275     17      275     276
# Scaffold430       374     310     374     375
# Scaffold1010      597     703     597     598
head Geospiza_fortis.scaf.noBacterial.fa.fai # (after your code...)
# JH739910  0   8342783     -1  -1
# JH739916  0   15380221    -1  -1
# JH739940  0   20636077    -1  -1
# JH755378  0   20636606    -1  -1

This is creating the error: [mplp_func] Skipping because 16632861 is outside of 1 [ref:0] when using mpileup.

ADD REPLYlink modified 24 months ago • written 24 months ago by beausoleilmo190

There is an error with the awk command:

awk -F '\t' '{printf(">%s\t%s\n%s\n",$4,$6,$7);}'

is printing something like this:

>JH750237   GGCTTCCCTGTCCCCACTTTGGGGCTCACCTGAACCCACCTGTACTCACCTGGGGCTTCCCTGAGCCCATCTGAGCTCCCCTTTCTCCCCTTTGGGGTTCCCTGAACCCACCTGGGCTCACCTGAGCCCACCTGGGGGTTCCCTGAGCTCACCTGAGCGTTACCTGTCCCCCCATTGGGGTTCTCTGAGCTCACCTGAACCCACCTGAGCTCACCTGGGGCTTCCCTGTCCCCACTTTGGGGCTCTCTGAACCCACCTGAGCTCACCTGTGCTCACCTGGGGGTTCCCTGAGCTCACCGGAACCCACCTGTGCTCACCTGAGGGCTCACCTGAACCCACCTGAGCTCCCCTTCCTGCCCCTTGGGGTTCCCTGAACCCACCTGAGCTCACCTGGGGCTTCCCTGAGCCCACTGACGTCACCTGAACCCACCTGGGGCTTCCCTGTCCCCACTTTGGGGTTCCCTGAGCTCACCTGAGCCCACCTGTGCCCACTTTGGGGTTCTCTGAACCCACCTGAGTTCACCTGGGGGTTCCCCAAGCTCACCTGAGCCCACCTGAGCTCACCTGAGCGTTACCTGTCTCCACTTTGGGGTTCCCTGAGCTCACCTGAACCCACCTGTGCCCACTTTGGGGCTCTCTGAACCCACCCGAGCTCACCTGAGCCCACCTGGGGGTTCCCCGGGGCTCTCACCTGCGCCAGGTGCGCGCCCTGCGGCGGCAGCAGCGCATGATCAAGAACCGCGAGTCGGCCAGCGCGTCGCGGCGCCGGCGCAAGGAATACCTGGAGCACCTGGAGAGGAGCCTGAGCCTGGCGCTGGCC

Where it should print something like this:

>JH750237   468
GGCTTCCCTGTCCCCACTTTGGGGCTCACCTGAACCCACCTGTACTCACCTGGGGCTTCCCTGAGCCCATCTGAGCTCCCCTTTCTCCCCTTTGGGGTTCCCTGAACCCACCTGGGCTCACCTGAGCCCACCTGGGGGTTCCCTGAGCTCACCTGAGCGTTACCTGTCCCCCCATTGGGGTTCTCTGAGCTCACCTGAACCCACCTGAGCTCACCTGGGGCTTCCCTGTCCCCACTTTGGGGCTCTCTGAACCCACCTGAGCTCACCTGTGCTCACCTGGGGGTTCCCTGAGCTCACCGGAACCCACCTGTGCTCACCTGAGGGCTCACCTGAACCCACCTGAGCTCCCCTTCCTGCCCCTTGGGGTTCCCTGAACCCACCTGAGCTCACCTGGGGCTTCCCTGAGCCCACTGACGTCACCTGAACCCACCTGGGGCTTCCCTGTCCCCACTTTGGGGTTCCCTGAGCTCACCTGAGCCCACCTGTGCCCACTTTGGGGTTCTCTGAACCCACCTGAGTTCACCTGGGGGTTCCCCAAGCTCACCTGAGCCCACCTGAGCTCACCTGAGCGTTACCTGTCTCCACTTTGGGGTTCCCTGAGCTCACCTGAACCCACCTGTGCCCACTTTGGGGCTCTCTGAACCCACCCGAGCTCACCTGAGCCCACCTGGGGGTTCCCCGGGGCTCTCACCTGCGCCAGGTGCGCGCCCTGCGGCGGCAGCAGCGCATGATCAAGAACCGCGAGTCGGCCAGCGCGTCGCGGCGCCGGCGCAAGGAATACCTGGAGCACCTGGAGAGGAGCCTGAGCCTGGCGCTGGCC
>JH754437   654
CCTGAAAGCTGGTTTCACCATCAAGAAAAGCAAAGTTAATAAATCAGCTTGAGAGATCCAGTTCCTGGGAGTAGAGTGGCAAGATGGACAGTGTCAGATTCCCACCGATGTCATCAACAAGATCACAGCAATGTCTCCACCAACCAACAAGAAGGAAACAGAAGCTTTCCTTGGTGCCATAGGCTTTTGGAGAATGCACATCCCTGAGTACAGCCAGATTGTGAACTATTGGGATTACTACCAATATAGCTGGATGGGACGTGCCTGGACTTGTCTGGCCCTTGCTGCTTGACCAAAGGCAGCAACTGTGGTGATTTCACCTCAGAGACGCCTTTGGCTATCTGGATGTTGCAACTGGGTGCTTGGAGGCAAGTCCAGACATGCAGGAGCTCAGGTTTACCTCTGGTGGCGAAGCTTTAAGGCGAGGATGAAGGAAAGGAGTCGGTTCTGCTGGAGAGTTTCGATCCAGAGCTTTATTTCTGACCCACAGGCCTCTGAATCCATTAATTGCTCTAACAGAACTTCCCGAACCGCGTGGTTGCTGTTCCTTTTAA

So it's not taking the right column. There is a problem in the split of the scaffold and the number after it...

ADD REPLYlink modified 24 months ago • written 24 months ago by beausoleilmo190
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1973 users visited in the last hour