4
0
Entering edit mode
3.9 years ago
tpaisie ▴ 70

So I have a director full of fasta files and I want to change the fasta header in each one by the name of their corresponding fasta file. For example:

HC1993.fa

> X58834
CCTGCATCTGCAA


HC1993.fa

> HC1993
CCTGCATCTGCAA


I have about 50 fasta files like that in a directory that I was to do the same thing to. I've been using this sed command for one file that works:

sed 's/>.*/>HC1193/' HC1993.fa > new/HC1993.fa


But now I want to loop this command through the directory and this is the command I have been using:

for i in $(ls *.fa | rev | cut -c 4- | rev | uniq) do sed 's/>.*/>${i}/' ${i}.fa > new/${i}.fa
done


This command gives me this for all the new fasta file headers

HC1993.fa

>${i} CCTGCATCTGCAA  Now I know there is a bunch of way to fix this, but could someone help me fix the bash loop I made? I want to learn my incorrect command and now to fix it. Thanks! sequence fasta bash loop sed unix • 6.3k views ADD COMMENT 1 Entering edit mode 3.9 years ago Example fasta: $ cat HC1993.fa
>X58834
CCTGCATCTGCAA


Expected output (assumption is that first line in each fasta file is fasta header):

$cat HC1993.fa >HC1993 CCTGCATCTGCAA  in bash: $ for i in *.fa; do sed "1s/.*/>${i%.fa}/"$i; done
>HC1993
CCTGCATCTGCAA


using GNU-parallel:

$parallel 'sed "1s/.*/>{.}/" {}' ::: *.fa >HC1993 CCTGCATCTGCAA  ADD COMMENT 0 Entering edit mode Ohh thank you so much that worked!!!! ADD REPLY 0 Entering edit mode For future reference, code can be further shorted by: $ parallel  'sed "/^>/ c {.}" {}' ::: *.fa

0
Entering edit mode
3.9 years ago
Joe 19k

As I understand it, you just want to make the header of the file, the filename?

e.g. given:

~/test/seqs$ls seq1.fasta seq2.fasta seq3.fasta ~/test/seqs$ cat seq*
>tpg|Magnaporthiopsis_incrustans|JF414846
ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGC
>tpg|Pyricularia_pennisetigena|AB818016
GCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAA
>tpg|Inocybe_sororia|EU525947
AACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGC


So, for file in *.fasta ; do sed -i "s/^>.*/>"${file%.*}"/gi"$file; done

Yeilds:

~/test/seqs$for file in *.fasta ; do sed -i "s/^>.*/>"${file%.*}"/gi" "$file"; done >seq1 ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGC >seq2 GCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAA >seq3 AACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGC  ADD COMMENT 0 Entering edit mode So yes your interpretation of what I would like is correct. Although I used your command and I'm getting this as an error: sed: 1: "HC1993.fa": extra characters at the end of H command And it is not making the new fasta files with the new headers. ADD REPLY 0 Entering edit mode Are you using Mac OS? ADD REPLY 0 Entering edit mode 3.9 years ago try changing 'sed 's/>.*/>${i}/' to sed "s/>.*/>${i}/". ADD COMMENT 0 Entering edit mode 18 months ago Anna ▴ 120 I worked out another solution using a combination of AWK, SED and Perl. This solution works for single files, where each file has one header and the goal is to replace the header with a modified version of the file name. Assuming all your fasta files in the current directory end in ".fa", run the following: ls -lrt \| grep ".fa$" \|
awk '{OFS="";print "var=",$NF"\nsed -i'.ori' -e \"s/>.*/>$var/g\" "$NF}' \| perl -lne 'if ($_=~/^(var=.*)\.fa/) {print \$1} else {print}' > run_command.sh


Explanation:

The first line ls -lrt does a list of the files in your dir.

The second line keeps only those lines of your list that end in .fa

The third and fourth lines, this one gets a bit tricky, are an AWK and Perl commands. The AWK command that prints what the command will look like. In this case, there are two lines of code that will be printed. The first one assigns the name to a variable called var and the second part of the awk command replaces the line with > in your original fasta file with the name of the file: for example if the file name is file1.fasta is the fasta header will be >file1 . The Perl command cleans things up.

The difference of this solution with others is that it prints the commands that are going to execute the name change into a file. In this way, it is possible to: i) make sure the names are changing for what you really want; ii) you keep a record of what happened. After you have checked that the name changes in the file run_command.sh are what you want, you can just:

bash run_command.sh


and it should do the trick.

NB: mind the -i'.ori' option. This is specific for macOS as it won't overwrite the file as other GNU distributions of sed would do. Ah, and if you're copy-pasting the command, you may want to delete the \ characters before each |`.

This solution is far from elegant, but works.