Changing names of Fasta headers
4
0
Entering edit mode
6.1 years ago
tpaisie ▴ 80

So I have a director full of fasta files and I want to change the fasta header in each one by the name of their corresponding fasta file. For example:

HC1993.fa

> X58834
CCTGCATCTGCAA

HC1993.fa

> HC1993
CCTGCATCTGCAA

I have about 50 fasta files like that in a directory that I was to do the same thing to. I've been using this sed command for one file that works:

sed 's/>.*/>HC1193/' HC1993.fa > new/HC1993.fa

But now I want to loop this command through the directory and this is the command I have been using:

for i in $(ls *.fa | rev | cut -c 4- | rev | uniq)
do
    sed 's/>.*/>${i}/' ${i}.fa > new/${i}.fa
done

This command gives me this for all the new fasta file headers

HC1993.fa

>${i}
CCTGCATCTGCAA

Now I know there is a bunch of way to fix this, but could someone help me fix the bash loop I made? I want to learn my incorrect command and now to fix it. Thanks!

sequence fasta bash loop sed unix • 8.9k views
ADD COMMENT
1
Entering edit mode
6.1 years ago

Example fasta:

$ cat HC1993.fa 
>X58834 
CCTGCATCTGCAA

Expected output (assumption is that first line in each fasta file is fasta header):

$ cat HC1993.fa 
>HC1993
CCTGCATCTGCAA

in bash:

$ for i in *.fa; do sed "1s/.*/>${i%.fa}/" $i; done
>HC1993
CCTGCATCTGCAA

using GNU-parallel:

$  parallel  'sed "1s/.*/>{.}/" {}' ::: *.fa
>HC1993
CCTGCATCTGCAA
ADD COMMENT
0
Entering edit mode

Ohh thank you so much that worked!!!!

ADD REPLY
0
Entering edit mode

For future reference, code can be further shorted by:

$ parallel  'sed "/^>/ c {.}" {}' ::: *.fa
ADD REPLY
0
Entering edit mode
6.1 years ago
Joe 21k

As I understand it, you just want to make the header of the file, the filename?

e.g. given:

~/test/seqs$ ls
seq1.fasta  seq2.fasta  seq3.fasta
~/test/seqs$ cat seq*
>tpg|Magnaporthiopsis_incrustans|JF414846
 ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGC
>tpg|Pyricularia_pennisetigena|AB818016
GCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAA
>tpg|Inocybe_sororia|EU525947
AACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGC

So, for file in *.fasta ; do sed -i "s/^>.*/>"${file%.*}"/gi" $file; done

Yeilds:

~/test/seqs$ for file in *.fasta ; do sed -i "s/^>.*/>"${file%.*}"/gi" "$file"; done
>seq1
ACTGTAGTAGCTACGATCGATCAGATGATCACGTAGCATCGATCGATCATCGACTAGTAGATCACTCGACATAGATCCACATCAATAGATCATCATCATCATAATCGATCACTAGCAGC
>seq2
GCAAGNTTCATGACGATGTAGAATGGCTTATCGAAGGGAGCAGGCCAGGGATTGAGGTCCGTCTCACGGGTTGGCTTCACTCCCCCACTGCCAGCCCTCTTGCTGCAACTCCACCAGAA
>seq3
AACCANGCCGCGACGGCGGTGCGATCGGGAAACGCGGCGGTGGCGGAGGAATCGGCCATCCTTCACCATATCGGCCAAGGATTGTGGTTCCTGTAGGGCTCGCGCAGCCCAGGACGCGC
ADD COMMENT
0
Entering edit mode

So yes your interpretation of what I would like is correct. Although I used your command and I'm getting this as an error:

sed: 1: "HC1993.fa": extra characters at the end of H command

And it is not making the new fasta files with the new headers.

ADD REPLY
0
Entering edit mode

Are you using Mac OS?

ADD REPLY
0
Entering edit mode
6.1 years ago

try changing 'sed 's/>.*/>${i}/' to sed "s/>.*/>${i}/".

ADD COMMENT
0
Entering edit mode
3.8 years ago
Anna ▴ 130

I worked out another solution using a combination of AWK, SED and Perl. This solution works for single files, where each file has one header and the goal is to replace the header with a modified version of the file name.

Assuming all your fasta files in the current directory end in ".fa", run the following:

ls -lrt  \| 
grep ".fa$" \| 
awk '{OFS="";print "var=",$NF"\nsed -i'.ori' -e \"s/>.*/>$var/g\" "$NF}'  \| 
perl -lne 'if ($_=~/^(var=.*)\.fa/) {print $1} else {print}' > run_command.sh

Explanation:

The first line ls -lrt does a list of the files in your dir.

The second line keeps only those lines of your list that end in .fa

The third and fourth lines, this one gets a bit tricky, are an AWK and Perl commands. The AWK command that prints what the command will look like. In this case, there are two lines of code that will be printed. The first one assigns the name to a variable called var and the second part of the awk command replaces the line with > in your original fasta file with the name of the file: for example if the file name is file1.fasta is the fasta header will be >file1 . The Perl command cleans things up.

The difference of this solution with others is that it prints the commands that are going to execute the name change into a file. In this way, it is possible to: i) make sure the names are changing for what you really want; ii) you keep a record of what happened. After you have checked that the name changes in the file `run_command.sh are what you want, you can just:

bash run_command.sh

and it should do the trick.

NB: mind the -i'.ori' option. This is specific for macOS as it won't overwrite the file as other GNU distributions of sed would do. Ah, and if you're copy-pasting the command, you may want to delete the \ characters before each |.

This solution is far from elegant, but works.

ADD COMMENT

Login before adding your answer.

Traffic: 2734 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6