Editing fasta headers
12 months ago
Zoe • 0

Hi all,

I have fasta files containing sequences from different loci (one fasta file per individual) and would like to change the headers so that I can then merge everything into one big fasta file.

This is the format of the first two records of one of my fasta files (called individual1-allele1.fa):

>lcl|contig_13517 
CGTTTTATGTCTAGGTTGTAGTTCTAACTCACTGCAACGACAATCAAATTGTAGGTGCA
>lcl|contig_22604
AAGGATTAAAAATGAAAACTATGCAAAACTATGAGGAATAAAACTTCTTACATCTGAACT

And I would like the headers to have the following format (gene|species|individual|allele1):

>contig_13517|species1|individual1|allele1
CGTTTTATGTCTAGGTTGTAGTTCTAACTCACTGCAACGACAATCAAATTGTAGGTGCA
>contig_22604|species1|individual1|allele1
AAGGATTAAAAATGAAAACTATGCAAAACTATGAGGAATAAAACTTCTTACATCTGAACT

So essentially I need to remove the "lcl|" prefix and add the species name, individual name, and allele number.

Could you help me with this?

Cheers


Removing the lcl prefix is easy with sed (many examples are available by searching this site). Adding text, however, depends on whether the content to add is constant or depends on existing values in the header. For constant content, you can use tools like awk. For dynamic content, you need to provide a mapping from the existing values to the new content and use a tool like seqkit replace.
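
To illustrate the dynamic case, here is a minimal Python sketch in which the text appended to each header is looked up from the contig ID. The lookup table `SPECIES_BY_CONTIG`, the function name `rewrite_header`, and the value "species2" are hypothetical and only serve as an example; the sequences are shortened.

```python
import re

# Hypothetical lookup table: contig ID -> value to append to the header.
# In practice this might be loaded from a tab-separated file.
SPECIES_BY_CONTIG = {
    "contig_13517": "species1",
    "contig_22604": "species2",
}


def rewrite_header(line, mapping, default="unknown"):
    """Drop the 'lcl|' prefix from a header and append the mapped value.

    Lines that are not headers (no leading '>') are returned unchanged.
    """
    if not line.startswith(">"):
        return line
    # Remove '>' and an optional 'lcl|' prefix, then trailing whitespace.
    contig = re.sub(r"^>(lcl\|)?", "", line).strip()
    return f">{contig}|{mapping.get(contig, default)}"


records = [
    ">lcl|contig_13517 ",
    "CGTTTT",
    ">lcl|contig_22604",
    "AAGGAT",
]
for line in records:
    print(rewrite_header(line, SPECIES_BY_CONTIG))
```

Contigs missing from the table get the placeholder "unknown" here, which makes gaps in the mapping easy to spot in the output.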

12 months ago
Harley ▴ 10

Yes, I can help you with this. You can use a scripting language like Python to automate the task. Here is some Python code that you can use to rename the headers of your fasta files:

python

In this code, you need to replace "/path/to/fasta/files/" with the path to the directory containing your fasta files. The code will loop through each fasta file in the directory, extract the allele number from the file name, and loop through each sequence in the file. It will then extract the contig ID from the header, create a new header with the desired format, and write the new fasta sequence to a file called "output.fa".

Note that if you have multiple individuals, you will need to modify the code to loop through each individual and create a separate output file for each individual.
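
As a minimal sketch of the steps described above (not the code from the original post), the following assumes file names of the form "individual1-allele1.fa" and a single species name passed in by hand. The function names `parse_name`, `new_header`, and `merge_fasta` are made up for illustration. Because the individual and allele are taken from each file name, this version writes all individuals into one merged file.

```python
from pathlib import Path


def parse_name(path):
    """Split 'individual1-allele1.fa' into ('individual1', 'allele1')."""
    # Raises ValueError if the file name contains no '-'.
    individual, allele = Path(path).stem.split("-", 1)
    return individual, allele


def new_header(line, species, individual, allele):
    """Turn '>lcl|contig_13517' into '>contig_13517|species|individual|allele'."""
    contig = line[1:].strip()  # drop '>' and surrounding whitespace
    if contig.startswith("lcl|"):
        contig = contig[len("lcl|"):]
    return f">{contig}|{species}|{individual}|{allele}"


def merge_fasta(in_dir, out_path, species="species1"):
    """Rewrite the headers of every .fa file in in_dir and write all records to out_path."""
    out_path = Path(out_path)
    with out_path.open("w") as out:
        for fasta in sorted(Path(in_dir).glob("*.fa")):
            if fasta.resolve() == out_path.resolve():
                continue  # do not read the merged file back in
            individual, allele = parse_name(fasta)
            with fasta.open() as handle:
                for line in handle:
                    line = line.rstrip("\n")
                    if line.startswith(">"):
                        line = new_header(line, species, individual, allele)
                    out.write(line + "\n")
```

A call would look like `merge_fasta("/path/to/fasta/files/", "output.fa")`, with the directory replaced by your own path. The input files are left untouched, so the run can be repeated safely.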

Alternatively, you can use Linux command-line tools such as sed and awk to rename the headers of your fasta files. Here is a possible solution:

First, use sed to remove the "lcl|" prefix, leaving only the contig ID:

sed -i 's/^>lcl|\(.*\)/>\1/' *.fa

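This command strips the leading "lcl|" from every header line in each fasta file, leaving only the contig ID. Note that -i edits the files in place, so work on copies if you want to keep the originals. The syntax shown is for GNU sed; BSD/macOS sed requires an argument after -i (for example -i '').

Then use awk in a loop to append the species name, individual name, and allele to each header: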

for file in *.fa
do
  base="${file%.fa}"          # e.g. individual1-allele1
  individual="${base%%-*}"    # text before the first "-"
  allele="${base##*-}"        # text after the last "-"
  awk -v species="species1" -v individual="$individual" -v allele="$allele" \
    '/^>/ {sub(/[[:space:]]+$/, ""); print $0 "|" species "|" individual "|" allele; next} {print}' \
    "$file" > "$base.new.fa"
  mv "$base.new.fa" "$file"
done

This loop goes through each fasta file in the directory and appends the species name, individual name, and allele to the header of each sequence. awk finds the header lines (those starting with ">"), strips any trailing whitespace, and appends the three fields; all other lines are printed unchanged. The result is written to a new fasta file, which mv then renames to the original file name. Note that the loop assumes the file naming convention "individual1-allele1.fa", where "individual1" is the individual name and "allele1" is the allele. The species name is not part of the file name, so it is set by hand in the awk call. If your naming convention is different, you will need to modify the command accordingly.

This is my first time answering a question, so please forgive any errors. The code is not formatted as code blocks because I could not find the code block button in the editor, and I apologize for that. I hope this is helpful to you. Thank you.


You can simply select the code and press the 101010 button above the editor frame, which indents the selected text by four spaces. You can also wrap the text in ``` (three backticks), which needs no indentation, e.g.,

some text

```
code block
```

would format to:

some text

code block