seqkit replace multifasta file
2.2 years ago
mthm ▴ 50

The headers in my multi-FASTA files come in different formats and names; the only thing they have in common is the order of the records. I have a list of new names that should replace the headers in that order. I was thinking of using seqkit, but I can't find the correct syntax for it.

>EOG09150JA6_/storage/home/users/Dlittoralis_73_scf.fasta_jcf7180000720927_64871-66017 117 bp
MRRNNYPYQPLNQHPAPSGPAGHDALEAENERAAEELQQKIGALKSLTIDIGNEVRYQDK
LLRGIDDDMDRTGGFLGNTMTRVVRLAKQGGGSKQMCYMFLFVLFVFVLLWLTLKFK
>EOG09150JA6_/storage/home/users/Dlummei_81_scf.fasta_jcf7180000898911_4133-4655 117 bp
MRRNNYPYQPLNQHPAPSGQAGHDALEAENERAAEELQQKIGALKSLTIDIGNEVRYQDK
LLRGIDDDMDRTGGFLGNTMTRVVRLAKQGGGSKQMCYMFLFVLFVFVLLWLTLKFK
>dmoj37yC5.fa_scf7180000237413_9322-9672 117 bp
MRRNNYPYQPLNQHPAPSGPAGHDALEAENERAAEELQQKIGALKSLTIDIGNEVRYQDK

So far, this is what I came up with:

seqkit replace -p "^(\S+)"

But the -r "{kv}" replacement needs a tab-delimited key-value file, whereas my headers are variable, so I can only provide the new names by order.

seqkit multifasta • 1.2k views

You need to be more specific: what do you want to rename the records to? Give a full example of one record, input and output.

Once tasks get more complicated, writing a very simple Python script is usually the way to go.


In BioPython the solution would look like this, assuming names.txt has two whitespace-separated columns (old name, new name):

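from Bio import SeqIO

# Read the old-name -> new-name mapping (two whitespace-separated columns)
mapping = {}
with open('names.txt') as handle:
    for line in handle:
        if not line.strip():
            continue  # skip blank lines
        oldname, newname = line.split()
        mapping[oldname] = newname

recs = list(SeqIO.parse("input.fa", 'fasta'))

# Rename the records
for rec in recs:
    rec.id = mapping[rec.id]
    # Clear the description; otherwise the old header is written after the new ID
    rec.description = ""

SeqIO.write(recs, "output.fa", 'fasta')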
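
The script above looks records up by their old name, while the question asks for renaming by position. A minimal sketch of an order-based variant, using only the standard library, might look like the following. The function name rename_by_order is hypothetical, and the sketch assumes exactly one new name per record, in the same order as the records.

```python
def rename_by_order(fasta_lines, new_names):
    """Replace each FASTA header with the name at the same position in new_names.

    Hypothetical helper: the old headers are discarded entirely and
    sequence lines are passed through unchanged.
    """
    names = iter(new_names)
    out = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            try:
                out.append(">" + next(names))
            except StopIteration:
                raise ValueError("more records than names") from None
        elif line:
            out.append(line)  # sequence line, kept as-is
    # Any name left over means the list and the file are out of sync
    if any(True for _ in names):
        raise ValueError("more names than records")
    return out


if __name__ == "__main__":
    demo = [">dmoj37yC5.fa_scf7180000237413_9322-9672 117 bp", "MRRNNYPYQ"]
    print("\n".join(rename_by_order(demo, ["seq1"])))
```

With real files, the two arguments could be an open handle on the FASTA file and the stripped lines of names.txt. Failing loudly on a count mismatch is a deliberate choice here, since a silent off-by-one would mislabel every following record.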
2.2 years ago

Use paste to pair the new names with the sequences by order:

$ cat names.txt 
seq1
seq2
seq3

$ seqkit seq -n  test.fasta
EOG09150JA6 _/storage/home/users/Dlittoralis_73_scf.fasta_jcf7180000720927_64871-66017 117 bp
EOG09150JA6 _/storage/home/users/Dlummei_81_scf.fasta_jcf7180000898911_4133-4655 117 bp
dmoj37yC5.fa _scf7180000237413_9322-9672 117 bp

# or 
#   paste names.txt <(seqkit fx2tab test.fasta | cut -f 2) | seqkit tab2fx 
$ seqkit fx2tab test.fasta | cut -f 2 | paste names.txt - | seqkit tab2fx

>seq1
MRRNNYPYQPLNQHPAPSGPAGHDALEAENERAAEELQQKIGALKSLTIDIGNEVRYQDK
LLRGIDDDMDRTGGFLGNTMTRVVRLAKQGGGSKQMCYMFLFVLFVFVLLWLTLKFK
>seq2
MRRNNYPYQPLNQHPAPSGQAGHDALEAENERAAEELQQKIGALKSLTIDIGNEVRYQDK
LLRGIDDDMDRTGGFLGNTMTRVVRLAKQGGGSKQMCYMFLFVLFVFVLLWLTLKFK
>seq3
MRRNNYPYQPLNQHPAPSGPAGHDALEAENERAAEELQQKIGALKSLTIDIGNEVRYQDK
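
If you would rather stay with seqkit replace and "{kv}", the same pairing by order can be used to generate the tab-delimited key-value file it needs. The sketch below is only an illustration: build_kv_lines is a hypothetical helper, it takes the first whitespace-delimited word of each header as the key (matching the "^(\S+)" pattern from the question), and it refuses to continue if two records share that key, because a lookup by key would then be ambiguous.

```python
def build_kv_lines(fasta_lines, new_names):
    """Pair the ID of each FASTA record with the new name at the same position.

    Hypothetical helper: returns lines of the form "old_id<TAB>new_name".
    """
    old_ids = []
    for line in fasta_lines:
        if line.startswith(">"):
            parts = line[1:].split()
            old_ids.append(parts[0] if parts else "")  # first word of the header

    new_names = list(new_names)
    if len(old_ids) != len(new_names):
        raise ValueError(
            f"{len(old_ids)} records but {len(new_names)} names"
        )
    if len(set(old_ids)) != len(old_ids):
        raise ValueError("duplicate record IDs; a key-value lookup would be ambiguous")

    return [f"{old}\t{new}" for old, new in zip(old_ids, new_names)]


if __name__ == "__main__":
    demo = [">dmoj37yC5.fa_scf7180000237413_9322-9672 117 bp", "MRRNNYPYQ"]
    print("\n".join(build_kv_lines(demo, ["seq3"])))
```

The duplicate check matters for data like the example in this thread: if the header is split after EOG09150JA6, the first two records get the same ID, and only a purely positional approach such as paste can tell them apart.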
2.2 years ago
$ seqkit replace -p '.*' -r 'seq' test.fa | seqkit -w 0 rename | sed '1s/$/_1/'

>seq_1
MRRNNYPYQPLNQHPAPSGPAGHDALEAENERAAEELQQKIGALKSLTIDIGNEVRYQDKLLRGIDDDMDRTGGFLGNTMTRVVRLAKQGGGSKQMCYMFLFVLFVFVLLWLTLKFK
>seq_2 
MRRNNYPYQPLNQHPAPSGQAGHDALEAENERAAEELQQKIGALKSLTIDIGNEVRYQDKLLRGIDDDMDRTGGFLGNTMTRVVRLAKQGGGSKQMCYMFLFVLFVFVLLWLTLKFK
>seq_3 
MRRNNYPYQPLNQHPAPSGPAGHDALEAENERAAEELQQKIGALKSLTIDIGNEVRYQDK


$ awk  -v RS=">" 'NR > 1 {print ">seq_"NR-1,$0}' test.fa  | awk 'NF {print $1}'

>seq_1
MRRNNYPYQPLNQHPAPSGPAGHDALEAENERAAEELQQKIGALKSLTIDIGNEVRYQDK
LLRGIDDDMDRTGGFLGNTMTRVVRLAKQGGGSKQMCYMFLFVLFVFVLLWLTLKFK
>seq_2
MRRNNYPYQPLNQHPAPSGQAGHDALEAENERAAEELQQKIGALKSLTIDIGNEVRYQDK
LLRGIDDDMDRTGGFLGNTMTRVVRLAKQGGGSKQMCYMFLFVLFVFVLLWLTLKFK
>seq_3
MRRNNYPYQPLNQHPAPSGPAGHDALEAENERAAEELQQKIGALKSLTIDIGNEVRYQDK