Having trouble with Seqkit Replace, trying to rename fasta headers from file
2.5 years ago
SaltedPork ▴ 160

My fastas look like:

>123456789.1
AGCT
>123456789.2
AGCT
>222221122.1
AGCT


The FASTA sequences are not line-wrapped, so each sequence sits on a single line. The FASTA headers end in a suffix of .1, .2, .3 and so on, up to .8.

cat ids.txt
123456789   123456789.25/12/2019
222221122   222221122.03/03/2020


Desired Fasta:

>123456789.25/12/2019.1
AGCT
>123456789.25/12/2019.2
AGCT
>222221122.03/03/2020.1
AGCT


So I'm trying to replace the text in the first column with the text in the second column (in this case adding dates into the headers), but preserving the .1 and .2 at the end of the headers.

Command:

./seqkit replace --pattern ' ^>(\w+).\d' --replacement ' {kv}' --kv-file ids.txt test.fasta --keep-key > test.out


However, test.out contains the FASTA with the original headers, and there is no error message. Any ideas? I'm working on Windows and have run dos2unix on all files.


Input:

$ cat ids.txt
123456789   123456789.25/12/2019
222221122   222221122.03/03/2020
$ cat test.fa
>123456789.1
AGCT
>123456789.2
AGCT
>222221122.1
AGCT


output:

$ seqkit replace --quiet -p '([0-9]+)(\.[0-9])' -r '{kv}${2}' -k ids.txt test.fa

>123456789.25/12/2019.1
AGCT
>123456789.25/12/2019.2
AGCT
>222221122.03/03/2020.1
AGCT
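
As a cross-check that does not depend on seqkit, the same renaming logic can be sketched in plain awk. This is not what seqkit does internally, just an equivalent under a few assumptions: ids.txt is tab-separated, every header has the form ID.suffix, and the file names and the workdir variable below are made up for illustration.

```shell
# Recreate the example inputs from this thread (ids.txt assumed tab-separated).
workdir=$(mktemp -d)
printf '123456789\t123456789.25/12/2019\n222221122\t222221122.03/03/2020\n' > "$workdir/ids.txt"
printf '>123456789.1\nAGCT\n>123456789.2\nAGCT\n>222221122.1\nAGCT\n' > "$workdir/test.fa"

# First file (ids.txt): build a key -> value map.
# Second file (FASTA): on header lines, split the name at the first dot,
# look up the part before the dot, and re-append the suffix.
# Headers without a matching key, and sequence lines, pass through unchanged.
awk -F'\t' '
    NR == FNR { map[$1] = $2; next }
    /^>/ {
        name = substr($0, 2)
        dot = index(name, ".")
        key = substr(name, 1, dot - 1)
        suffix = substr(name, dot)
        if (dot > 0 && key in map) { print ">" map[key] suffix; next }
    }
    { print }
' "$workdir/ids.txt" "$workdir/test.fa" > "$workdir/renamed.fa"

cat "$workdir/renamed.fa"
```

Several headers can share one key here (123456789.1 and 123456789.2 both look up 123456789), which mirrors what the seqkit command above does with capture group 1 as the key and ${2} as the preserved suffix.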

2.1 years ago
Mark ★ 1.1k

Hey,

I know I'm 6 months too late, but I came across this issue too. I think it's because your pattern is returning multiple hits per query.

If you perform this command you'll see what I mean:

% grep "123456789" test.fasta
>123456789.1
>123456789.2


It's not stated explicitly, but replace with the -k flag needs a 1:1 match. While ids.txt is 1:1, the query result is not (multiple headers are returned for one key). Your ids.txt file needs to look like this:

123456789.1   123456789.1.25/12/2019
123456789.2   123456789.2.25/12/2019
123456789.3   123456789.3.25/12/2019
123456789.4   123456789.4.25/12/2019
....


Note that the new names are unique, whereas in your example, if replace had worked, you'd end up with duplicate header names (which is a problem!). Finally, I suggest you replace the / delimiter with -. It will make your life easier down the track!
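
If you don't want to write that file by hand, here is a rough sketch for generating it from the original two-column ids.txt. It assumes ids.txt is tab-separated, that the second column always has the form ID.date, and that the suffixes run from 1 to 8 as in the question. The names ids_expanded.txt, workdir and max_suffix are made up for this example.

```shell
# Recreate the original two-column ids.txt (assumed tab-separated).
workdir=$(mktemp -d)
printf '123456789\t123456789.25/12/2019\n222221122\t222221122.03/03/2020\n' > "$workdir/ids.txt"

# For each ID, emit one line per suffix 1..max_suffix:
#   ID.suffix <TAB> ID.suffix.date
# with "/" in the date replaced by "-".
awk -F'\t' -v OFS='\t' -v max_suffix=8 '
    {
        id = $1
        date = substr($2, length(id) + 2)   # text after "ID."
        gsub("/", "-", date)
        for (i = 1; i <= max_suffix; i++)
            print id "." i, id "." i "." date
    }
' "$workdir/ids.txt" > "$workdir/ids_expanded.txt"

head -n 3 "$workdir/ids_expanded.txt"
```

With a file like this, the whole header becomes the key, so something along the lines of seqkit replace -p '^(\S+)' -r '{kv}' -k ids_expanded.txt test.fasta should do the lookup (I have not run that exact command, so treat it as a starting point).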

Good luck.