Having trouble with Seqkit Replace, trying to rename fasta headers from file
22 months ago
SaltedPork ▴ 120

My fastas look like:

>123456789.1
AGCT
>123456789.2
AGCT
>222221122.1
AGCT

The sequences are not line-wrapped, so each sequence sits on a single line. The fasta headers end in a numeric suffix: .1, .2, .3 and so on, up to .8.

cat ids.txt
123456789   123456789.25/12/2019
222221122   222221122.03/03/2020

Desired Fasta:

>123456789.25/12/2019.1
AGCT
>123456789.25/12/2019.2
AGCT
>222221122.03/03/2020.1
AGCT

So I'm trying to replace the text in the first column with the text in the second column (in this case adding dates into the headers), but preserving the .1 and .2 at the end of the headers.

Command:

./seqkit replace --pattern ' ^>(\w+).\d' --replacement ' {kv}' --kv-file ids.txt test.fasta --keep-key > test.out

However, test.out contains the fasta with the original headers, and there is no error message. Any ideas? I'm working on Windows and have run dos2unix on all files.

seqkit • 1.9k views

Input:

$ cat ids.txt 
123456789   123456789.25/12/2019
222221122   222221122.03/03/2020

$ cat test.fa                                                               
>123456789.1
AGCT
>123456789.2
AGCT
>222221122.1
AGCT

output:

$ seqkit replace --quiet -p '([0-9]+)(\.[0-9])' -r '{kv}${2}'  -k ids.txt test.fa 

>123456789.25/12/2019.1
AGCT
>123456789.25/12/2019.2
AGCT
>222221122.03/03/2020.1
AGCT
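
For anyone who wants to cross-check the result without seqkit, the same key lookup can be sketched in plain awk. This is only an illustration under a few assumptions: `rename_headers` is a made-up helper name, the sample files are recreated in a temporary directory, and every header is assumed to look like `>KEY.N` with a numeric suffix. Note that the key is taken from the header without the leading `>`; as far as I understand, seqkit also matches the pattern against the name without `>`, which would explain why a pattern beginning with ` ^>` never matches.

```shell
#!/bin/sh
# Sketch: rename FASTA headers through a two-column key/value file while
# keeping the trailing ".N" suffix. rename_headers is a hypothetical helper.
rename_headers() {
    # $1 = key/value file (whitespace-separated), $2 = FASTA file
    awk '
        NR == FNR { map[$1] = $2; next }      # first file: load key -> value
        /^>/ {
            name = substr($0, 2)              # header text without ">"
            dot = match(name, /\.[0-9]+$/)    # start of the trailing ".N"
            if (dot > 0) {
                key = substr(name, 1, dot - 1)
                suffix = substr(name, dot)
                if (key in map) { print ">" map[key] suffix; next }
            }
        }
        { print }                             # sequences and unmatched headers
    ' "$1" "$2"
}

# Recreate the sample input from this thread in a temporary directory.
tmp=$(mktemp -d)
printf '123456789   123456789.25/12/2019\n222221122   222221122.03/03/2020\n' > "$tmp/ids.txt"
printf '>123456789.1\nAGCT\n>123456789.2\nAGCT\n>222221122.1\nAGCT\n' > "$tmp/test.fa"

renamed=$(rename_headers "$tmp/ids.txt" "$tmp/test.fa")
printf '%s\n' "$renamed"
rm -rf "$tmp"
```

Headers whose key is missing from ids.txt are printed unchanged, so a typo in the key column shows up as an untouched header rather than a dropped record.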
16 months ago
Mark ★ 1.0k

Hey,

I know I'm 6 months too late, but I came across this issue. I think it's because your pattern is returning multiple hits per query.

If you perform this command you'll see what I mean:

% grep "123456789" test.fasta
>123456789.1
>123456789.2

It's not stated explicitly, but the replace with the -k flag needs a 1:1 match. While the ids.txt is 1:1 the query result is not (multiple results are returned). Your ids.txt file needs to look like this:

123456789.1   123456789.1.25/12/2019
123456789.2   123456789.2.25/12/2019
123456789.3   123456789.3.25/12/2019
123456789.4   123456789.4.25/12/2019
....
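
If you'd rather not type that file by hand, it can be generated from the original two-column ids.txt. Here is a rough sketch, assuming the suffixes run from .1 to .8 as described in the question and that the value column has the form `ID.DATE`; `expand_ids` is just a name I made up, and the output is tab-separated.

```shell
#!/bin/sh
# Sketch: expand each "KEY  ID.DATE" line into one line per suffix 1..8,
# e.g. "KEY.1<TAB>ID.1.DATE". expand_ids is a hypothetical helper.
expand_ids() {
    # $1 = original two-column ids file
    awk '{
        p = index($2, ".")            # first "." separates ID from DATE
        if (p == 0) next              # skip lines without the expected form
        id = substr($2, 1, p - 1)
        rest = substr($2, p)          # ".DATE", including the leading dot
        for (i = 1; i <= 8; i++)
            printf "%s.%d\t%s.%d%s\n", $1, i, id, i, rest
    }' "$1"
}

# Recreate the sample ids.txt from the question in a temporary directory.
tmp=$(mktemp -d)
printf '123456789   123456789.25/12/2019\n222221122   222221122.03/03/2020\n' > "$tmp/ids.txt"

expanded=$(expand_ids "$tmp/ids.txt")
printf '%s\n' "$expanded" | head -n 3
rm -rf "$tmp"
```

With a file like this, the pattern has to capture the whole header ID, suffix included, so that it matches the first column.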

Note that the new names are unique, while in your example, if replace had worked, you'd end up with duplicate header names (which is a problem!). Finally, I suggest you replace the / delimiter in the dates with -. It will make your life easier down the track!
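
Both points are easy to check on your own files. The snippet below is a small sketch using only grep, sort, uniq and sed; `dup_headers` and `dash_dates` are made-up helper names, and renamed.fa is a stand-in file holding the headers from the desired output in the question.

```shell
#!/bin/sh
# Sketch: list duplicated FASTA headers and swap "/" for "-" in an ids file.
# dup_headers and dash_dates are hypothetical helpers.
dup_headers() {
    # Prints each header that occurs more than once; prints nothing if all are unique.
    grep '^>' "$1" | sort | uniq -d
}
dash_dates() {
    # Prints the file with every "/" replaced by "-".
    sed 's|/|-|g' "$1"
}

# Stand-in sample data in a temporary directory.
tmp=$(mktemp -d)
printf '>123456789.25/12/2019.1\nAGCT\n>123456789.25/12/2019.2\nAGCT\n>222221122.03/03/2020.1\nAGCT\n' > "$tmp/renamed.fa"
printf '123456789   123456789.25/12/2019\n222221122   222221122.03/03/2020\n' > "$tmp/ids.txt"

dups=$(dup_headers "$tmp/renamed.fa")
dashed=$(dash_dates "$tmp/ids.txt")
if [ -z "$dups" ]; then echo "no duplicate headers"; else printf '%s\n' "$dups"; fi
printf '%s\n' "$dashed"
rm -rf "$tmp"
```

An empty result from `dup_headers` means every header in the file is unique.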

Good luck.
