Question: (Closed) Rename fasta-header based on a list
27 days ago, genomes_and_MGEs wrote:

Hey everyone,

I have a multi-fasta file like this:

>NZ_CP023010.1_3129429..3372047
atattgagctaa..
>NZ_MRWY01000004.1_16177..110237
tcagtcgactcct...
...

And a list of fasta-headers like this:

>NZ_CP023010_Elizabethkingia anophelis FDAARGOS_198
>NZ_MRWY01000004_Klebsiella michiganensis_CAV1755
...

I would like to write a script that renames the headers in my multi-fasta file like this:

>NZ_CP023010_Elizabethkingia anophelis FDAARGOS_198_3129429..3372047
atattgagctaa..
>NZ_MRWY01000004_Klebsiella michiganensis_CAV1755_16177..110237
tcagtcgactcct...
...

Could you help me out? Thanks!


Hello genomes_and_MGEs!

You have already asked this question a couple of times before, with minor variations. Please don't open new threads with similar questions. We are here to help, but we like to see some effort on your part to solve questions. If you get stuck, show what you have tried and someone will step in to help.

Outputting strain name on prokka annotation
Add strain name directly on fasta header

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

written 27 days ago by genomax

Hey, sorry for insisting on this, but I still haven't solved this problem... So, I'm trying to optimize the seqkit replace command to replace the fasta-headers. Can you help me out?

written 27 days ago by genomes_and_MGEs

This works only for the given sample data.

$ sed 's/^>//' headers.txt | perl -pe 's/(\w+_\w+)_/$1\t/' > headers.tsv

$ cat headers.tsv 
NZ_CP023010     Elizabethkingia anophelis FDAARGOS_198
NZ_MRWY01000004 Klebsiella michiganensis_CAV1755

$ seqkit replace -p '^(.+?)\..+_' -k headers.tsv -r '{kv}_' seqs.fa 
>Elizabethkingia anophelis FDAARGOS_198_3129429..3372047
atattgagctaa..
>Klebsiella michiganensis_CAV1755_16177..110237
tcagtcgactcct...

written 27 days ago by shenwei356

This looks great, but I also need the accession number on the fasta header, such as

>Elizabethkingia anophelis_FDAARGOS_198_NZ_CP023010_3129429..3372047
atattgagctaa..
>Klebsiella michiganensis_CAV1755_NZ_MRWY01000004_16177..110237
tcagtcgactcct...

Can you help me optimize this?

written 27 days ago by genomes_and_MGEs
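
For the follow-up above, one option that needs nothing beyond awk is to look up each accession in headers.tsv and rebuild the header as name, accession, coordinates. This is only a sketch and was not posted in the thread. It assumes the two-column, tab-separated headers.tsv produced earlier, and that every header in seqs.fa has the form accession.version_coordinates. The output file name renamed.fa and the temporary working directory are made up for the example. The name is written exactly as it appears in headers.tsv, so the space in "anophelis FDAARGOS_198" is kept.

```shell
#!/bin/sh
# Sample inputs copied from the thread (the sequence lines are truncated there).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

printf '%s\n' \
  '>NZ_CP023010.1_3129429..3372047' 'atattgagctaa..' \
  '>NZ_MRWY01000004.1_16177..110237' 'tcagtcgactcct...' > seqs.fa

printf 'NZ_CP023010\tElizabethkingia anophelis FDAARGOS_198\n' > headers.tsv
printf 'NZ_MRWY01000004\tKlebsiella michiganensis_CAV1755\n' >> headers.tsv

awk -F '\t' '
  NR == FNR { name[$1] = $2; next }      # first file: accession -> name
  /^>/ {
    header = substr($0, 2)               # header without the leading ">"
    dot = index(header, ".")             # the accession ends at the first dot
    acc = substr(header, 1, dot - 1)
    rest = substr(header, dot + 1)
    sub(/^[^_]*_/, "", rest)             # drop the version ("1_"), keep coordinates
    if (acc in name) { print ">" name[acc] "_" acc "_" rest; next }
  }
  { print }                              # sequence lines and unmatched headers
' headers.tsv seqs.fa > renamed.fa

cat renamed.fa
```

Headers whose accession is missing from headers.tsv are passed through unchanged, so it is worth comparing the old and new files before replacing anything.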

Depending on the purpose of this fasta header reformatting, I advise you to think it through before doing it. The examples you give here are bound to cause trouble when you process this fasta file.

Keep in mind that in fasta format the first part of the header (up to the first space) is the unique(!!) identifier. In your current example it will likely not be unique, since I assume there may be other sequences from the same genus. The given example will, for instance, certainly mess up any blastDB formatting of that fasta file.

written 27 days ago by lieven.sterck
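
The uniqueness concern above can be checked before building any database. The sketch below lists identifiers (header text up to the first whitespace) that occur more than once. The helper name dup_ids is made up for this example. The first two demo records are the seqkit output shown earlier in the thread; the third record is hypothetical and was added only to force a collision.

```shell
#!/bin/sh
# Print every identifier (header text up to the first whitespace) that
# occurs more than once in a FASTA file.
dup_ids() {
  grep '^>' "$1" | sed 's/^>//' | awk '{ print $1 }' | sort | uniq -d
}

demo=$(mktemp)
# Records 1 and 2: renamed headers from the thread.
# Record 3: HYPOTHETICAL second region of the same genome (invented values).
printf '%s\n' \
  '>Elizabethkingia anophelis FDAARGOS_198_3129429..3372047' 'atattgagctaa..' \
  '>Klebsiella michiganensis_CAV1755_16177..110237' 'tcagtcgactcct...' \
  '>Elizabethkingia anophelis FDAARGOS_198_1..100' 'acgt' > "$demo"

dup_ids "$demo"          # lists Elizabethkingia, which is shared by two records

# One possible workaround: replace spaces in header lines with underscores,
# so that the whole header becomes the identifier.
sed '/^>/ s/ /_/g' "$demo" > "$demo.fixed"
dup_ids "$demo.fixed"    # prints nothing once the identifiers are unique
```

Replacing spaces is only one option. Putting the accession and coordinates first and the organism name after a space would also keep the identifier unique.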