Question: Rename several fasta-headers
6 weeks ago, genomes_and_MGEs wrote:

Hey guys,

I have a multi-fasta file containing several extracted regions, such as

>NZ_KI973281.1_1234..56789
atattgagctaaaaaaatcagttttccca...
>NZ_LAAL01000032.1_5456..32476
tgcagaagtaagggggtaacaccatgcct...
...

I would like to include the strain name in each FASTA header, such as

>Enterobacter_sp._MGH_6_NZ_KI973281.1_1234..56789
atattgagctaaaaaaatcagttttccca...
>Enterobacter_hormaechei_subsp._xiangfangensis_strain_34984_NZ_LAAL01000032.1_5456..32476
tgcagaagtaagggggtaacaccatgcct...
...

Could you please help me out? Thanks!

assembly genome • 218 views
modified 6 weeks ago by Pierre Lindenbaum • written 6 weeks ago by genomes_and_MGEs

See the following post: Renaming Entries In A Fasta File, and the many related posts listed in its right-hand panel, such as Rename fasta headers; How to move the last 4 characters of all FASTA headers to the beginning?; Renaming fasta file headers; etc.

Many awk and sed scripts are mentioned in those posts; they may give you some hints.

modified 6 weeks ago • written 6 weeks ago by natasha.sernova

Where are the strain names coming from? A separate file/NCBI search?

written 6 weeks ago by genomax

From a simple NCBI search! I don't have a separate file with the corresponding strain name for each accession... And the suggested links don't help me with this issue. Can you help me out? Thanks!

written 5 weeks ago by genomes_and_MGEs

The following will get you part way there.

Step 1: Look up the names of the organisms for the accessions in your FASTA headers. (The following works with the small example snippet above, saved in a file called test.)

awk -F '>|_' '/^>/ {print $2"_"$3}' test | xargs -n 1 sh -c 'efetch -db nuccore -id "$0" -format docsum | xtract -pattern DocumentSummary -element Caption,Organism' > names.txt

names.txt now contains the names of the organisms.
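
The awk stage in Step 1 splits each header on ">" and "_", so it assumes accessions with exactly one underscore (such as the NZ_ prefix). As a minimal sketch, assuming every header ends in _start..end as in the example, the accession can instead be taken by stripping that suffix, which also copes with accessions that contain no underscore. The function name list_accessions is hypothetical, and sort -u is added only to avoid looking up the same accession twice.

```shell
# list_accessions FASTA
# Print each unique accession.version found in the FASTA headers.
# Assumes headers look like ">ACCESSION.VERSION_START..END".
list_accessions() {
    awk '/^>/ {
        acc = substr($0, 2)                  # drop the leading ">"
        sub(/_[0-9]+\.\.[0-9]+$/, "", acc)   # drop the trailing "_start..end"
        print acc
    }' "$1" | sort -u
}
```

Its output could then be piped into the same xargs/efetch/xtract stage shown in Step 1, for example list_accessions test | xargs -n 1 sh -c '...' > names.txt.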

Step 2: Use one of the solutions in Renaming fasta headers according to a matching name list to do the replacements. There is a small issue, though: names.txt does not contain the version number of each accession, so the solutions may need to be adapted to suit your needs.
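
One way Step 2 could look, as a sketch rather than one of the solutions from the linked post: it assumes names.txt is tab-separated (accession without version, then organism name), that headers have the form >ACCESSION.VERSION_start..end, and that spaces in the organism name should become underscores. The function name rename_headers is hypothetical. Whether the Organism field includes the strain depends on the NCBI record, so the result may not match the desired headers exactly.

```shell
# rename_headers NAMES FASTA
# Prefix each FASTA header with the organism name found in NAMES.
rename_headers() {
    awk -F '\t' '
        # First file: accession (no version) <TAB> organism name
        FILENAME == ARGV[1] {
            name = $2
            gsub(/ /, "_", name)             # spaces -> underscores
            names[$1] = name
            next
        }
        # Second file: header lines
        /^>/ {
            header = substr($0, 2)           # drop the leading ">"
            acc = header
            sub(/\.[0-9]+_.*$/, "", acc)     # drop ".version_start..end"
            if (acc in names)
                print ">" names[acc] "_" header
            else
                print                        # no match: keep header unchanged
            next
        }
        # Sequence lines pass through untouched
        { print }
    ' "$1" "$2"
}
```

With the example file from Step 1 this would be called as rename_headers names.txt test > renamed.fasta. Stripping the version from the header before the lookup sidesteps the missing version number in names.txt.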

modified 5 weeks ago • written 5 weeks ago by genomax