Question: Manipulating FASTA headers
3.0 years ago by
United States
zgayk wrote:

I am sorry if this has been asked before, but I have a genome assembly file that I just converted from .bam to FASTA format in order to start annotation. I would like to run CEGMA on this assembly because I have concerns about its quality, but the default header format produced when the FASTA file was created is not acceptable: there are 5237924 sequences with FASTA headers that either contain only digits or have just digits followed by a space. E.g.


>22 |

>333 xyz

I need headers that contain no spaces and include non-numeric characters (letters), as the current headers don't work with BLAST. Ideally, I would like to simply name each sequence "scaffold" followed by a numeric identifier, so that each header is named scaffoldn, where n is the number of that scaffold in the entire assembly. My coding experience is very limited, so any suggestions you might have would be very helpful.




3.0 years ago by
Walnut Creek, USA
Brian Bushnell wrote:

With the BBMap package: rename.sh in=file.fasta out=renamed.fasta prefix=scaffold

They will be named "scaffold_0", "scaffold_1", etc.
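
If BBMap is not available, the same renaming can be approximated in a few lines of Python. This is only a minimal sketch and not part of BBMap: the function names rename_fasta_headers and rename_fasta_file are hypothetical, and it assumes numbering from 0 with an underscore separator to mirror the names above. The original header text is discarded, so keep the input file if you need to map names back later.

```python
def rename_fasta_headers(lines, prefix="scaffold"):
    """Yield FASTA lines, replacing each header with '>prefix_N'.

    N starts at 0 and increases by one per header line.
    Sequence lines are passed through unchanged.
    """
    count = 0
    for line in lines:
        if line.startswith(">"):
            # Drop the old header entirely and write the new name.
            yield f">{prefix}_{count}\n"
            count += 1
        else:
            yield line


def rename_fasta_file(in_path, out_path, prefix="scaffold"):
    """Stream in_path to out_path so a large assembly is never held in memory."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(rename_fasta_headers(src, prefix))
```

For example, rename_fasta_file("file.fasta", "renamed.fasta") would write the renamed copy next to the input, using the file names from the command above.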


Thank you very much Brian. It worked easily.

written 3.0 years ago by zgayk