Manipulating Fasta headers
9.0 years ago
zgayk ▴ 90

I am sorry if this has been asked before, but I have a genome assembly file that I just converted from .bam to FASTA format in order to start annotation. I would like to run CEGMA on this assembly because I have concerns about its quality, but the default header format produced when the FASTA was created is not acceptable: there are 5237924 sequences with FASTA headers that either contain only digits or have digits followed by a space. E.g.

>1
>22 |
>333 xyz

I need headers that have no spaces and that also contain non-numeric characters (letters), as the current headers don't work with BLAST. Ideally I would like to simply name each sequence "scaffold" followed by a number identifying the scaffold, so that each header would be scaffoldn, where n is the number of that scaffold in the entire assembly. My coding experience is very limited, so any suggestions you might have would be very helpful.

Thanks,
Zach

Assembly • 3.0k views
9.0 years ago

With the BBMap package:

bbrename.sh in=file.fasta out=renamed.fasta prefix=scaffold

They will be named "scaffold_0", "scaffold_1", etc.
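
If BBMap is not at hand, roughly the same renaming can be sketched with plain awk. This is a minimal sketch, not a reimplementation of bbrename.sh: it assumes every header line starts with ">" and that the original names can be discarded. The small file.fasta written by printf is made-up demo data that mimics the headers in the question; on a real assembly, skip that line and point awk at your own file.

```shell
# Made-up demo input mimicking the headers from the question (assumption, not real data).
printf '>1\nACGT\n>22 |\nGGCC\nTTAA\n>333 xyz\nAACC\n' > file.fasta

# Replace every header line with ">scaffold_<n>", counting from 0;
# sequence lines are printed unchanged.
awk '/^>/ { print ">scaffold_" (n++); next } { print }' file.fasta > renamed.fasta

# List the new headers as a sanity check.
grep '^>' renamed.fasta
```

Because the old names are dropped, it may be worth keeping the original file around in case the old and new identifiers ever need to be matched up again.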

Thank you very much Brian. It worked easily.
