Hi, I’ve downloaded several assemblies from RefSeq and will be generating a custom database using the makeblastdb command. The file headers need to be in a specific format to subsequently fetch individual sequences from blastn_out search results using blastdbcmd.
The headers for all FASTA and multi-FASTA files need to be formatted something like this:
>gnl|uniqID|seq1
The assembly files downloaded from the NCBI assembly site have headers like this:
>NZ_BP006234.1 Organism name strain AC2-110 genome
>NZ_BP006234.1 Organism name strain AC2-110 plasmid pxxx5, complete sequence
or like this:
>NZ_BB45345435.1 Organism name 73645 n_819_l_244_c_44.200821, whole genome shotgun sequence
>NZ_BB45345435.1 Organism name 73645 n_773_l_201_c_51.631840, whole genome shotgun sequence
They should look like this:
>NZ_BP006234.1|seq1
>NZ_BP006234.1|seq2
>NZ_BP006234.1|seq3
I think sed can be used here, but I don't know what expression to give it so that it removes the description and keeps only the accession part of each header, with a sequence number appended. Thank you for helping.
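For illustration, here is a minimal sketch of the transformation described above. It uses awk rather than sed, because numbering records is awkward in sed alone. It assumes every header starts with ">" immediately followed by the accession, and that the description is separated from it by whitespace. The function name rename_headers and the file names in the usage note are hypothetical, and whether makeblastdb and blastdbcmd accept the resulting IDs has not been tested here.

```shell
# rename_headers (hypothetical helper): rewrite each FASTA header as
# ">ACCESSION|seqN", where N restarts at 1 for every input file.
rename_headers() {
    awk '
        FNR == 1 { n = 0 }              # reset the counter at the start of each file
        /^>/ {
            n++
            print $1 "|seq" n           # $1 is ">ACCESSION"; the description is dropped
            next
        }
        { print }                       # sequence lines pass through unchanged
    ' "$@"
}
```

It could be run as: rename_headers assembly.fna > assembly.renamed.fna. The counter is kept per file, so two records that share an accession within one file still get distinct IDs (seq1, seq2, ...).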