4.5 years ago
tasmina.fm
We are assembling and annotating bacterial whole-genome sequences on the Linux command line (SOAPdenovo2 and some related software). After assembly and annotation, we found unwanted N characters within the FASTA file. For this reason, we are not able to analyze it. Could you please help us remove them from the whole-genome FASTA file?
For this reason, we are not able to analyze it
=> Why not? N stands for an ambiguous (unknown) base, which typically arises in repetitive or difficult-to-sequence regions. Most genome assemblies contain them to some extent. The human reference (GRCh38) has more than 150 million of them. Is this a short-read assembly?
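Before deciding what to do about the N characters, it can help to see how many there are and where they sit. Below is a minimal Python sketch, assuming a plain (uncompressed) FASTA file; the function names `read_fasta`, `n_runs` and `summarize` are made up for this illustration and do not come from any assembler or library.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (record_id, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first word of the header as the record id.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n_runs(seq: str) -> List[Tuple[int, int]]:
    """Return 0-based, end-exclusive (start, end) spans of consecutive N/n."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", seq)]


def summarize(lines: Iterable[str]) -> Dict[str, dict]:
    """Per record: total length, number of N bases, and the N spans."""
    report = {}
    for name, seq in read_fasta(lines):
        runs = n_runs(seq)
        report[name] = {
            "length": len(seq),
            "n_count": sum(end - start for start, end in runs),
            "runs": runs,
        }
    return report
```

To run it on a real file (the file name here is a placeholder): `with open("assembly.fasta") as fh: print(summarize(fh))`. A few long runs of N usually point to scaffold gaps rather than scattered ambiguous base calls.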
While you may not want the N characters, they likely signify that your bacterial genome assembly is not complete. This is not unusual, especially if you are using only short-read data. You may need to investigate and redo the assembly (repeat regions and over-sequencing can cause issues) or add long-read coverage (e.g. PacBio/Nanopore) to truly complete the assembly.
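If a downstream tool really cannot handle N and redoing the assembly is not an option right now, one possible workaround is to break each scaffold into contigs at the N runs instead of simply deleting the characters (deleting them would join sequences that are not known to be adjacent). The sketch below is only an illustration under that assumption; `split_at_gaps`, `min_gap` and `min_len` are hypothetical names, and it does not close any gaps. Coordinates change after splitting, so any existing annotation would have to be redone on the split sequences.

```python
import re
from typing import List, Tuple


def split_at_gaps(name: str, seq: str, min_gap: int = 1,
                  min_len: int = 1) -> List[Tuple[str, str]]:
    """Split one scaffold into contigs at runs of N/n.

    min_gap: shortest run of N treated as a gap; shorter runs stay in place.
    min_len: contigs shorter than this are dropped.
    """
    if min_gap < 1 or min_len < 1:
        raise ValueError("min_gap and min_len must be >= 1")
    gap = re.compile(r"[Nn]{%d,}" % min_gap)
    # Leading or trailing gaps produce empty pieces; min_len >= 1 drops them.
    pieces = [p for p in gap.split(seq) if len(p) >= min_len]
    return [(f"{name}_contig{i}", p) for i, p in enumerate(pieces, start=1)]


def to_fasta(records: List[Tuple[str, str]], width: int = 60) -> str:
    """Format (id, sequence) pairs as FASTA text with wrapped lines."""
    out = []
    for rec_id, seq in records:
        out.append(f">{rec_id}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"
```

For example, `split_at_gaps("scaffold1", "ACGTNNNNACGT")` gives two contigs, `scaffold1_contig1` and `scaffold1_contig2`, each with the sequence `ACGT`.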