Removal of unwanted character "N" from assembly and annotated file of whole genome sequence


4.5 years ago

tasmina.fm • 0

We are assembling and annotating a bacterial whole genome sequence on Linux (SOAPdenovo2 and some related software). After assembly and annotation, we got unwanted character N within the fasta file. For this reason, we are not able to analyze it. Could you please tell us how to remove it from the whole genome sequence FASTA file?

assembly genome next-gen • 677 views

ADD COMMENT • link updated 4.3 years ago by Biostar 20 • written 4.5 years ago by tasmina.fm • 0


"For this reason, we are not able to analyze it" => Why not?

N denotes an ambiguous base (the nucleotide at that position is unknown), which can arise from repetitive or difficult-to-sequence regions. Most genome assemblies contain them to some extent. The human reference (GRCh38) has more than 150 million of them. Is this a short-read assembly?
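Before deciding what to do about the Ns, it may help to see how many there are and which sequences carry them. The following is only a minimal sketch in plain Python, not something taken from the original pipeline: the helper names `read_fasta` and `n_content` and the toy records are made up for illustration, and you would pass your own opened FASTA file in place of the in-memory example.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def n_content(handle):
    """Return {header: (n_count, length)} for every record in the handle."""
    stats = {}
    for header, seq in read_fasta(handle):
        stats[header] = (seq.upper().count("N"), len(seq))
    return stats

# Toy input (hypothetical); replace with open("your_assembly.fasta").
example = io.StringIO(">scaffold1\nACGTNNNNAC\nGTNN\n>scaffold2\nACGT\n")
for name, (n_count, length) in n_content(example).items():
    print(f"{name}\t{n_count}\t{length}")
```

If only a small fraction of the assembly is N, most downstream tools should cope with it without any removal.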

ADD REPLY • link 4.5 years ago by ATpoint 81k


we got unwanted character N within the fasta file

While you may not want the Ns, they likely signify that your bacterial genome assembly is not complete. This is not unusual, especially if you are only using short-read data. You may need to investigate and re-do the assembly (repeat regions and over-sequencing can cause issues) or add long-read coverage (e.g. PacBio/Nanopore) to truly complete the assembly.
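As a starting point for that investigation, one could list where the runs of N sit in each scaffold, since in scaffolded short-read assemblies such runs typically mark gaps between contigs. This is a rough sketch only: the function name `n_runs` and the example sequence are hypothetical, and it works on a single sequence string that you have already read from your FASTA file.

```python
import re

def n_runs(sequence):
    """Return (start, end, length) for each run of N in the sequence.

    Coordinates are 0-based and the end is exclusive.
    """
    return [
        (match.start(), match.end(), match.end() - match.start())
        for match in re.finditer(r"[Nn]+", sequence)
    ]

# Toy sequence (hypothetical) with two gaps.
print(n_runs("ACGTNNNNACGTNNAC"))  # [(4, 8, 4), (12, 14, 2)]
```

Many short gaps scattered across scaffolds would point to an assembly that needs better parameters or extra data, rather than to characters that should simply be deleted.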

ADD REPLY • link 4.5 years ago by GenoMax 141k
