Question

Number of contigs vs. N gap % in WGS

9 months ago

Pegasus ▴ 110

Hi all,

I have two FASTA files containing whole-genome assemblies of the same bacterial genome:

- Genome 1: generated by SPAdes
- Genome 2: generated by SPAdes, followed by an extra scaffolding step using Multi-CSAR, which compares the assembly to 5 closely related genomes

The QUAST statistics are as follows:

Genome 1: N50 = 2133845, N75 = 1046574, # contigs = 12, N's per 100 kbp = 0.0

Genome 2: N50 = 4052556, N75 = 4052556, # contigs = 2, N's per 100 kbp = 51.80
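For reference, here is a minimal Python sketch of how metrics like these can be computed from a FASTA file. The function names (`read_fasta`, `nx`, `ns_per_100kbp`) are hypothetical helpers written for illustration, not part of QUAST. QUAST applies its own minimum contig length filter by default, so its reported values may differ slightly from this sketch.

```python
from typing import Dict, List


def read_fasta(path: str) -> Dict[str, str]:
    """Parse a FASTA file into {record name: sequence}."""
    records: Dict[str, List[str]] = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the record name.
                name = line[1:].split()[0]
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def nx(lengths: List[int], fraction: float = 0.5) -> int:
    """Return the length L such that records of length >= L
    together cover at least `fraction` of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    return 0


def ns_per_100kbp(sequences: List[str]) -> float:
    """Average number of N characters per 100,000 bases."""
    total = sum(len(seq) for seq in sequences)
    n_count = sum(seq.upper().count("N") for seq in sequences)
    return 100000 * n_count / total if total else 0.0
```

Assuming an assembly file at a hypothetical path such as `assembly.fasta`, you would call `seqs = list(read_fasta("assembly.fasta").values())`, then `nx([len(s) for s in seqs], 0.5)` for N50, `nx(..., 0.75)` for N75, and `ns_per_100kbp(seqs)` for the N density.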

It is clear that the scaffolder reduced the number of contigs by joining them with runs of N's.

Runs of N's are gaps: they mark sequence that is missing from the assembly. However, scaffolding does help predict the order of the contigs, so the scaffolded version may be preferable for some tasks, such as circular visualization.

If this FASTA file will be used in the next annotation step and in downstream analyses such as antiSMASH secondary metabolite prediction, RNA-seq, etc., which file do you recommend keeping and using?

Could you please bump this question, GenoMax?

Thank you in advance

WGS • 812 views

updated 8 months ago by GenoMax 147k • written 9 months ago by Pegasus ▴ 110

It looks like you copy-pasted the content from here to a new post (without going into edit mode, I might add, losing all formatting in the process) and deleted this post. Did you do that to get your post on the front page again?

If you did, and you wish to do something like this the next time, please tag a moderator and request a bump. I've un-deleted this post and given it a bump. I've also deleted your newer post.

9 months ago by Ram 44k

Sorry about that. I have modified the post and requested a bump.

8 months ago by Pegasus ▴ 110

Curious as to why the contig number is important. If you are not planning to do any additional sequencing, then the data is essentially static at this point. Having the N's gives you an idea of which parts of the genome are likely missing from your data.

N's are not going to contribute to information content of downstream analyses.
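To see where those gaps sit, something like the following Python sketch could be used. It reports the coordinates of each run of N's in a scaffold and splits the scaffold back into its gap-free pieces. The names `n_runs` and `split_at_gaps` are hypothetical helpers for illustration only, and the `min_len` threshold is an assumption you would tune to your scaffolder's gap size.

```python
import re
from typing import Iterator, List, Tuple


def n_runs(sequence: str, min_len: int = 1) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) for each run of N's, as 0-based half-open
    coordinates, keeping only runs of at least `min_len` bases."""
    for match in re.finditer(r"[Nn]+", sequence):
        if match.end() - match.start() >= min_len:
            yield match.start(), match.end()


def split_at_gaps(sequence: str, min_len: int = 1) -> List[str]:
    """Split a scaffold into the pieces that lie between N runs.
    Runs shorter than `min_len` are left inside the pieces."""
    pieces: List[str] = []
    position = 0
    for start, end in n_runs(sequence, min_len):
        if start > position:
            pieces.append(sequence[position:start])
        position = end
    if position < len(sequence):
        pieces.append(sequence[position:])
    return pieces
```

For example, `list(n_runs("ACGTNNNNACGTNNAC"))` gives the gap coordinates, and `split_at_gaps` on the same string returns the flanking gap-free pieces. Comparing gap coordinates against annotated features is one way to check whether a gene cluster of interest spans a gap.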

8 months ago by GenoMax 147k