Question

What is the normal N50 range when assembling a virus genome?

5.9 years ago

FAST_GENOME ▴ 60

Dear all, I just finished my very first virus assembly with SPAdes. Below are the QUAST results and the command lines I used.

My questions are:

1. What is a normal N50? My N50 is 742; is this very low?
2. How should I choose the best k-mer length?

Thanks a lot.

Result

contigs                      478
contigs (>= 0 bp)            11005
contigs (>= 1000 bp)         78
contigs (>= 5000 bp)         1
contigs (>= 10000 bp)        1
contigs (>= 25000 bp)        0
contigs (>= 50000 bp)        0
Largest contig               14128
Total length                 390242
Total length (>= 0 bp)       3584698
Total length (>= 1000 bp)    138381
Total length (>= 5000 bp)    14128
Total length (>= 10000 bp)   14128
Total length (>= 25000 bp)   0
Total length (>= 50000 bp)   0
N50                          742
N75                          580
L50                          145
L75                          296
GC (%)                       50.03
Mismatches
N's                          0
N's per 100 kbp              0

Command line

$bbduk in=$r1 in2=$r2 out=trimmed.fq ktrim=r k=23 mink=11 hdist=1 ref=$bbduk_ref tbo tpe

$bbnorm in=trimmed.fq out=normalized.fq target=100 min=5

$spades -k 21,41,71,101,127 -o spades_out --12 trimmed.fq --careful

$quast spades_out/contigs.fasta -o quast_out_contigs -t 16 -l 4038-Roc

assembly n50 • 3.0k views

ADD COMMENT • link 5.9 years ago by FAST_GENOME ▴ 60

Hello archie.w.lee,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

ADD REPLY • link 5.9 years ago by lakhujanivijay 5.8k

Thank you very much!

ADD REPLY • link 5.9 years ago by FAST_GENOME ▴ 60

K-mer values depend on the average read length of the data. Have you tried running SPAdes without specifying k-mer values, letting it choose them based on the read length? What is the expected genome size?
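
Since the k-mer choice hinges on read length, a quick way to check the mean read length is sketched here. The function name `mean_read_length` is made up for illustration, and the sketch assumes plain 4-line FASTQ records (no wrapped sequence lines, not gzipped).

```shell
# mean_read_length: print the mean read length of a FASTQ stream on stdin.
# Assumption: every record is exactly 4 lines, so the sequence is line 2 of 4.
mean_read_length() {
  awk 'NR % 4 == 2 { n++; total += length($0) }
       END { if (n > 0) printf "%.1f\n", total / n }'
}

# Example usage (trimmed.fq is the file name used in the question):
# mean_read_length < trimmed.fq
```

Any k-mer passed to SPAdes with `-k` has to be odd and shorter than the reads, so a k of 127 only makes sense if the trimmed reads are clearly longer than that.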

ADD REPLY • link 5.9 years ago by Sej Modha 5.3k

Thanks. I am trying the following commands and will update the results when they finish. The genome is about 9 kb.

$spades -o spades_out --12 trimmed.fq --careful
and
$spades -o spades_out --12 normalized.fq --careful

ADD REPLY • link 5.9 years ago by FAST_GENOME ▴ 60

archie.w.lee : Actually you should try using tadpole.sh from BBMap. It is supposed to work very well with viral genome assemblies.

ADD REPLY • link 5.9 years ago by GenoMax 142k

Thanks, I will try and update the results.

ADD REPLY • link 5.9 years ago by FAST_GENOME ▴ 60

Dear all, I BLASTed my assembled contigs and all of them hit Homo sapiens mitochondrion. Is that host genome contamination? Thanks

ADD REPLY • link 5.9 years ago by FAST_GENOME ▴ 60

Seems likely, doesn't it?

ADD REPLY • link 5.9 years ago by Joe 21k

Please either use the ADD COMMENT button to ask this under the relevant answer / comment, or open a new question altogether. The Add your answer space should be reserved to answers to the top-level (original) question.

Is the host human? Then yes, it is host DNA (or RNA?) you are seeing. How did you perform DNA extraction? Did you enrich for virus particles somehow? How many contigs did you obtain from the assembly, and are really all of them from mitochondrial DNA?

ADD REPLY • link 5.9 years ago by h.mon 35k

Wow, how did you know the assembled contigs are all from human mitochondrial DNA? There are 478 contigs in total. The largest one is 14,128 bp, but the genome of the virus is only about 9 kb.

I BLASTed about 2,000 raw reads and 80-100 randomly picked assembled contigs; they all belong to human.

ADD REPLY • link 5.9 years ago by FAST_GENOME ▴ 60

BLAST all contigs. In my experience, just a few contigs, or even a single one, represent the viral genome; the rest is host contamination.

If you have a reference viral genome, you can use bbduk.sh to filter viral reads, or even bbsplit.sh with human and viral genomes to separate reads pertaining to each genome. You could then assemble using just the viral reads.

ADD REPLY • link 5.9 years ago by h.mon 35k

Thank you so much. I will do that. I just checked the bbduk manual; should I use the following command line? Can you please give me some pointers? The virus genome is about 9 kb.

$ bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq literal=ACGTACGTACGTACGTAC k=18 mm=f hdist=2

ADD REPLY • link 5.9 years ago by FAST_GENOME ▴ 60

Don't use literal=, use the virus genome as reference, with ref=virus.fa.
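
As a rough sketch of what that could look like: the wrapper name `filter_viral_reads`, the output file names, and the `k=31 hdist=1` settings below are illustrative assumptions, not tuned recommendations. `outm=` receives reads sharing k-mers with the reference and `out=` receives the rest. The `$bbduk` variable follows the convention in the question and falls back to `bbduk.sh` on the PATH.

```shell
# Path to the BBDuk launcher; override by setting $bbduk beforehand.
bbduk=${bbduk:-bbduk.sh}

# filter_viral_reads READS REF MATCHED UNMATCHED
# Reads that share k-mers with REF go to MATCHED (outm=), all others to
# UNMATCHED (out=). k=31 and hdist=1 are assumed values for illustration.
filter_viral_reads() {
  "$bbduk" in="$1" ref="$2" outm="$3" out="$4" k=31 hdist=1
}

# Example usage (file names are placeholders):
# filter_viral_reads trimmed.fq virus.fa viral.fq nonviral.fq
```

The matched file (here `viral.fq`) would then be the input for the assembler instead of the full read set.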

ADD REPLY • link 5.9 years ago by h.mon 35k

Thank you so much for your reply and the information.

ADD REPLY • link 5.9 years ago by FAST_GENOME ▴ 60

SPAdes runs several different k-mer lengths that it determines automatically from your sequence data. Let it do the hard work.

As for N50 length, that'll depend on your sequencing platform.

ADD REPLY • link 5.9 years ago by Joe 21k

score 0 · Answer 1 · 2018-06-13

For a viral genome, you don't want to focus on N50; you want to retrieve the full-length genome. Usually, viral reads are a lot more common than host reads, and as viral genomes are small, you will have plenty of coverage even with very low-depth sequencing. There will be some contigs representing the host genome, contaminants, or other artifacts, so N50 is not a good measure when you are interested in only one or a few contigs.
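
To make the point about N50 concrete, here is a minimal sketch that computes it from contig lengths. The helper names `fasta_lengths` and `n50` are made up for this example. N50 is taken as the length of the contig at which the running total of lengths, sorted from longest to shortest, first reaches half of the total assembly length.

```shell
# fasta_lengths: print the length of every sequence in a FASTA file (or stdin).
fasta_lengths() {
  awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
       { len += length($0) }
       END { if (seen) print len }' "$@"
}

# n50: read one contig length per line on stdin and print the N50.
n50() {
  sort -rn | awk '{ len[NR] = $1; total += $1 }
    END {
      for (i = 1; i <= NR; i++) {
        cum += len[i]
        # Stop at the first contig where the running sum covers half the total.
        if (cum * 2 >= total) { print len[i]; exit }
      }
    }'
}

# Example usage (path taken from the QUAST command in the question):
# fasta_lengths spades_out/contigs.fasta | n50
```

With one complete 9,000 bp viral contig plus a hundred 500 bp host fragments, this reports an N50 of 500 even though the genome of interest is fully assembled, which is why the statistic says little here.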

What I usually do after the initial assembly is BLAST the assembly to identify viral contigs. If the full-length genome is not recovered, I may try either to assemble all viral contigs with CAP3 or to map reads to the viral contigs and re-assemble. Another option is to extend viral contigs with Tadpole or Mapsembler, but I have never used this latter approach so far.
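
The BLAST step can be sketched as a small filter over tabular output (`-outfmt 6`), where column 1 is the query ID, column 3 the percent identity, and column 4 the alignment length. The function name `viral_contigs` and the default thresholds (90% identity, 200 aligned positions) are assumptions for illustration and should be adapted to the virus in question.

```shell
# viral_contigs [MIN_IDENTITY] [MIN_LENGTH]
# Read BLAST tabular output (-outfmt 6) on stdin and print the unique IDs of
# contigs with at least one hit meeting both thresholds.
# Defaults (90, 200) are illustrative assumptions, not recommendations.
viral_contigs() {
  awk -F '\t' -v id="${1:-90}" -v len="${2:-200}" \
    '$3 >= id && $4 >= len { print $1 }' | sort -u
}

# Example usage (virus.fa stands for a reference genome of the virus):
# blastn -query spades_out/contigs.fasta -subject virus.fa -outfmt 6 | viral_contigs
```

The resulting list of contig IDs is what would go into a second round of assembly or read mapping.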