Question: What is the normal N50 range when assembling a virus genome?

FAST_GENOME wrote (6 months ago):

Dear all, I just finished my very first virus assembly with SPAdes. The QUAST results and the command lines I used are below.

My questions are:

1. What is a normal N50? My N50 is 742; is this very low?
2. How should I choose the best k-mer length?

Thanks a lot.

Result (QUAST report)

# contigs                     478
# contigs (>= 0 bp)           11005
# contigs (>= 1000 bp)        78
# contigs (>= 5000 bp)        1
# contigs (>= 10000 bp)       1
# contigs (>= 25000 bp)       0
# contigs (>= 50000 bp)       0
Largest contig                14128
Total length                  390242
Total length (>= 0 bp)        3584698
Total length (>= 1000 bp)     138381
Total length (>= 5000 bp)     14128
Total length (>= 10000 bp)    14128
Total length (>= 25000 bp)    0
Total length (>= 50000 bp)    0
N50                           742
N75                           580
L50                           145
L75                           296
GC (%)                        50.03

Mismatches
# N's                         0
# N's per 100 kbp             0

Command line

$bbduk in=$r1 in2=$r2 out=trimmed.fq ktrim=r k=23 mink=11 hdist=1 ref=$bbduk_ref tbo tpe

$bbnorm in=trimmed.fq out=normalized.fq target=100 min=5

$spades -k 21,41,71,101,127 -o spades_out --12 trimmed.fq --careful

$quast spades_out/contigs.fasta -o quast_out_contigs -t 16 -l 4038-Roc

written 6 months ago by FAST_GENOME (modified 6 months ago)

Hello archie.w.lee,

Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time.

Thank you!

written 6 months ago by Vijay Lakhujani

Thank you very much!

written 6 months ago by FAST_GENOME

The k-mer values depend on the average read length of the data. Have you tried running SPAdes without specifying k, letting it choose the k-mer values based on the read length? What is the expected genome size?

written 6 months ago by Sej Modha
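
Editor's sketch: the point about k-mer values and read length can be checked before launching an assembly. The SPAdes manual requires the values given to -k to be odd, smaller than 128, and in ascending order; a k-mer also cannot be longer than the read it is taken from. The helper below is a hypothetical pre-flight check, not part of SPAdes, and the read lengths used in the example are assumptions because the thread never states the actual read length.

```python
def check_kmers(kmers, read_length):
    """Return a list of problems with a k-mer list (empty list = looks fine).

    Hypothetical helper for illustration; it only encodes the documented
    SPAdes constraints on -k plus the k < read length sanity check.
    """
    problems = []
    if list(kmers) != sorted(kmers):
        problems.append("k values must be listed in ascending order")
    for k in kmers:
        if k % 2 == 0:
            problems.append(f"k={k} is even; k must be odd")
        if k >= 128:
            problems.append(f"k={k} is too large; k must be below 128")
        if k >= read_length:
            problems.append(f"k={k} is not shorter than the reads ({read_length} bp)")
    return problems


# The k-mer list from the original command line, against two assumed read lengths.
kmers = [21, 41, 71, 101, 127]
print(check_kmers(kmers, read_length=150))  # no problems reported
print(check_kmers(kmers, read_length=100))  # k=101 and k=127 are flagged
```

With 100 bp reads the two largest values could never be used, which is one reason to let SPAdes pick k on its own first.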

Thanks. I am trying the following commands and will post the results when they finish. The genome is about 9 kb.

$spades -o spades_out_trimmed --12 trimmed.fq --careful

$spades -o spades_out_normalized --12 normalized.fq --careful

written 6 months ago by FAST_GENOME

archie.w.lee: Actually, you should try tadpole.sh from BBMap. It is supposed to work very well for viral genome assemblies.

written 6 months ago by genomax

Thanks, I will try and update the results.

written 6 months ago by FAST_GENOME

Dear all, I BLASTed my assembled contigs and they all match Homo sapiens mitochondrion. Is that host genome contamination? Thanks.

written 6 months ago by FAST_GENOME

Seems likely, doesn't it?

written 6 months ago by jrj.healey

Please either use the ADD COMMENT button to ask this under the relevant answer or comment, or open a new question altogether. The "Add your answer" space should be reserved for answers to the top-level (original) question.

Is the host human? Then yes, it is host DNA (or RNA?) you are seeing. How did you perform the DNA extraction? Did you enrich for virus particles somehow? How many contigs did you obtain from the assembly, and are all of them really from mitochondrial DNA?

written 6 months ago by h.mon

Wow, how did you know that the assembled contigs are all from human mitochondrial DNA? There are 478 contigs in total. The largest one is 14,128 bp, but the virus genome is only about 9 kb.

I BLASTed about 2,000 raw reads and 80-100 randomly picked assembled contigs, and they all belong to human.

written 5 months ago by FAST_GENOME

BLAST all contigs. In my experience, just a few contigs, or even a single one, represent the viral genome; the rest is host contamination.

If you have a reference viral genome, you can use bbduk.sh to filter viral reads, or even bbsplit.sh with the human and viral genomes to separate the reads belonging to each genome. You could then assemble using just the viral reads.

written 5 months ago by h.mon
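
Editor's sketch: the advice to BLAST all contigs and keep only the viral ones can be automated once BLAST has been run with tabular output (-outfmt 6, default 12 columns). The snippet below assumes that format; the function names, the contig names, the subject IDs (virus_ref, human_mito) and every number in the example rows are made up for illustration and are not data from this thread.

```python
def best_hits(blast_lines):
    """Map each query contig to its (subject, bitscore) with the highest bitscore.

    Assumes BLAST tabular output (-outfmt 6) with the default columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    best = {}
    for line in blast_lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        bitscore = float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return best


def viral_contigs(blast_lines, viral_subjects):
    """Return the sorted contigs whose best hit is one of the viral subject IDs."""
    hits = best_hits(blast_lines)
    return sorted(q for q, (subject, _) in hits.items() if subject in viral_subjects)


# Made-up example rows; "virus_ref" and "human_mito" are placeholder subject IDs.
example = [
    "NODE_1\thuman_mito\t99.9\t14128\t14\t0\t1\t14128\t1\t14128\t0.0\t26000",
    "NODE_7\tvirus_ref\t98.5\t8900\t130\t2\t1\t8900\t50\t8949\t0.0\t15800",
    "NODE_7\thuman_mito\t85.0\t120\t18\t0\t10\t129\t300\t419\t1e-20\t95.3",
    "NODE_9\thuman_mito\t99.0\t742\t7\t0\t1\t742\t5000\t5741\t0.0\t1330",
]
print(viral_contigs(example, {"virus_ref"}))  # ['NODE_7']
```

Picking the best hit by bitscore is a simplification; a contig with a short, weak viral hit would still need a manual look.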

Thank you so much. I will do that. I just checked the bbduk manual; should I use the following command line? Can you please give me some pointers? The virus genome is about 9 kb.

$ bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq literal=ACGTACGTACGTACGTAC k=18 mm=f hdist=2

written 5 months ago by FAST_GENOME

Don't use literal=; use the virus genome as the reference, with ref=virus.fa.

written 5 months ago by h.mon

Thank you so much for your reply and the information.

written 5 months ago by FAST_GENOME

SPAdes runs several different k-mer lengths, which it determines automatically from your sequence data. Let it do the hard work.

As for the N50 length, that will depend on your sequencing platform.

written 6 months ago by jrj.healey

h.mon wrote (6 months ago):

For a viral genome, you don't want to focus on N50; you want to retrieve the full-length genome. Usually, viral reads are a lot more common than host reads, and because viral genomes are small, you will have plenty of coverage even with very low-depth sequencing. There will be some contigs representing the host genome, contaminants, or other artifacts, so N50 is not a good measure when you are interested in only one or a few contigs.

What I usually do after the initial assembly is BLAST the assembly to identify viral contigs. If the full-length genome is not recovered, I may try either to assemble all viral contigs with CAP3, or to map reads to the viral contigs and re-assemble. Another option is to extend viral contigs with Tadpole or Mapsembler, but I have never used this latter approach so far.

written 6 months ago by h.mon
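
Editor's sketch: a small calculation shows why N50 says little when only one contig matters. N50 is the length of the contig at which the running total, taken from the longest contig down, first reaches half of the total assembly length; L50 is the number of contigs needed to get there. The contig lengths below are invented for illustration (one complete 9 kb viral contig plus 40 short host-derived contigs) and are not the numbers from the question.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a list of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half:
            return length, count
    raise ValueError("no contigs given")


# Invented example: the viral genome is fully assembled in one contig,
# but short contigs from host reads dominate the total length.
with_host = [9000] + [500] * 40
print(n50_l50(with_host))  # (500, 12)

# The same viral contig on its own.
print(n50_l50([9000]))     # (9000, 1)
```

In both cases the viral genome is recovered equally well; only the amount of short host sequence differs, and that alone moves N50 from 9000 to 500.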

Thanks. Let me BLAST the assembly first. A quick question about "map reads to viral contigs and re-assemble": which tools do you recommend, and how should I evaluate the results?

written 6 months ago by FAST_GENOME