Question

What could be the rage for Average Coverage after Assembly

0

Entering edit mode

10 months ago

mail2steff ▴ 70

Dear all,

I performed viral genome assembly using metaviral spade and I wanted to map my reads back to contigs to know the coverage. I used bbmap for this purpose and got the following output:

NOTE:   Deleting contents of ref/genome/1 because reference is specified and overwrite=true
NOTE:   Deleting contents of ref/index/1 because reference is specified and overwrite=true
Writing reference.
Executing dna.FastaToChromArrays2 [Contigs/scaffolds_efa12_57550.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]

Set genScaffoldInfo=true
Writing chunk 1
Set genome to 1

Loaded Reference:   0.004 seconds.
Loading index for chunk 1-1, build 1
No index available; generating from reference genome: /dss/dssfs02/lwp-dss-0001/u7b03/u7b03-dss-0000/ra78zut/Marla_virome/ref/index/1/chr1_index_k13_c14_b1.block
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index:    1.782 seconds.
Analyzed Index:     2.814 seconds.
Cleared Memory:     0.385 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 56 mapping threads.
Note: Coverage capped at 65535; please use the flag 32bit for higher values.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55

   ------------------   Results   ------------------   

Genome:                 1
Key Length:             13
Max Indel:              16000
Minimum Score Ratio:    0.56
Mapping Mode:           normal
Reads Used:             16237318    (2445148296 bases)

Mapping:            148.781 seconds.
Reads/sec:          109135.83
kBases/sec:         16434.57


Pairing data:       pct pairs   num pairs   pct bases      num bases

mated pairs:         97.4414%     7910931    97.5985%     2386428704
bad pairs:            1.0223%       82996     1.0227%       25007664
insert size avg:      324.51


Read 1 data:        pct reads   num reads   pct bases      num bases

mapped:              99.0249%     8039494    99.1715%     1212330330
unambiguous:         99.0228%     8039327    99.1695%     1212305191
ambiguous:            0.0021%         167     0.0021%          25139
low-Q discards:       0.0000%           0     0.0000%              0

perfect best site:   79.2000%     6429976    79.3922%      970536645
semiperfect site:    79.2408%     6433291    79.4332%      971037095
rescued:              1.0093%       81944

Match Rate:               NA           NA    98.4430%     1202221579
Error Rate:          19.9178%     1601301     1.5509%       18939903
Sub Rate:            19.8272%     1594012     0.6345%        7748164
Del Rate:             1.0641%       85552     0.7292%        8905711
Ins Rate:             2.1887%      175959     0.1872%        2286028
N Rate:               0.1302%       10464     0.0061%          74559


Read 2 data:        pct reads   num reads   pct bases      num bases

mapped:              98.6765%     8011209    98.8022%     1208045570
unambiguous:         98.6751%     8011097    98.8009%     1208028685
ambiguous:            0.0014%         112     0.0014%          16885
low-Q discards:       0.0000%           1     0.0000%             25

perfect best site:   68.5175%     5562699    68.6520%      839401030
semiperfect site:    68.5566%     5565880    68.6912%      839881042
rescued:              1.3663%      110922

Match Rate:               NA           NA    98.2873%     1196040747
Error Rate:          30.5078%     2444046     1.7065%       20766421
Sub Rate:            30.3131%     2428452     0.8256%       10046664
Del Rate:             1.2478%       99963     0.7262%        8837337
Ins Rate:             2.0474%      164022     0.1547%        1882420
N Rate:               0.0811%        6497     0.0062%          75739

Reads:                                  16237318
Mapped reads:                           16050703
Mapped bases:                           2436426603
Ref scaffolds:                          1
Ref bases:                              57550

Percent mapped:                         98.851
Percent proper pairs:                   97.441
Average coverage:                       42335.823
Average coverage with deletions:        42290.789
Standard deviation:                     6444.762
Percent scaffolds with any coverage:    100.00
Percent of reference bases covered:     100.00

Total time:         154.154 seconds.

So the Average coverage was 42335.823. Is this in the advisable range? What is the best coverage for near-perfect assembly? How do I interpret this?

Thank you in advance

viral-genome spades coverage de-novo-assembly • 812 views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 10 months ago by mail2steff ▴ 70

0

Entering edit mode

How many multi-fasta sequences are in scaffolds.fasta file?

So the Average coverage was 42335.823. Is this in the advisable range? What is the best coverage for near-perfect assembly?

Since this is a meta assembly there is no way to know the "truth" in terms of what was actually there to begin with. But you clearly have humongous amount of coverage.

ADD REPLY • link 10 months ago by GenoMax 142k

0

Entering edit mode

I edited the post. It has only one contig with a length of 57550 bps. It's a viral genome. I am also suspicious of the coverage.

ADD REPLY • link 10 months ago by mail2steff ▴ 70

0

Entering edit mode

Hmm that seems suspicious that you got a single contig from a metaviral analysis. Did you know there was only one virus in the sample? Since you have millions of reads and 97%+ are aligning it is not unusual to see that high a coverage number.

ADD REPLY • link 10 months ago by GenoMax 142k

0

Entering edit mode

Thank you for the reply. It was a single viral isolate procedure and we expected to have only one viral species. So Can I go ahead with this assembly?

ADD REPLY • link 10 months ago by mail2steff ▴ 70

0

Entering edit mode

Is it similar to a known/related virus? Does it appear to contain the entire genome?

Having too much coverage (over sampling) can be an issue for getting good assemblies. There are tools to normalize the data before assembly (like bbnorm.sh). So you will need to judge how good the assembly is.

ADD REPLY • link 10 months ago by GenoMax 142k

0

Entering edit mode

I verified the contigs using checkV and this contig is a valid viral complete contigs and the it is also ~80 similarity with the closet reference genome. Sequencing depth is around 30X that why I didn't think of doing normalization before. But Ill check bbnorm.sh

ADD REPLY • link 10 months ago by mail2steff ▴ 70

1

Entering edit mode

There is no way your sequencing depth is 30x. Just going by aligned number of bases above there are a total of ~2.4 Billion bases aligned. If the genome is only 57500 bases then you end up with 42000x coverage based on sequence data.

That said if the genome matches a known relative and appears complete then you may be fine using the contig you have.

ADD REPLY • link 10 months ago by GenoMax 142k