Entering edit mode
10 months ago
mail2steff
▴
70
Dear all,
I performed viral genome assembly using metaviral spade and I wanted to map my reads back to contigs to know the coverage. I used bbmap for this purpose and got the following output:
NOTE: Deleting contents of ref/genome/1 because reference is specified and overwrite=true
NOTE: Deleting contents of ref/index/1 because reference is specified and overwrite=true
Writing reference.
Executing dna.FastaToChromArrays2 [Contigs/scaffolds_efa12_57550.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]
Set genScaffoldInfo=true
Writing chunk 1
Set genome to 1
Loaded Reference: 0.004 seconds.
Loading index for chunk 1-1, build 1
No index available; generating from reference genome: /dss/dssfs02/lwp-dss-0001/u7b03/u7b03-dss-0000/ra78zut/Marla_virome/ref/index/1/chr1_index_k13_c14_b1.block
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index: 1.782 seconds.
Analyzed Index: 2.814 seconds.
Cleared Memory: 0.385 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 56 mapping threads.
Note: Coverage capped at 65535; please use the flag 32bit for higher values.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55
------------------ Results ------------------
Genome: 1
Key Length: 13
Max Indel: 16000
Minimum Score Ratio: 0.56
Mapping Mode: normal
Reads Used: 16237318 (2445148296 bases)
Mapping: 148.781 seconds.
Reads/sec: 109135.83
kBases/sec: 16434.57
Pairing data: pct pairs num pairs pct bases num bases
mated pairs: 97.4414% 7910931 97.5985% 2386428704
bad pairs: 1.0223% 82996 1.0227% 25007664
insert size avg: 324.51
Read 1 data: pct reads num reads pct bases num bases
mapped: 99.0249% 8039494 99.1715% 1212330330
unambiguous: 99.0228% 8039327 99.1695% 1212305191
ambiguous: 0.0021% 167 0.0021% 25139
low-Q discards: 0.0000% 0 0.0000% 0
perfect best site: 79.2000% 6429976 79.3922% 970536645
semiperfect site: 79.2408% 6433291 79.4332% 971037095
rescued: 1.0093% 81944
Match Rate: NA NA 98.4430% 1202221579
Error Rate: 19.9178% 1601301 1.5509% 18939903
Sub Rate: 19.8272% 1594012 0.6345% 7748164
Del Rate: 1.0641% 85552 0.7292% 8905711
Ins Rate: 2.1887% 175959 0.1872% 2286028
N Rate: 0.1302% 10464 0.0061% 74559
Read 2 data: pct reads num reads pct bases num bases
mapped: 98.6765% 8011209 98.8022% 1208045570
unambiguous: 98.6751% 8011097 98.8009% 1208028685
ambiguous: 0.0014% 112 0.0014% 16885
low-Q discards: 0.0000% 1 0.0000% 25
perfect best site: 68.5175% 5562699 68.6520% 839401030
semiperfect site: 68.5566% 5565880 68.6912% 839881042
rescued: 1.3663% 110922
Match Rate: NA NA 98.2873% 1196040747
Error Rate: 30.5078% 2444046 1.7065% 20766421
Sub Rate: 30.3131% 2428452 0.8256% 10046664
Del Rate: 1.2478% 99963 0.7262% 8837337
Ins Rate: 2.0474% 164022 0.1547% 1882420
N Rate: 0.0811% 6497 0.0062% 75739
Reads: 16237318
Mapped reads: 16050703
Mapped bases: 2436426603
Ref scaffolds: 1
Ref bases: 57550
Percent mapped: 98.851
Percent proper pairs: 97.441
Average coverage: 42335.823
Average coverage with deletions: 42290.789
Standard deviation: 6444.762
Percent scaffolds with any coverage: 100.00
Percent of reference bases covered: 100.00
Total time: 154.154 seconds.
So the Average coverage was 42335.823. Is this in the advisable range? What is the best coverage for near-perfect assembly? How do I interpret this?
Thank you in advance
How many multi-fasta sequences are in
scaffolds.fasta
file?Since this is a meta assembly there is no way to know the "truth" in terms of what was actually there to begin with. But you clearly have humongous amount of coverage.
I edited the post. It has only one contig with a length of 57550 bps. It's a viral genome. I am also suspicious of the coverage.
Hmm that seems suspicious that you got a single contig from a metaviral analysis. Did you know there was only one virus in the sample? Since you have millions of reads and 97%+ are aligning it is not unusual to see that high a coverage number.
Thank you for the reply. It was a single viral isolate procedure and we expected to have only one viral species. So Can I go ahead with this assembly?
Is it similar to a known/related virus? Does it appear to contain the entire genome?
Having too much coverage (over sampling) can be an issue for getting good assemblies. There are tools to normalize the data before assembly (like
bbnorm.sh
). So you will need to judge how good the assembly is.I verified the contigs using checkV and this contig is a valid viral complete contigs and the it is also ~80 similarity with the closet reference genome. Sequencing depth is around 30X that why I didn't think of doing normalization before. But Ill check bbnorm.sh
There is no way your sequencing depth is 30x. Just going by aligned number of bases above there are a total of ~2.4 Billion bases aligned. If the genome is only 57500 bases then you end up with 42000x coverage based on sequence data.
That said if the genome matches a known relative and appears complete then you may be fine using the contig you have.