Dear All,
I assembled a viral genome using metaviralSPAdes. Since this is a relatively new genome, I don't have a close reference for it. So I mapped my reads back to the contigs with BBMap to check the coverage, and got the following output:
java -ea -Xmx21819m -Xms21819m -cp /Tools/bbmap/current/ align2.BBMap build=1 overwrite=true fastareadlen=500 ref=/7_new/contigs.fasta in=200_forward_paired_vulgatus_clean.fastq in2=200_reverse_paired_vulgatus_clean.fastq covstats=constats_AvS7_all.txt covhist=covhist_AvS7_all.txt basecov=basecov_AvS7_all.txt bincov=bincov_AvS7_all.txt t=200
Executing align2.BBMap [build=1, overwrite=true, fastareadlen=500, ref=/7_new/contigs.fasta, in=200_forward_paired_vulgatus_clean.fastq, in2=200_reverse_paired_vulgatus_clean.fastq, covstats=constats_AvS7_all.txt, covhist=covhist_AvS7_all.txt, basecov=basecov_AvS7_all.txt, bincov=bincov_AvS7_all.txt, t=200]
Version 39.01
Set threads to 200
Retaining first best site only for ambiguous mappings.
No output file.
NOTE: Deleting contents of ref/genome/1 because reference is specified and overwrite=true
NOTE: Deleting contents of ref/index/1 because reference is specified and overwrite=true
Writing reference.
Executing dna.FastaToChromArrays2 [/7_new/contigs.fasta, 1, writeinthread=false, genscaffoldinfo=true, retain, waitforwriting=false, gz=true, maxlen=536670912, writechroms=true, minscaf=1, midpad=300, startpad=8000, stoppad=8000, nodisk=false]
Set genScaffoldInfo=true
Writing chunk 1
Set genome to 1
Loaded Reference: 0.004 seconds.
Loading index for chunk 1-1, build 1
No index available; generating from reference genome: 7/ref/index/1/chr1_index_k13_c13_b1.block
Indexing threads started for block 0-1
Indexing threads finished for block 0-1
Generated Index: 1.765 seconds.
Analyzed Index: 2.695 seconds.
Cleared Memory: 0.347 seconds.
Processing reads in paired-ended mode.
Started read stream.
Started 200 mapping threads.
Detecting finished threads: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199
------------------ Results ------------------
Genome: 1
Key Length: 13
Max Indel: 16000
Minimum Score Ratio: 0.56
Mapping Mode: normal
Reads Used: 9360574 (1409409714 bases)
Mapping: 46.475 seconds.
Reads/sec: 201409.01
kBases/sec: 30325.90
Pairing data: pct pairs num pairs pct bases num bases
mated pairs: 5.3164% 248823 5.3267% 75074960
bad pairs: 0.0528% 2471 0.0526% 740736
insert size avg: 305.92
Read 1 data: pct reads num reads pct bases num bases
mapped: 5.4125% 253319 5.4169% 38209925
unambiguous: 5.4124% 253316 5.4169% 38209476
ambiguous: 0.0001% 3 0.0001% 449
low-Q discards: 0.0000% 0 0.0000% 0
perfect best site: 4.2656% 199644 4.2721% 30134879
semiperfect site: 4.2671% 199711 4.2735% 30144721
rescued: 0.1074% 5025
Match Rate: NA NA 98.1200% 37791543
Error Rate: 20.7722% 52620 1.8701% 720264
Sub Rate: 20.6234% 52243 0.7863% 302859
Del Rate: 1.6079% 4073 0.7938% 305730
Ins Rate: 3.2378% 8202 0.2899% 111675
N Rate: 0.6245% 1582 0.0100% 3848
Read 2 data: pct reads num reads pct bases num bases
mapped: 5.3925% 252385 5.4043% 38047930
unambiguous: 5.3925% 252383 5.4043% 38047631
ambiguous: 0.0000% 2 0.0000% 299
low-Q discards: 0.0000% 0 0.0000% 0
perfect best site: 3.7263% 174402 3.7388% 26322053
semiperfect site: 3.7277% 174465 3.7400% 26331016
rescued: 0.1078% 5046
Match Rate: NA NA 92.4335% 37564145
Error Rate: 30.7558% 77623 7.5557% 3070576
Sub Rate: 30.5129% 77010 0.9457% 384336
Del Rate: 1.6744% 4226 6.3761% 2591184
Ins Rate: 2.9796% 7520 0.2339% 95056
N Rate: 0.3598% 908 0.0108% 4393
Reads: 9360574
Mapped reads: 505703
Mapped bases: 76763410
Ref scaffolds: 3
Ref bases: 124837
Percent mapped: 5.402
Percent proper pairs: 5.316
Average coverage: 614.909
Average coverage with deletions: 632.237
Standard deviation: 952.759
Percent scaffolds with any coverage: 100.00
Percent of reference bases covered: 100.00
Total time: 51.659 seconds.
Only about 5% of the reads map overall, but the average coverage is 614x. I see the same pattern in my other samples: a very low "Percent mapped" and high coverage. Does this mean the assembly is not good? If so, how can I improve the results? Thank you in advance.
Thank you for the reply. I was sceptical because only 5.402% of the reads map, even though the coverage is high. I filtered out the host genome before assembly, so host reads are not present.
The assembly has three contigs.
However, our study used a single viral isolate, so we expected only one viral species. Is it acceptable to have a low read mapping percentage together with high coverage? What could be the reasons for the very low read mapping? And in the sample above, can I consider Contig 1 to be the true positive?
Contig 1: possibly, even probably. But you're presumably a virologist, so you know much better than us bioinformaticians which contigs and genes are necessary for your virus.
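If it helps to compare the three contigs, the covstats file written by the run above (constats_AvS7_all.txt) can be summarised per contig. Below is a rough Python sketch, not a definitive recipe. It assumes the file is tab-separated with a header line containing the columns #ID, Avg_fold, Length and Covered_percent; check the first line of your own file. The function names and the two example rows are made up for illustration and are not your data.

```python
import csv
import io


def read_covstats(handle):
    """Parse a BBMap-style covstats table into one dict per contig.

    Assumption: tab-separated, with header columns named
    '#ID', 'Avg_fold', 'Length' and 'Covered_percent'.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            "contig": row["#ID"],
            "avg_fold": float(row["Avg_fold"]),
            "length": int(row["Length"]),
            "covered_percent": float(row["Covered_percent"]),
        })
    return rows


def share_of_mapped_bases(rows):
    """Approximate fraction of all mapped bases that landed on each contig.

    Mapped bases per contig are estimated as avg_fold * length.
    """
    total = sum(r["avg_fold"] * r["length"] for r in rows)
    return {r["contig"]: r["avg_fold"] * r["length"] / total for r in rows}


# Hypothetical two-contig table, for illustration only (NOT the real output).
EXAMPLE = (
    "#ID\tAvg_fold\tLength\tRef_GC\tCovered_percent\tCovered_bases\t"
    "Plus_reads\tMinus_reads\tRead_GC\tMedian_fold\tStd_Dev\n"
    "contig_A\t100.0\t1000\t0.5\t100.0\t1000\t500\t500\t0.5\t100\t10.0\n"
    "contig_B\t10.0\t1000\t0.5\t100.0\t1000\t50\t50\t0.5\t10\t2.0\n"
)

if __name__ == "__main__":
    rows = read_covstats(io.StringIO(EXAMPLE))
    for contig, share in share_of_mapped_bases(rows).items():
        print(f"{contig}: {share:.1%} of mapped bases")
```

To run it on the real file, replace the `io.StringIO(EXAMPLE)` argument with an open file handle, e.g. `open("constats_AvS7_all.txt")`. A contig that takes nearly all of the mapped bases at even coverage is a more plausible candidate for the isolate than one with very little or very uneven coverage.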
The rest of the reads are, as I said, most likely some form of contamination (i.e. your isolate is not pure). Have a look at metagenomic read classification tools; Centrifuge and Kraken are easy enough to use.
Alternatively, your de novo assembly might be incomplete (e.g. only 10% of the genome has been reconstructed).
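As for why a low mapping percentage and a high coverage are not contradictory: the two numbers have different denominators. Percent mapped is relative to all reads, while average coverage is relative to the (small) reference. A short Python check using only values copied from the BBMap summary above; the function names are mine, and the calculation is the usual definition rather than BBMap's own code:

```python
def percent_mapped(mapped_reads, total_reads):
    """Share of all input reads that aligned, in percent."""
    return 100.0 * mapped_reads / total_reads


def average_coverage(mapped_bases, ref_bases):
    """Mean depth: aligned bases divided by reference length."""
    return mapped_bases / ref_bases


# Values taken from the "Results" section of the log in the question.
TOTAL_READS = 9360574
MAPPED_READS = 505703
MAPPED_BASES = 76763410
REF_BASES = 124837

if __name__ == "__main__":
    print(f"Percent mapped:   {percent_mapped(MAPPED_READS, TOTAL_READS):.3f}")   # 5.402
    print(f"Average coverage: {average_coverage(MAPPED_BASES, REF_BASES):.3f}")  # 614.909
```

So about 77 Mb of aligned sequence spread over roughly 125 kb of contigs already gives ~600x depth, even though it is only about 5% of the library. The coverage says the contigs are well supported; it says nothing about what the other 95% of the reads are.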
Thank you so much. It is very helpful