Hi all,
I am working on a virus discovery and virome profiling project, and I would like to quantify the viruses in my samples. So far, my plan is to:
1) Assemble reads into contigs.
2) Use CD-HIT to cluster contigs with >=95% identity.
3) Use MetaBAT2 to bin this "nonredundant" pool of contigs, so that I have a binned, nonredundant list of putative viruses to which I can map my raw reads with a program like Salmon.
4) Normalize the resulting read counts using metagenomeSeq and visualize them in microViz.
Basically, I thought it would be best to create this nonredundant set of contigs for accurate quantification: if 10 contigs all belonged to one virus and Salmon quantified each of them separately, I would have to manually add all of these counts together downstream.
I am a bit confused about what the best practice would be after binning my contigs, though, for two reasons:
- My first idea was to simply concatenate all contigs in a bin into one very long contig, so that I capture the most genomic information. Or is it better to select only the longest contig in the bin?
- What is the best way to handle separate bins that come from the same organism? For example, I have one bin that corresponds to virus 1's RNA polymerase and one bin that corresponds to virus 1's coat protein. Would I have to manually combine these counts later to get the "total counts"?
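To make the "manually combine counts" part concrete, here is a rough sketch of what I imagine doing downstream. It assumes the standard Salmon quant.sf columns (Name, Length, EffectiveLength, TPM, NumReads). Everything else is hypothetical and only for illustration: the contig, bin, and virus names, the toy numbers, the two mapping dictionaries, and the helper function sum_counts_by_group. I am not claiming this is the recommended approach, it is just the summing I have in mind.

```python
import csv
import io
from collections import defaultdict


def sum_counts_by_group(quant_file, contig_to_group):
    """Sum Salmon NumReads over all contigs assigned to the same group.

    quant_file: an open, tab-separated quant.sf-style file object.
    contig_to_group: dict mapping contig name -> group label (bin or virus).
    """
    totals = defaultdict(float)
    reader = csv.DictReader(quant_file, delimiter="\t")
    for row in reader:
        # Contigs missing from the mapping keep their own name as the group.
        group = contig_to_group.get(row["Name"], row["Name"])
        totals[group] += float(row["NumReads"])
    return dict(totals)


# Toy quant.sf content with made-up numbers (not real data).
example_quant = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "contig_1\t3000\t2800.0\t10.0\t120.0\n"
    "contig_2\t1500\t1300.0\t5.0\t30.0\n"
    "contig_3\t2000\t1800.0\t8.0\t50.0\n"
)

# Hypothetical mappings: contig -> bin (from binning), bin -> virus (from annotation).
contig_to_bin = {"contig_1": "bin_1", "contig_2": "bin_1", "contig_3": "bin_2"}
bin_to_virus = {"bin_1": "virus_1", "bin_2": "virus_1"}

# Chain the two mappings so each contig points directly at its virus.
contig_to_virus = {c: bin_to_virus.get(b, b) for c, b in contig_to_bin.items()}

per_bin = sum_counts_by_group(io.StringIO(example_quant), contig_to_bin)
per_virus = sum_counts_by_group(io.StringIO(example_quant), contig_to_virus)

print(per_bin)
print(per_virus)
```

With real data, I would replace the io.StringIO object with open("quant.sf") and build the two mappings from the binning output and from whatever annotation links bins to viruses.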
Thank you