Hi everyone,
I am not exactly new to sequence analysis, but I still feel very green. I have been to a few short courses on how to perform the usual analyses, and I feel comfortable with that, up to a point. I guess the more accurate way to put it is that I no longer think I'm doing it wrong.
My concern is with generating the assembly from the reads. In all the reading I have done and in the short courses, this step seems to get glossed over. I think I am making good assemblies; we had some of the data assembled by another lab, and their draft was the same as ours. However, it took them a day, and it takes me a week.
I am currently working on WGS of dsRNA viruses (reoviruses) with in-house Ion Torrent data. My current workflow is to get the reads off the machine, filter out the low-quality reads, and then de novo assemble the reads into contigs with MIRA or DNAstar NGen (de novo because there is high variability even between closely related strains). Then I take the contig FASTA file, BLAST it against NCBI to figure out which contigs are which, place that information into a spreadsheet, and then find the overlaps from the BLAST data and piece together my genes.
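To be concrete, the BLAST step is all point-and-click through the web interface right now. From what I've read, the command-line equivalent would be something like this (just a sketch; it assumes the NCBI BLAST+ tools are installed and that contigs.fasta is the assembler output, and I haven't actually set this up yet):

```bash
# The identification step I currently do by hand (assumes NCBI BLAST+;
# -remote sends the query to NCBI's servers instead of a local database).
blastn -query contigs.fasta -db nt -remote \
       -outfmt "6 qseqid sseqid pident length evalue stitle" \
       -max_target_seqs 5 -out blast_hits.tsv
# blast_hits.tsv is a tab-separated table, essentially the
# spreadsheet I build manually.
```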
Is this the correct way to do this kind of assembly? I know it works, but it is pretty labor-intensive; we are still sequencing more samples, and the backlog is getting pretty deep. Any insight on how I could speed this process up?
Hi Mark,
Thank you for your reply. I know this is a more open-ended question, and I really do appreciate the feedback. I suppose the time difference comes down to my not having an automated pipeline, which in turn is because my scripting knowledge is lacking. I know a little Python and a little R, but beyond that I'm not very well versed or practiced. I'm a wet-bench biologist moving into the in silico world :)
In my current workflow, the bottleneck is joining contigs into complete segments. Is this something that can be automated, at least to a point, e.g. if two contigs are x% similar, go ahead and join them? Or is this better handled by tweaking the assembler inputs (right now I'm using all default settings)?
Our reoviruses present with different phenotypes, and our project is to sequence several isolates from each phenotype to see whether there are significant genotype differences that might explain the different phenotypes. I believe that to answer this question I have to assemble and annotate all of these genomes in order to compare them properly. Is this train of thought correct as well?
It looks like you are left with a number of overlapping contigs after assembly. Try adjusting parameters such as k-mer length and coverage until you're getting full-length virus sequences. If that fails, you might want to use CAP3 to perform the final merging step. Unix shell scripting is well suited to automating workflows and pipelines such as these, so I would recommend developing more skills in that area.
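As a rough sketch of what that merging step could look like scripted (assuming CAP3 is on your PATH and contigs.fasta is your assembler output; CAP3's -p flag is exactly the "x% similar" knob you asked about):

```bash
#!/bin/bash
set -euo pipefail

CONTIGS=contigs.fasta

# Merge overlapping contigs. -p is the overlap percent identity
# cutoff (the "join if x% similar" threshold) and -o the minimum
# overlap length in bases. CAP3 writes merged contigs to
# <input>.cap.contigs and unmerged sequences to <input>.cap.singlets.
cap3 "$CONTIGS" -p 95 -o 40 > cap3.log
cat "$CONTIGS.cap.contigs" "$CONTIGS.cap.singlets" > merged_segments.fasta
```

From there the same script can hand merged_segments.fasta to a command-line BLAST, which removes the copy-into-a-spreadsheet step entirely.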
Genotyping the viruses by de novo assembly is certainly the hard way, but it is more sensitive to large-scale differences from the reference sequence.
Genotyping by aligning reads directly against a library of viral reference sequences would be faster, but you wouldn't be able to identify anything considerably different from those references. The consensus sequence for each sequencing run can then be extracted using samtools mpileup.
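For example, here is a minimal sketch of that reference-based route (assuming bwa, samtools, and bcftools are installed; note that in current releases the mpileup step has moved from samtools into bcftools, and bwa stands in here for whatever Ion Torrent-appropriate mapper you prefer):

```bash
# Map reads to the closest reference segment and call a consensus.
# ref.fasta = reference segment, reads.fastq = quality-filtered reads.
bwa index ref.fasta
bwa mem ref.fasta reads.fastq | samtools sort -o aln.bam -
samtools index aln.bam

# Call variants against the reference, then apply them to produce
# the per-run consensus sequence.
bcftools mpileup -f ref.fasta aln.bam | bcftools call -mv -Oz -o calls.vcf.gz
bcftools index calls.vcf.gz
bcftools consensus -f ref.fasta calls.vcf.gz > consensus.fasta
```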
It comes down to how novel the sequences are compared to known viral sequences and whether you can get the assembly to work reliably for many runs.
Thank you again for the insight. I will work on all of these to see if I can find the process that works best.