The hard way?
8.9 years ago
skbrimer ▴ 730

Hi everyone,

I am not exactly new to sequence analysis, but I still feel very green. I have been to a few short courses on how to perform the usual analyses and I feel comfortable with those, well, to a point. Perhaps the more accurate way to say it is that I no longer think I'm doing it wrong.

My concern is the generation of the assembly from the reads. In all the reading I have done, and in the short courses, this step seems to be glossed over. I think I am making good assemblies; we had some of the data assembled by another lab and their draft was the same as ours. However, it took them a day, while it takes me a week.

I am currently working on WGS of dsRNA viruses (reo) with in-house Ion Torrent data. My current workflow is to get the reads from the machine, filter out the low-quality reads, and then use MIRA or DNASTAR NGen to assemble the reads de novo into contigs (there is high variability even between closely related strains). Then I take the contig FASTA file and BLAST it against NCBI to figure out which contigs are what, place that information into a spreadsheet, and then find the overlaps from the BLAST data and build my genes.
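Concretely, the manual workflow reads like the sketch below. All file names are placeholders, sickle is just one example of a read filter, and the MIRA manifest is assumed to already point at the filtered reads with `technology = iontor`:

```shell
# 1) Quality-filter the Ion Torrent reads (sickle shown as one option;
#    thresholds are illustrative).
sickle se -t sanger -f reads.fastq -o reads.filtered.fastq -q 20 -l 50

# 2) De novo assemble with MIRA via a manifest file.
mira manifest.conf

# 3) BLAST the contigs against NCBI nt to identify each segment;
#    tabular output is far easier to post-process than a spreadsheet.
blastn -query contigs.fasta -db nt -remote \
       -outfmt "6 qseqid sseqid pident length evalue stitle" \
       > contigs_vs_nt.tsv

# 4) Sort hits by contig and percent identity to spot which contigs
#    belong to which segment.
sort -k1,1 -k3,3nr contigs_vs_nt.tsv > contigs_vs_nt.sorted.tsv
```

The tabular BLAST output (`-outfmt 6`) replaces the copy/paste-into-a-spreadsheet step, since it can be sorted and filtered directly on the command line.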

Is this the correct way to do this kind of assembly? I know it works, but it is pretty labor intensive; we are still sequencing more samples and the backlog is getting pretty deep. Any insight on how I could speed this process up?

8.9 years ago
mark.ziemann ★ 1.9k

Hi skbrimer,

Why does your analysis take so long compared to the other lab? Is it the lack of compute resources or the lack of automated pipeline?

If the latter, you need to choose the right tools to do the assembly faster and more accurately. Select tools that can run multi-threaded/parallel without too large a memory footprint. ABySS is old but still a good choice of assembler. Make the workflow more automated: script each step, then chain them all together. Use cron to schedule jobs, including over the weekend if you need to.
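As a rough sketch of what "script each step, then chain them together" can look like for one run. Every name here is a placeholder (sickle and ABySS stand in for whatever filter/assembler you settle on, and the thresholds are illustrative):

```shell
#!/usr/bin/env bash
# run_assembly.sh -- hypothetical per-run wrapper chaining the steps.
# Usage: ./run_assembly.sh <reads.fastq> <outdir>
set -euo pipefail

reads=$1
outdir=$2
mkdir -p "$outdir"

# Step 1: quality filter (tool and thresholds are placeholders).
sickle se -t sanger -f "$reads" -q 20 -l 50 -o "$outdir/filtered.fastq"

# Step 2: de novo assembly (ABySS single-end invocation as an example;
#         -C runs inside outdir, so the reads path is relative to it).
abyss-pe -C "$outdir" k=64 name=virus se="filtered.fastq"

# Step 3: identify contigs against a local BLAST database.
blastn -query "$outdir/virus-contigs.fa" -db nt \
       -outfmt 6 -num_threads 8 > "$outdir/hits.tsv"

echo "Finished $(basename "$reads") -> $outdir"
```

Once it works end to end by hand, a crontab entry such as `0 20 * * * /path/to/run_assembly.sh /data/new_run.fastq /data/out/new_run` (paths hypothetical) lets runs start overnight without anyone at the keyboard.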

If compute power is an issue, get your hands on more compute resources (CPU and memory). Get a bigger server, or connect computers together into a cluster, and use tools like GNU parallel and qsub to distribute the workload across the system.
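For example, assuming a hypothetical per-sample wrapper script `assemble_one.sh` that filters and assembles one FASTQ, both approaches look like this (paths and resource requests are illustrative):

```shell
# assemble_one.sh is a hypothetical per-sample wrapper (filter + assemble).
# Run four samples at a time on this machine; {/.} expands to the input
# basename without its extension. Add -S node1,node2 to spread jobs over
# ssh-reachable hosts.
ls runs/*.fastq | parallel -j 4 ./assemble_one.sh {} out/{/.}

# Or submit each sample to a Grid Engine cluster with qsub instead:
for fq in runs/*.fastq; do
    qsub -cwd -pe smp 8 assemble_one.sh "$fq" "out/$(basename "$fq" .fastq)"
done
```

GNU parallel is the lighter option when everything fits on one or two machines; qsub pays off once a scheduler is already managing the cluster.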

You should be able to perform sequence quality control, assembly, and BLAST, and generate PDF reports for each sequencing run, all in an automated fashion, reducing the "busy work" of cutting and pasting spreadsheets, etc.


Hi Mark,

Thank you for your reply. I know that this is a more open-ended question and I really do appreciate the feedback. I suppose the time difference is because I do not have an automated pipeline, which is because my scripting knowledge is lacking. I know a little Python and a little R, but beyond that I'm not very well versed or practiced. I'm a wet-bench biologist moving into the in silico world :)

In my workflow the bottleneck is now the joining of contigs to make complete segments. Is this something that can be automated to a point, e.g., if two contigs are x% similar, go ahead and join them? Or is this better handled by tweaking the assembler inputs (right now I use all default settings)?

Our reoviruses are presenting with different phenotypes, and our project is to sequence several isolates from each phenotype to see whether there are significant genotype changes that might explain the differences. I believe that to answer this question I have to assemble all of these genomes and annotate them in order to compare them properly. Is this train of thought correct as well?


It looks like you are left with a number of overlapping contigs after assembly. Try adjusting parameters such as k-mer length and coverage until you get a full-length virus sequence. If that fails, you might want to use CAP3 to perform the final merging step. Unix shell scripting is well suited to automating workflows and pipelines such as these, so I would recommend you develop more skills in that area.
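The CAP3 step can itself answer the "join if x% similar" question, since the identity and overlap cutoffs are command-line options. A minimal sketch, with placeholder file names:

```shell
# Merge overlapping contigs with CAP3.
#   -p : minimum percent identity for an overlap (default 90)
#   -o : minimum overlap length in bases (default 40)
cap3 contigs.fasta -p 90 -o 40 > cap3.log

# CAP3 writes its results next to the input file: merged sequences go to
# *.cap.contigs and anything it could not join to *.cap.singlets.
cat contigs.fasta.cap.contigs contigs.fasta.cap.singlets > merged.fasta
```

Loosening `-p` joins more aggressively, which for a high-variability virus risks collapsing distinct segments, so it pays to inspect `cap3.log` after each run.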

Genotyping the viruses by de novo assembly is certainly the hard way, but it is more sensitive to large-scale differences from the reference sequence.

Genotyping by aligning reads directly to a library of viral sequences would be faster, but you wouldn't be able to identify anything considerably different from the reference sequences. With that approach, the consensus sequence for each sequencing run can be extracted using samtools mpileup.
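The mpileup route might look like the following (reference and file names are placeholders, and bwa is just one choice of aligner; the mpileup-to-consensus chain shown is the classic samtools/bcftools recipe):

```shell
# Align reads to the closest known reference and sort the alignment.
bwa index ref.fasta
bwa mem ref.fasta reads.fastq | samtools sort -o aln.bam -
samtools index aln.bam

# Pile up the alignments, call a consensus genotype at each position,
# and emit the consensus as FASTQ (low-coverage sites come out soft-masked).
samtools mpileup -uf ref.fasta aln.bam \
  | bcftools call -c \
  | vcfutils.pl vcf2fq > consensus.fastq
```

This runs in minutes per sample, but everything it reports is relative to `ref.fasta`, which is exactly the sensitivity trade-off described above.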

It comes down to how novel the sequences are compared to known viral sequences and whether you can get the assembly to work reliably for many runs.


Thank you again for the insight. I will work on all of these to see if I can find the process that works best.
