flye assembly stuck on 'Aligning reads to the graph'
6 weeks ago
jleehan ▴ 90

I've been working on genome assemblies for strains of Bacillus subtilis using 10 kb PacBio reads. I have about 1.1 million 10 kb reads for each strain, which is far more coverage than I need for a good assembly, so I randomly sampled 100,000 reads per strain and used those for the assembly. One of the strains was evolved during a competition experiment rather than constructed, as the other two were. I've had no problem using Flye to complete the assemblies for the two constructed strains. However, when I try to assemble the .fastq reads for the evolved strain, the assembly always gets stuck on one particular step, 'Aligning reads to the graph'. The graph in question should be the repeat graph that the program builds a few steps earlier. I've never been able to get past this step; the job always times out after about 12 hours (because I only allotted 12 hours in our job scheduler).
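For reference, a random subsample like the one described above can be drawn in a single pass with reservoir sampling. This is only a minimal sketch, not the exact procedure used for these strains: the function names are made up for illustration, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```python
import random
from typing import Iterable, Iterator, List, Optional, Tuple

# One FASTQ record: (header line, sequence, '+' line, quality string)
FastqRecord = Tuple[str, str, str, str]


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from an iterable of lines, assuming 4 lines per record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError(f"truncated FASTQ record: {header}")
        yield (header, seq.rstrip("\n"), plus.rstrip("\n"), qual.rstrip("\n"))


def reservoir_sample(
    records: Iterable[FastqRecord], k: int, seed: Optional[int] = None
) -> List[FastqRecord]:
    """Return k records chosen uniformly at random (all of them if fewer than k)."""
    rng = random.Random(seed)
    reservoir: List[FastqRecord] = []
    for i, rec in enumerate(records):
        if i < k:
            reservoir.append(rec)
        else:
            # Keep the new record with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = rec
    return reservoir


# Example usage (file names are hypothetical):
# with open("evolved.fastq") as fin, open("evolved.100k.fastq", "w") as fout:
#     for rec in reservoir_sample(read_fastq(fin), 100_000, seed=42):
#         fout.write("\n".join(rec) + "\n")
```

Fixing the seed makes the subsample reproducible, which helps when comparing runs at different sample sizes.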

I know these assemblies can take a lot of time; however, my other assemblies took only about 70 minutes to complete, and in those runs the 'Aligning reads to the graph' step took only about 5 minutes according to the Flye logs. Flye also prints a progress indicator during that step, noting every 10% of progress, and I've never seen even the 0% indicator for the assembly of the evolved strain.

Does anyone know what the problem could be or how I could work around it? I tried sampling more reads (200,000) for the assembly and got the same result. I'm currently running a job with a smaller sample of reads (50,000, which should still be plenty of coverage, >80X), but I'm not sure whether that will work.
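The coverage figures above follow from simple arithmetic: coverage is roughly (number of reads × mean read length) / genome size. A small sketch of that calculation is below. The genome size of about 4.2 Mb for B. subtilis is an assumption that is not stated in this post, and the real mean read length may well differ from the nominal 10 kb, so treat the numbers as rough estimates.

```python
import math


def estimate_coverage(n_reads: int, mean_read_len: float, genome_size: int) -> float:
    """Approximate fold coverage: total sequenced bases / genome size."""
    return n_reads * mean_read_len / genome_size


def reads_for_coverage(target_cov: float, mean_read_len: float, genome_size: int) -> int:
    """Approximate number of reads needed to reach a target fold coverage."""
    return math.ceil(target_cov * genome_size / mean_read_len)


# Assumed values: ~4.2 Mb genome, nominal 10 kb reads
GENOME_SIZE = 4_200_000
READ_LEN = 10_000

for n in (50_000, 100_000, 200_000):
    print(n, "reads ->", round(estimate_coverage(n, READ_LEN, GENOME_SIZE), 1), "X")
```

Under these assumptions, 50,000 reads of 10 kb work out to roughly 119X, so a measured coverage somewhat below that would suggest shorter reads on average or reads that do not all contribute to the assembly.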

pacbio genome longread assembly flye • 204 views

Have you tried to align the data from the "evolved" strain to the parent? Does the data cover the entire parent genome? What kind of changes are you expecting in your "evolved" genome?
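One way to check whether the reads cover the whole parent genome is to align them to the parent assembly and measure what fraction of each reference sequence is touched by at least one alignment. The sketch below assumes the alignments are already available in PAF format (the tab-separated format written by aligners such as minimap2); the function name and the `min_mapq` threshold are illustrative choices, not part of any tool.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def covered_fraction(paf_lines: Iterable[str], min_mapq: int = 0) -> Dict[str, float]:
    """Fraction of each target sequence covered by at least one alignment.

    Uses PAF columns 6-9 (target name, target length, target start, target end)
    and column 12 (mapping quality). PAF coordinates are 0-based, end-exclusive.
    Targets with no alignments at all do not appear in the result.
    """
    lengths: Dict[str, int] = {}
    intervals: Dict[str, List[Tuple[int, int]]] = defaultdict(list)

    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or empty lines
        tname = fields[5]
        tlen, tstart, tend = int(fields[6]), int(fields[7]), int(fields[8])
        if int(fields[11]) < min_mapq:
            continue
        lengths[tname] = tlen
        intervals[tname].append((tstart, tend))

    result: Dict[str, float] = {}
    for tname, ivs in intervals.items():
        ivs.sort()
        covered = 0
        cur_start, cur_end = ivs[0]
        # Merge overlapping intervals and sum the merged lengths
        for start, end in ivs[1:]:
            if start <= cur_end:
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start
                cur_start, cur_end = start, end
        covered += cur_end - cur_start
        result[tname] = covered / lengths[tname]
    return result


# Example usage (file name is hypothetical):
# with open("evolved_vs_parent.paf") as fh:
#     print(covered_fraction(fh, min_mapq=20))
```

A covered fraction well below 1.0, or a reference sequence missing from the result, would point to regions of the parent genome that the evolved-strain reads do not reach.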

Without access to your data, I am not sure you are going to get a definitive answer to your specific question. As speculation, perhaps you still have way too much data.

The assembly with 50k reads instead of 100k actually just finished about 5 minutes ago, and according to the assembly statistics it had a mean coverage of 80X. So, as you said on my last post and here, too much data can screw up assemblies. Apparently I just got to witness that firsthand.

I'd previously sequenced the evolved and parental strains with short reads, and that gave us really wonky results, which is why I'm doing this again with long reads: to see whether there were any large-scale genomic changes.

That is good to know. You can also use those short reads to make sure there are no unexpected indels etc. in your long-read assemblies, since short reads are going to be much more accurate.

Do you have any recommended methods or programs for using short reads to verify a long-read assembly?
