Question: Questions about de novo genome assembly from mixed DNA samples
2.7 years ago by
niuyw30 wrote:


I'm currently working on a de novo genome assembly project. Because the animal we study is so tiny, we had to pool DNA from multiple individuals. We sequenced this DNA with both PacBio and Illumina. I've assembled the PacBio long reads into contigs and now want to do scaffolding and error correction using the short reads. But I have the following two concerns:

  • Can I use these short reads to correct the assembly? Since pooled DNA samples were used, how do the error correction tools distinguish "errors" from "differences between individuals"?

  • If I can do error correction, which should be done first, error correction or scaffolding? I don't understand the difference between them.

I have very little experience in this field. Sorry if the question is a bit basic. I'm totally stuck. Any help would be greatly appreciated.

modified 2.7 years ago by lieven.sterck10k • written 2.7 years ago by niuyw30
2.7 years ago by
VIB, Ghent, Belgium
lieven.sterck10k wrote:

Error correction (polishing) tools will not discriminate between an "error" and a "genetic difference", so it will be up to you to decide on this. Do you have reason to believe there might be substantial differences between individuals? If so, I would be careful about integrating all the data.

I would go for polishing first, then scaffolding: by correcting first you might get a higher (= better) mapping rate for your mate pair data (which I assume you have?). The reverse order will also work, though.

Concerning the difference between them: error correction will try to amend incorrect bases in the assembly (as well as indels etc.), so it works at nucleotide-scale precision and will introduce changes to your contig sequences, while scaffolding will order and orient your contigs into scaffolds without changing the actual contig sequence.
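To make that distinction concrete, here is a toy sketch in Python. It is not how any real polisher or scaffolder works (those rely on read alignments); the function names, the correction table, and the gap size are all made up for illustration. It only shows that polishing edits bases inside a contig, while scaffolding orders and orients contigs and leaves their bases alone.

```python
# Toy illustration only; all names and values here are hypothetical.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def polish(contig: str, corrections: dict[int, str]) -> str:
    """Apply single-base substitutions (0-based positions): the contig changes."""
    bases = list(contig)
    for pos, base in corrections.items():
        bases[pos] = base
    return "".join(bases)


def scaffold(placements: list[tuple[str, str]], gap_size: int = 10) -> str:
    """Order and orient contigs, joined by runs of N: contig bases are untouched."""
    oriented = [
        seq if strand == "+" else reverse_complement(seq)
        for seq, strand in placements
    ]
    return ("N" * gap_size).join(oriented)


polished_a = polish("ACGTTGCA", {3: "A"})  # base at position 3 is rewritten
joined = scaffold([(polished_a, "+"), ("GGATCCAA", "-")], gap_size=5)
print(polished_a)
print(joined)
```

The second contig appears reverse-complemented in the scaffold, but none of its bases were corrected; only the first contig was altered, and only by the polishing step.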

written 2.7 years ago by lieven.sterck10k

Thank you for your answer! It's very clear. Now I can proceed. Thank you :)

I've got another question. Maybe I should open a new thread, but you already know the circumstances. Because of the tiny body size, we used whole bodies for sequencing, so there is considerable sequence contamination in both the PacBio and Illumina data. What would you suggest? Because the estimated coverage of the PacBio data is more than 50X, I used the long reads to assemble contigs, and I plan to scaffold with the mate pair data using these contigs as the backbone. That is my plan, but I don't know how to deal with the contamination.

My general idea is as follows. First, use (all) raw PacBio data to assemble contigs, and mask the (possibly) contaminated regions with 'N', which can be done using Kraken. Second, filter the Illumina mate pair data for both quality and contamination. Third, (polish and) scaffold. Do you think this is a good approach? I really appreciate your replies; they are very helpful. Sorry for my poor English.
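The masking step in this plan could look roughly like the Python sketch below. It assumes the suspect regions are already available as 0-based, half-open (start, end) intervals per contig; how those intervals are derived from the classifier output is not covered here, and the function name and example coordinates are hypothetical.

```python
def mask_regions(seq: str, regions: list[tuple[int, int]]) -> str:
    """Replace every 0-based, half-open [start, end) interval in seq with N."""
    bases = list(seq)
    for start, end in regions:
        # Clamp to the sequence bounds so out-of-range intervals are harmless.
        start = max(start, 0)
        end = min(end, len(bases))
        for i in range(start, end):
            bases[i] = "N"
    return "".join(bases)


# Hypothetical contig with one flagged interval covering positions 2, 3 and 4.
masked = mask_regions("ACGTACGTAC", [(2, 5)])
print(masked)
```

Masking keeps the contig length and coordinates intact, which is the main difference from removing the flagged sequence outright.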

written 2.7 years ago by niuyw30

Yes, that is a valid approach.

Concerning the contamination: I'm not sure how well Kraken will work on long reads (or contigs). Regardless of the approach, I would indeed also try to remove the contamination (instead of masking it with Ns) at the contig level.

written 2.7 years ago by lieven.sterck10k

I found that more than 70% of the contigs were classified as contaminated, but most of them had only a small part (several k-mers) contaminated. Few sequences would be left if I removed whole contigs, so I want to mask the contaminated regions with 'N'. Maybe I should try removing contamination from the raw PacBio data. I will try both. Thank you for your reply!

written 2.7 years ago by niuyw30

Unless your assembly is producing chimeric contigs on a massive scale, this should not happen. Usually (in "theory") a whole piece of DNA either is contamination or is not.

I'm starting to suspect that your contamination criteria are too lenient and you're ending up with lots of false positives.

Are the Illumina data from the same biological samples as the PacBio data? If so, you could check whether you see the same level of contamination in the Illumina data as in the PacBio contigs.

written 2.7 years ago by lieven.sterck10k

The latest version of Kraken was used with default parameters, and the assembly was generated by Canu. According to the Kraken results, about 75% of the contigs were classified as archaeal, bacterial, or viral. I was also a bit shocked by this. The contaminated regions accounted for about 0.65% of the total length (8371218/1281768139).
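As a quick check of that last figure, the two numbers quoted above can be divided directly; the variable names are only for illustration.

```python
contaminated_bp = 8_371_218   # bases flagged as contamination (from the post)
total_bp = 1_281_768_139      # total assembly length (from the post)

fraction = contaminated_bp / total_bp
print(f"{fraction:.2%}")  # prints 0.65%
```

So although about three quarters of the contigs carry a flag, the flagged bases make up well under 1% of the assembly.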

I will run Kraken on the PacBio data. But the PacBio data and the Illumina data came from different batches of insects; each batch contained a dozen bugs.

written 2.7 years ago by niuyw30

I'm not very familiar with the output of Kraken, but the 75% does not seem consistent with the 0.6% of the total length (unless it is all the very small contigs?).

Not sure if it's possible, but from what you write I suggest you do some post-Kraken filtering and remove only those contigs that are for the most part marked as contamination. It looks to me like there might be plenty of contigs that have only a small number of bases assigned as contamination. If only 50 bp on a contig of 100 kb are reported by Kraken, I would not consider that contig to be contamination. Remember that genuine eukaryotic contigs can also show some similarity to non-eukaryotic sequences.
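A minimal sketch of such a post-filter in Python, assuming you have already tabulated, per contig, its length and the number of bases flagged as contamination. The `ContigStats` record, the 0.5 cutoff, and the example contigs are assumptions for illustration; none of this is Kraken's own output format.

```python
from dataclasses import dataclass


@dataclass
class ContigStats:
    name: str
    length: int            # contig length in bp
    contaminated_bp: int   # bases flagged as contamination


def split_by_contamination(
    contigs: list[ContigStats], min_fraction: float = 0.5
) -> tuple[list[str], list[str]]:
    """Return (keep, drop): drop only contigs flagged over more than min_fraction."""
    keep, drop = [], []
    for c in contigs:
        fraction = c.contaminated_bp / c.length if c.length else 0.0
        (drop if fraction > min_fraction else keep).append(c.name)
    return keep, drop


# Hypothetical examples: 50 bp flagged on a 100 kb contig vs. a mostly flagged one.
stats = [
    ContigStats("contig_1", 100_000, 50),
    ContigStats("contig_2", 4_000, 3_600),
]
keep, drop = split_by_contamination(stats)
print(keep, drop)
```

With a majority cutoff like this, the 100 kb contig with 50 flagged bases is kept and only the mostly flagged contig is removed. The right threshold is a judgment call and worth checking against a few contigs by hand.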

written 2.7 years ago by lieven.sterck10k

Yes, I agree with you. Because I would need to set a threshold to filter out contaminated contigs, I now prefer to filter the raw data instead. Thank you very much!

written 2.7 years ago by niuyw30