Question: problem in estimating BAC size using k-mer method
2.7 years ago by
gangireddy160
gangireddy160 wrote:

Hi people,

I am trying to assemble a BAC clone sequence from PacBio reads. The assemblies from Celera Assembler and Canu both yield a single contig, but the two contigs differ in length by 15 kb.

So, in order to estimate the target size, I followed the link below:

K-mer analysis and genome size estimate

The graph I obtained has two peaks and does not follow a Poisson distribution. I am confused about which peak to use for calculating the estimated target size, since the two peaks give completely different estimates.

assembly • 732 views
modified 2.7 years ago by SES • written 2.7 years ago by gangireddy

Can you link to the image of the distribution? If you're dealing with a diploid, you should probably use the second peak, but seeing the distribution would clarify things a lot. Also, what organism is it for?

written 2.7 years ago by Brian Bushnell

image link

It is the sequence of a B. mori BAC.

modified 2.7 years ago • written 2.7 years ago by gangireddy

That does not really look like 2 peaks to me, but rather, one jagged peak. Normally, for one peak, the genome size is the area under the curve excluding error kmers. But in this case there is no clear distinction. I agree with other comments that this is not really a good scenario to try kmer-based genome-size estimation.
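For reference, the single-peak calculation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any particular tool: the function name `estimate_size_from_histogram`, the `min_depth` cutoff, and the toy histogram in the usage lines are all made up for the example. It sums the k-mer occurrences at or above the cutoff (the area under the curve without the error k-mers) and divides by the depth of the main peak.

```python
def estimate_size_from_histogram(hist, min_depth):
    """Estimate target size from a k-mer multiplicity histogram.

    hist      -- dict mapping multiplicity -> number of distinct k-mers
                 seen that many times (hypothetical input format)
    min_depth -- multiplicities below this are treated as error k-mers
                 (an assumed, user-chosen cutoff)
    """
    # Drop the low-multiplicity bins that are assumed to be sequencing errors.
    kept = {depth: count for depth, count in hist.items() if depth >= min_depth}
    if not kept:
        raise ValueError("no histogram bins at or above min_depth")

    # Area under the curve: total k-mer occurrences in the retained bins.
    total_kmers = sum(depth * count for depth, count in kept.items())

    # Main peak: the retained multiplicity with the most distinct k-mers.
    peak_depth = max(kept, key=kept.get)

    return total_kmers / peak_depth, peak_depth


# Toy histogram, invented for illustration only.
toy_hist = {1: 5000, 2: 800, 18: 100, 19: 300, 20: 1000, 21: 300, 22: 100}
size, peak = estimate_size_from_histogram(toy_hist, min_depth=3)
print(f"peak depth {peak}, estimated size {size:.0f}")
```

With a jagged histogram like the one in this thread there is no obvious choice for `min_depth` or for the peak, so the estimate would swing widely depending on what is picked.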

written 2.7 years ago by Brian Bushnell
2.7 years ago by
SES8.1k
Vancouver, BC
SES8.1k wrote:

I would personally forget about the k-mer approach. Having worked with BACs for years, in the wet lab and computationally, I would simply look up the library information. You should be able to find the average insert size for your library, and if there is a physical map you may be able to find information about the clone. It depends on how/who made the library, but there will be quite large differences between BACs regardless of the assembler. That first step will tell you if you are in the ballpark in terms of assembly size.

What you are showing is also expected, which is differences between assemblers. I don't think those numbers are unexpected. You just have to decide which is more likely correct based on the data (bearing in mind those tools were designed for different purposes), the biology, and the assembly statistics. The classic question of what is "better" kind of depends on what you want to do. A larger N50 or total length isn't necessarily more correct. To me, that statistic on the length doesn't mean very much without some context. Sorry if that is vague but I can be more specific if you'd like to provide more information.
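As a small illustration of why N50 alone says little here, this is a minimal sketch of how N50 is usually computed. The function name `n50` is just a label for the example. For a one-contig assembly it simply returns the contig length, so it cannot distinguish a correct contig from one that is too long.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths.

    N50 is the length of the contig at which the running total,
    taken from longest to shortest, reaches half the assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# A single-contig assembly: N50 is just the contig length.
print(n50([219650]))
# A made-up multi-contig example for comparison.
print(n50([80, 70, 50, 40, 30, 20, 10]))
```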

written 2.7 years ago by SES

What if the BAC is the "genome"? The reference to "target size" keeps things vague, though.

written 2.7 years ago by genomax

Can you elaborate please? I'm not sure what you are suggesting with either part of the comment. The sequence/size of a BAC vector is already known. What you are trying to determine is the insert size (of the clone).

modified 2.7 years ago • written 2.7 years ago by SES

I was thinking that OP is trying to use k-mer information alone to estimate the size of the "genome" (which in this case would be whole BAC). It is possible that my thinking is completely off target.

written 2.7 years ago by genomax

No worries, the post is not really a computational/bioinformatics topic. When you do a BAC prep/digest to extract the clone the common approach is to run it out on a gel for QC before sequencing. Whoever picked the clone would have done that. The smart approach would be to gather this info instead of trying computational approaches IMO.

written 2.7 years ago by SES

The average insert size of the library is 168 kb, and the assembled contigs are 219650 and 234947 bp long. I don't think it was run on a gel, as the sequences also contained E. coli sequences, which I removed manually.
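To put those numbers side by side, here is a small sketch of the arithmetic. The function name `compare_sizes` and its output keys are made up for the example, and the two lengths are passed without assembler labels because the post does not say which assembler produced which contig.

```python
def compare_sizes(lengths, library_mean):
    """Summarise how assembled contig lengths relate to each other
    and to the library's average insert size (all values in bp)."""
    longest, shortest = max(lengths), min(lengths)
    difference = longest - shortest
    return {
        # Absolute gap between the two assemblies.
        "difference_bp": difference,
        # The same gap as a percentage of the longer contig.
        "difference_pct_of_longer": 100 * difference / longest,
        # How far each contig sits above the library average.
        "excess_over_library_mean_bp": [n - library_mean for n in lengths],
    }


# Values quoted in the thread: 168 kb library average, two contig lengths.
summary = compare_sizes([219650, 234947], library_mean=168000)
for key, value in summary.items():
    print(key, value)
```

Both contigs come out well above the library average, and the gap between them is a few percent of the total length.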

written 2.7 years ago by gangireddy

All BAC data contains E. coli initially, that is the vector. This is the main reason to know the BAC library/clone information (to clean up the data). Sizing the insert on a gel would be done before the library is made, which is long before the sequencing is done (you can search the web for protocols). Those assembly size ranges are normal in my experience, and that looks like a nice BAC library! The typical approach is to try to "finish" the BAC as much as possible rather than focusing on the exact size.

written 2.7 years ago by SES

We are trying to do a de novo assembly of a chromosome using a BAC library. So, if the BAC assemblies are not up to the mark, the final assembly might have more problems. The difference in size is around 15 kb, and this is what worries me.

modified 2.7 years ago • written 2.7 years ago by gangireddy