I have assembled many chloroplast genomes, and this is something we are doing on a fairly large scale now to complement other phylogenomic approaches. First, I don't think 100k reads will give you enough coverage to assemble a complete genome, but there are other good options. It is likely that you have other sequence data, such as WGS reads or some kind of sequence capture data. In my experience, all sequence data sets in plants, whether from targeted or shotgun approaches, will contain chloroplast and mitochondrial fragments. You should be filtering these reads out anyway, since they are a source of contamination, but the filtered reads can then be combined with other data and used to assemble the chloroplast genome.
One very important step is to calculate the estimated coverage of the genome you are assembling; you can get that estimate by mapping your reads to a closely related species. Picking an appropriate coverage cutoff will lead to a more complete and contiguous assembly. The most common mistake I see people make is trying to assemble with extremely high coverage (>1000X), which does not give good results (and takes longer to run).
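To make the coverage estimate concrete, here is a minimal back-of-the-envelope sketch. All the specific numbers (read length, mapped fraction) are hypothetical examples, not values from this thread; in practice you would get the mapped fraction by aligning your reads to a related chloroplast reference and counting what maps.

```python
def estimated_coverage(n_reads, read_length, mapped_fraction, genome_size):
    """Expected depth = (bases from reads that map to the plastome) / genome size."""
    return n_reads * read_length * mapped_fraction / genome_size

# Hypothetical example: 100k 150 bp WGS reads where ~5% map to a related
# chloroplast reference, against a typical ~150 kb plastome.
cov = estimated_coverage(100_000, 150, 0.05, 150_000)
print(f"{cov:.0f}X")  # 5X -- illustrates why 100k WGS reads alone may not be enough
```

The point of the exercise: with shotgun data only a fraction of reads are plastid, so the raw read count greatly overstates the usable coverage.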
I don't think it is possible to resolve the inverted repeat (IR) regions de novo (at least, in my experience), but it can be done with a reference.
My advice would be to assemble your genome with Newbler or MIRA as Leonor suggested, then use ABACAS to order your contigs relative to the reference. That will help you fill in gaps and transfer annotations. The caveat with this approach is that you need a reference from a closely related species, because you are assuming the order is the same. Depending on the species being compared, this is probably a safe assumption, since chloroplast genomes evolve more slowly than nuclear genomes.
Is this the ABACAS software you recommend? http://www.ncbi.nlm.nih.gov/pubmed/19497936 Thanks.
Yep, that's the one. The project is on sourceforge.
I have 3GB of data for one chloroplast genome; is that too much? Could you tell me how much I should sequence?
I'm not sure if you mean 3 gigabases or a 3 gigabyte file, but a chloroplast genome is typically about 150 kb. You can use that as a guide to figure out how much coverage you have. If your coverage is really high (e.g., >200X), then I would downsample the data.
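For example, if "3GB" means roughly 3 gigabases of sequence, the arithmetic works out like this (taking the typical ~150 kb plastome size mentioned above):

```python
# Rough coverage if all 3 Gb of data were chloroplast sequence.
total_bases = 3_000_000_000   # ~3 gigabases
genome_size = 150_000         # typical chloroplast genome, ~150 kb
coverage = total_bases / genome_size
print(f"{coverage:,.0f}X")    # 20,000X -- far above the ~200X where downsampling helps
```

Even if only a small fraction of those reads are plastid, you would still likely be well past the point where more depth helps the assembly.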
I have 3GB of data for one sample; is that too much? Does it mean I should drop the extra data (randomly?) to get good assembly results?
I will continue to sequence more chloroplast genomes. Could you please tell me how much data I should get for each genome, like 300M?
I hope all these questions are not a bother.
No worries, I don't mind answering questions. Yes, you should downsample the data randomly to achieve the desired coverage. Think in terms of X coverage of the genome, since that makes more sense; in that case, I would try a few assemblies between 60X and 200X to see which is best. Likely, the best assembly will be in that range.
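A minimal sketch of what random downsampling to a target coverage looks like; the function names and the 20,000X starting depth below are hypothetical illustrations, not part of this thread. In real pipelines, a common approach is a tool such as seqtk's `sample` subcommand with a fixed seed (so paired-end files stay in sync), but the logic is the same:

```python
import random

def downsample_fraction(target_cov, current_cov):
    """Fraction of reads to keep to go from the current depth to the target depth."""
    return min(1.0, target_cov / current_cov)

def sample_reads(reads, fraction, seed=42):
    """Keep each read independently with the given probability (seeded for reproducibility)."""
    rng = random.Random(seed)
    return [r for r in reads if rng.random() < fraction]

# Hypothetical example: dropping 20,000X data down to ~100X keeps 0.5% of reads.
frac = downsample_fraction(100, 20_000)
print(frac)  # 0.005
```

For paired-end data, sample both mates of a pair together (or run the same seeded sampler over both files) so the pairing is preserved for the assembler.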
How would you downsample? Digital normalization with C=60? Or randomly sampling X number of reads? Thank you.
For reference, I created an application called Chloro to make the process of assembling chloroplast genomes a bit easier. It has some nice features and worked pretty well for us. There are also a couple of things I would like to improve if I need to do this work again in the future, but I'm too busy right now to tinker with the performance/accuracy of side projects. Maybe it will help someone.