I'm trying to analyze amplicon NGS data from several samples and am not sure what the best approach would be. I have 100 bp reads (both paired-end and single-end) from about 100 diploid individuals. Since no reference genome is available, I thought of assembling everything together into contigs and then using these as a "reference" to map each individual to. I've tried several programs, but none did a decent job. Since this is not a typical NGS dataset, I am not sure these programs are appropriate for it. Some challenges might be:
- higher diversity, since I need to assemble several different individuals (in some cases, maybe even closely related species)
- very high coverage, since I'm pooling several dozen individuals
- some programs perform error correction; since it is based on k-mer frequency, is it appropriate for this kind of data?
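To make the error-correction worry concrete, here is a minimal sketch in Python of what I understand by k-mer frequency. The function name `kmer_counts` and the toy reads are made up for illustration and do not come from any assembler. In a pooled dataset, a variant carried by a single individual produces low-count k-mers, which is also what a sequencing error would look like.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

# Toy pool: nine reads with the common allele, one read with a rare variant
pool = ["ACGTAC"] * 9 + ["ACGAAC"]
counts = kmer_counts(pool, k=4)

# k-mers from the rare variant appear once; a frequency cutoff
# could treat them as errors even though they are real
rare = sorted(kmer for kmer, n in counts.items() if n == 1)
print(rare)
```

I assume a frequency-based corrector would treat those single-count k-mers as errors, but I don't know how each program actually sets its threshold.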
Also, should I:
- assemble all individuals together, or assemble each of them separately and merge the assemblies afterwards? If the latter, any suggestions on how to do the merging? If the former, should I pool all individuals or just a small subset (e.g. 5 to 10)?
- go for k-mer-based or overlap-based assemblers?
- remove completely identical reads or leave them to give more support to the contigs?
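On the last point, this is roughly what I mean by removing identical reads, as a minimal Python sketch. The function name `dereplicate` and the toy reads are hypothetical. It collapses exact duplicates but keeps a count for each unique sequence, so the abundance information is not lost even if the assembler only sees one copy.

```python
from collections import Counter

def dereplicate(reads):
    """Collapse exactly identical reads, keeping how often each was seen.

    Returns (sequence, count) pairs, most abundant first.
    """
    return Counter(reads).most_common()

reads = ["ACGT", "ACGT", "ACGA", "ACGT", "TTTT"]
for seq, n in dereplicate(reads):
    print(seq, n)
```

This only handles exact string matches on single reads; for paired-end data I assume both mates would need to be identical for a pair to count as a duplicate.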
What programs do you recommend for this kind of data? Any hints/tips/ideas?