Assemble Amplicons from several samples
3.9 years ago
FGV ▴ 130

Dear all,

I'm trying to analyze amplicon NGS data from several samples and am not quite sure what the best approach would be. I have 100 bp read data (both paired and single end) from several (~100) diploid individuals. Since there is no reference genome available, I thought of assembling everything together into contigs and then using this as a "reference" to map each individual to. I've tried several programs, but none did a decent job. Since this dataset is not typical NGS data, though, I am not sure these programs are appropriate for it. Some challenges might be:

• higher diversity, since I need to assemble several different individuals (in some cases, maybe even closely related species)
• very high coverage, since I'm pooling several dozens of individuals
• some programs perform error correction; since it is based on k-mer frequency, is it appropriate for this kind of data?
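To make the error-correction concern concrete: k-mer-based correctors generally treat low-frequency k-mers as likely errors, so it can help to look at the k-mer frequency spectrum of the pooled reads before trusting that step. The following is only a minimal sketch, assuming the reads are already loaded as plain strings; the function name `kmer_spectrum` is hypothetical, and a real dataset would need a dedicated k-mer counter rather than an in-memory dictionary.

```python
from collections import Counter

def kmer_spectrum(reads, k):
    """Count k-mers and summarize how many distinct k-mers occur at each multiplicity."""
    kmer_counts = Counter()
    for read in reads:
        # Slide a window of length k along the read.
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1
    # Histogram: multiplicity -> number of distinct k-mers seen that many times.
    spectrum = Counter(kmer_counts.values())
    return kmer_counts, dict(sorted(spectrum.items()))

# Toy example: the third read differs from the other two near its end.
counts, spectrum = kmer_spectrum(["ACGTA", "ACGTA", "ACGAA"], k=3)
```

In pooled data, k-mers that carry a rare allele sit at the low end of this spectrum next to true sequencing errors, which is why frequency-based correction may deserve a second look here.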

Also, should I:

• assemble all individuals together, or each of them separately and merge the assemblies afterwards? If the latter, any suggestion on how it could be done? If the former, pool all individuals or just a small subset (e.g. 5 to 10)?
• go for k-mer or overlap-based assemblers?
• remove completely identical reads or leave them to give more support to the contigs?
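On the last point, a possible middle ground is to collapse identical reads but keep their counts, so the support information is not lost. This is a minimal sketch under stated assumptions: reads are plain strings already in memory, the function name `collapse_identical_reads` is hypothetical, and reverse complements and read pairing are ignored.

```python
from collections import Counter

def collapse_identical_reads(reads):
    """Collapse exact duplicate reads, keeping how many times each was seen."""
    counts = Counter(reads)
    # Most abundant first; ties are broken alphabetically for a stable order.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

unique_reads = collapse_identical_reads(
    ["ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC"]
)
```

Whether an assembler can make use of such counts depends on the program, so this only illustrates the idea.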

What programs do you recommend for this kind of data? Any hints/tips/ideas?

thanks,

NGS Assembly next-gen sequencing • 1.1k views

I think, most importantly, you need to tell us what this data is: bacteria, virus, human?

What are you trying to achieve? This will help us help you.


For now I was thinking of diploid non-model animal organisms.


"For now I was thinking of diploid non-model animal organisms."

Hmm. Have you or have you not done the experiment?

Were the sequencing libraries made from individual specimens or from a pool (100 libraries is a lot of work)?

How similar do you expect the individual genomes to be? Is this expected to be a particularly "special" genome (repeats, etc.)? What is the expected size of the genome, and how much data do you have (gigabases, theoretical fold coverage)?


The data has already been generated but I was trying to keep it broad/general so the method could be applied to other species.

Sequencing libraries were made per individual (a lot of work indeed but luckily I am just analyzing the data).

Right now I have data from one species but different populations, so not that divergent. Genome size is around 1-2 Gb, nothing special, but only ~10k loci are targeted (forgot to say it is amplicon data). Coverage per locus is around 10-20x.


"forgot to say it is amplicon data"

That is a critical bit of information (please add to the original post by editing it)!

Do you know if the loci are gene-centric or just random areas around the genome? What is the range of sizes for these amplicons?

You may want to give tadpole.sh from BBMap suite a try. It does well with assembly of small genomes (think viral). It may work in your case as well. @h.mon has a more formal recommendation in his answer below.


If you haven't done the sequencing yet, I would advise against this strategy and suggest instead that you do proper sequencing (deep Illumina paired end + mate pairs and/or long reads) of one "reference" individual; if you can get a haploid genome with some breeding/genetic trick, even better.

Then consider an appropriate design for any downstream experiments you might want to do.

edit: from the question + comments, it is not entirely clear if you already have the data or not. Also, what kind of data do you already have (platforms, library types, coverage)? What is the organism? Please edit your post and clarify these points.


"I have 100 bp read data (both paired and single end) from several (~100) individuals."

I think the sequencing has been done but @FGV does not want to tell us what organism the data is from :)


I did not design the experiment, I'm just helping with the analyses, but I think the idea is to use the same data twice: to build a "reference" for the relevant parts of the genome (why do a full reference genome if you only have amplicon data?) and to have population data for analyses.

Illumina single and paired end, 10-20x coverage

As for the species, it is some bird, but honestly I don't really care. I'd like to make it as general as possible so other people can use it in their own projects (and maybe even publish it as a method).


I am sorry to sound harsh (rude) here, but the data you are talking about seems to be a complex dataset. You cannot make a generic pipeline for something which has not been classified as a genome/new species.

Also, my apologies but I would encourage you to care about the species you are working with. Every species has its own characteristics and caring about the species is important.

My 2 cents!


I see your point but I'd like to avoid making a super specific pipeline that only works for a single species or under a very specific set of circumstances. That is why I did not focus on any specific species...

3.9 years ago
h.mon 32k

If this is amplicon data, your title is a bit (just a little bit) misleading. Not to worry, though.

I have limited but good experience with HybPiper for amplicon data. For amplicon data, it seems to me that it makes sense to assemble each individual separately, and then align the assemblies and evaluate how best to choose a "reference".


I'll give HybPiper a try but, if I assemble each individual separately, do you have any suggestions on how best to align/merge/assemble all the individual assemblies together?


These regions would (hopefully) assemble into 10,000 amplicons, though as you wade through them you will find that reality does not match that. I would be curious to see how many you end up with. You also did not answer my question about the size of these amplicons.

Once you have sets of amplicons from the individuals you can align them (multiple sequence alignment) and then generate a consensus. You may have 100 individuals but there is only one genome to represent the species.
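As a rough illustration of the "align, then generate a consensus" step: the sketch below takes sequences for one locus that have already been aligned by a multiple sequence aligner and calls the most common base in each column. It is a minimal sketch, not a recommended pipeline; the function name `majority_consensus` and the `-` gap character are assumptions, ties go to whichever base appears first in the column, and columns where a gap is the most common character are dropped.

```python
from collections import Counter

def majority_consensus(aligned, gap="-"):
    """Return a majority-rule consensus for equal-length aligned sequences."""
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    consensus = []
    # zip(*aligned) walks the alignment column by column.
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column)
        base, _ = counts.most_common(1)[0]
        if base != gap:  # skip columns where a gap is the most common character
            consensus.append(base)
    return "".join(consensus)

# Three individuals, one locus, already aligned.
locus_consensus = majority_consensus(["ACGT-A", "ACGTTA", "ACCT-A"])
```

For diploid individuals a plain majority call hides heterozygous sites, so in practice one might prefer IUPAC ambiguity codes at variable columns; that is left out here to keep the sketch short.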