Question: single eukaryote whole genome de novo assembly
drew.huangtao0 wrote, 5.5 years ago, from Australia:

Hi everyone

I am doing single-cell genomics; however, I am a newbie in bioinformatics.

My sample is a single picoeukaryote from the ocean. I did MDA followed by NGS on a HiSeq 2500 (PE 150 bp) with Nextera library preparation.

Here are my questions:

1) To get better assemblies, which tool do you recommend for trimming the sequences?

2) For single-cell eukaryote de novo assembly, I have not found a suitable assembler.
Could you recommend some? P.S. I tried several assemblers, e.g. IDBA-UD and SPAdes, but I can't get long contigs; the N50 is only around 1,500 bp.

3) Could you recommend some experts or research centers that specialize in de novo assembly?

4) BTW, the genome size of my sample is around 20-40 Mbp according to a reference paper.

My whole PhD project is stuck here; I hope you can help me out.
Thank you very much!

Cheers

Tags: sequence, next-gen, assembly
Brian Bushnell wrote, 5.5 years ago, from Walnut Creek, USA:

For MDA'd single cells, I recommend BBDuk for trimming (both adapter and quality) and SPAdes for assembly.  But 20-40 Mbp might be big for SPAdes; we usually use it on bacteria.  MDA'd single-cell data typically has very uneven coverage; depending on the degree of nonuniformity, it can be helpful to normalize the data (with, for example, BBNorm).  In my testing this often improves SPAdes assemblies (particularly if the depth is very high), and always improves Velvet assemblies.  Also, with 150 bp reads, be sure that you are using long kmers; in particular, SPAdes defaults to a max of 55, which is too short when you have good coverage.
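A minimal sketch of that trim/normalize/assemble pipeline, assuming BBTools and SPAdes are on the PATH; the file names and thresholds here are placeholders, not a definitive recipe:

```shell
# Adapter- and quality-trim with BBDuk (ref=adapters uses the bundled adapter
# set; substitute a path to the Nextera adapter fasta on older BBTools versions).
bbduk.sh in1=raw_R1.fq.gz in2=raw_R2.fq.gz out1=trim_R1.fq.gz out2=trim_R2.fq.gz \
    ref=adapters ktrim=r k=23 mink=11 hdist=1 tpe tbo qtrim=rl trimq=10

# Optionally flatten the uneven MDA coverage with BBNorm before assembly.
bbnorm.sh in1=trim_R1.fq.gz in2=trim_R2.fq.gz out1=norm_R1.fq.gz out2=norm_R2.fq.gz target=100

# Assemble in single-cell mode with an explicit list of longer kmers.
spades.py --sc -1 norm_R1.fq.gz -2 norm_R2.fq.gz -k 21,33,55,77,99,127 -o spades_out
```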

If your inserts are short enough to overlap, it may help to first merge the reads with BBMerge, then assemble.  That will allow you to use a longer kmer, and will reduce the error rate in the reads.
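A sketch of the merging step with BBMerge (file names are placeholders); merged reads and leftover pairs are kept separately so both can go into the assembler:

```shell
# Merge overlapping read pairs into single longer, lower-error reads;
# pairs that do not overlap go to outu1/outu2.
bbmerge.sh in1=trim_R1.fq.gz in2=trim_R2.fq.gz out=merged.fq.gz \
    outu1=unmerged_R1.fq.gz outu2=unmerged_R2.fq.gz
```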

Do related organisms seem to have a very high repeat content, or do your reads have extreme GC content?  It could be that the bad assemblies are simply inherent to the organism rather than the methodology.  Also, what are your coverage, insert-size, and quality distributions like?

Posting FastQC results may help.
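As a quick sanity check on read GC independent of FastQC, a one-line awk tally works on any FASTQ; the file name and reads below are made up for illustration:

```shell
# Build a tiny example FASTQ (placeholder data), then compute overall %GC.
cat > reads.fq <<'EOF'
@r1
ATGCGC
+
IIIIII
@r2
AATTAT
+
IIIIII
EOF

# Line 2 of every 4-line FASTQ record is the sequence; gsub() both strips
# and counts the G/C bases, so record the length first.
awk 'NR % 4 == 2 { total += length($0); gc += gsub(/[GCgc]/, "") }
     END { printf "%.1f\n", 100 * gc / total }' reads.fq
# Prints 33.3 for this toy input.
```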


Hi 

Thank you so much!

"Do related organisms seem to have a very high repeat content, or do your reads have extreme GC content?"

According to the FastQC results, the GC content is around 30%.

"What are your coverage, insert-size, and quality distributions like?"

When sequencing, I targeted 100× coverage, and the insert size is around 379 bp to 1 kb (excluding adapters). The following is the FastQC report summary.

File type    Conventional base calls
Encoding    Sanger / Illumina 1.9
Total Sequences    18217036
Filtered Sequences    0
Sequence length    151
%GC    27

I am not sure how to post a FastQC report here. Could you tell me how, or could I send it to you by email?

Thanks

— drew.huangtao0, 5.5 years ago

The best way is to host it somewhere (such as on Google Drive) and post the link.  Google Drive allows you to make links public.

100× coverage is not very much for single-cell work because the coverage is highly biased.  We usually target at least 2000×, and doing multiple single cells of the same organism, if possible, can get you a much more complete assembly.  Also, Illumina does not sequence efficiently with inserts over 800 bp, to my knowledge.  With 151 bp reads, you should trim the last 1 bp from all reads, regardless of quality score, as it is inaccurate.  And 27% GC...  that's pretty extreme.  It makes everything more difficult.  So, at least, compare your N50 to the N50 of other single-cell assemblies of ~27% GC organisms, which will make it look better :)
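A sketch of that last-base trim with BBDuk, assuming 151 bp reads and placeholder file names; forcetrimright (ftr) is 0-based and exclusive, so ftr=149 keeps bases 0-149, i.e. the first 150:

```shell
# Hard-trim every read to 150 bp by dropping the unreliable 151st base,
# regardless of its quality score.
bbduk.sh in1=raw_R1.fq.gz in2=raw_R2.fq.gz out1=ft_R1.fq.gz out2=ft_R2.fq.gz ftr=149
```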

— Brian Bushnell, 5.5 years ago

For some samples, the GC content is around 50%.

Hopefully this time you can see the FastQC reports.

Thanks

https://drive.google.com/file/d/0B-TDr-LXO3YuRXdlSjdnMjBxaEE/view?usp=sharing

https://drive.google.com/file/d/0B-TDr-LXO3YuX2IzakViZlY1ejA/view?usp=sharing

— drew.huangtao0, 5.5 years ago

Both read GC histograms have multiple prominent peaks, so it's possible there is contamination...  though it's also possible those are organelles or other non-contaminant sequences.  If it is contamination, that may drop your coverage of the primary organism low enough to cause major problems in assembly.  I suggest you sort your assembled contigs by GC content and BLAST them against some large databases (nt, RefSeq, etc.) to see what they hit.
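The GC sort itself needs nothing beyond awk and sort; a sketch on toy contigs (contigs.fa and its sequences are placeholders):

```shell
# Build a tiny example FASTA (placeholder contigs).
cat > contigs.fa <<'EOF'
>c1
ATATATAT
>c2
GCGCGCGC
>c3
ATGCATGC
EOF

# Print "GC-fraction  header" per contig, lowest GC first; outliers at
# either end of the sorted list are the first candidates to BLAST.
awk '/^>/ { if (name) print gc / len, name; name = $0; gc = 0; len = 0; next }
          { len += length($0); gc += gsub(/[GCgc]/, "") }
     END  { if (name) print gc / len, name }' contigs.fa | sort -n
# c1 (GC 0) sorts first and c2 (GC 1) last for this toy input.
```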

It's hard for me to interpret the kmer enrichment graphs; I've never understood what the Y-axis means in those plots.  But it looks possible that you have adapter contamination due to short insert sizes, although that didn't show up in the base-frequency histogram.  Regardless, I suggest mapping the reads against an assembly to get an estimate of the actual insert-size distribution.  You can also generate that by running BBMerge like this:

bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist.txt reads=1m

...which will only take a few seconds and tell you what you need to know: whether you have a lot of inserts shorter than the read length.

— Brian Bushnell, 5.5 years ago
Powered by Biostar version 2.3.0