Hi everyone
I am doing single-cell genomics, but I am a newbie in bioinformatics.
My samples are single picoeukaryote cells from the ocean. I did MDA followed by NGS on a HiSeq 2500 (PE 150 bp) with Nextera library preparation.
My questions are:
- To get better assemblies, which trimming tool do you recommend for trimming the reads?
- For de novo assembly of single eukaryotic cells, I have not found a suitable assembler. Could you recommend some? P.S. I tried several assemblers, such as IDBA-UD and SPAdes, but I can't get long contigs; the N50 is only around 1500 bp.
- Could you recommend some experts or research centers that specialize in de novo assembly?
- BTW, the genome size of my sample is around 20 Mbp to 40 Mbp according to a reference paper.
My whole PhD project is stuck here, so I hope you can help me out.
Thank you very much!
Cheers
Hi
Thank you so much!
Do related organisms seem to have a very high repeat content, or do your reads have extreme GC content?
According to the FastQC results, the GC content is around 30%.
What are your coverage, insert-size, and quality distributions like?
When I sequenced, I chose 100x coverage, and the insert size is around 379 bp to 1 kb (excluding adapters). The following is the FastQC report summary.
I am not sure how to post the FastQC report here. Could you tell me how, or could I send it to you by email?
Thanks
The best way is to host it somewhere (such as on Google Drive) and post the link. Google Drive allows you to make links public.
100x coverage is not very much for single-cell data because the coverage is highly biased. We usually target at least 2000x, and sequencing multiple single cells of the same organism, if possible, can get you a much more complete assembly. Also, to my knowledge, Illumina does not sequence efficiently with inserts over 800 bp. With 151 bp reads, you should trim the last base from all reads, regardless of quality score, as it is inaccurate. And 27% GC... that's pretty extreme. It makes everything more difficult. So, at least, compare your N50 to the N50 of other single-cell assemblies of ~27% GC organisms, which will make it look better :)
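In case it helps, here is a minimal Python sketch of the "trim the last base" step. It assumes a plain, uncompressed four-line-per-record FASTQ; the function name `trim_last_base` and the `full_length` parameter are just names I made up for illustration. Any standard trimming tool can do the same thing, so treat this only as a picture of what the step does.

```python
import io


def trim_last_base(fastq_in, fastq_out, full_length=151):
    """Copy FASTQ records, dropping the final base (and its quality
    character) from every read that is exactly full_length long.

    Assumes strict four-line records; returns the number of reads trimmed.
    """
    trimmed = 0
    while True:
        header = fastq_in.readline()
        if not header:  # end of file
            break
        seq = fastq_in.readline().rstrip("\n")
        plus = fastq_in.readline()
        qual = fastq_in.readline().rstrip("\n")
        if len(seq) == full_length:
            seq, qual = seq[:-1], qual[:-1]
            trimmed += 1
        fastq_out.write(f"{header}{seq}\n{plus}{qual}\n")
    return trimmed


if __name__ == "__main__":
    # Tiny in-memory demo with one 151 bp read (made-up data).
    demo = io.StringIO("@read1\n" + "A" * 151 + "\n+\n" + "I" * 151 + "\n")
    out = io.StringIO()
    print("reads trimmed:", trim_last_base(demo, out))
```

For real data you would pass open file handles instead of the `io.StringIO` objects, and run it on both the R1 and R2 files so the pairs stay in sync.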
For some samples the GC content is around 50%.
Hopefully this time you can see the FastQC reports.
Thanks
https://drive.google.com/file/d/0B-TDr-LXO3YuRXdlSjdnMjBxaEE/view?usp=sharing
https://drive.google.com/file/d/0B-TDr-LXO3YuX2IzakViZlY1ejA/view?usp=sharing
Both read GC histograms have multiple prominent peaks, so it's possible there is contamination... though it's also possible those are organelles or other non-contaminant sequence. If it is contamination, that may drop your coverage of the primary organism low enough to cause major problems in assembly. I suggest you sort your assembled contigs by GC content and BLAST them against some large databases (nt, RefSeq, etc.) to see what they hit.
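If you want a quick way to do the GC sorting, something like the following Python sketch should work. It assumes your contigs are in an ordinary FASTA file; the function names (`read_fasta`, `gc_fraction`, `contigs_by_gc`) are made up for this example, and the demo sequences are invented.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the contig name.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(seq):
    """GC fraction over unambiguous bases (N and other codes are ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def contigs_by_gc(handle):
    """Return (name, length, gc_fraction) tuples sorted by ascending GC."""
    rows = [(name, len(seq), gc_fraction(seq)) for name, seq in read_fasta(handle)]
    return sorted(rows, key=lambda row: row[2])


if __name__ == "__main__":
    demo = io.StringIO(">contig_a\nGGCCGGCC\n>contig_b\nATATATAT\n>contig_c\nATGCATGC\n")
    for name, length, gc in contigs_by_gc(demo):
        print(f"{name}\t{length}\t{gc:.3f}")
```

Contigs that cluster around a different GC value than the bulk of the assembly are the ones I would BLAST first.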
It's hard for me to interpret the kmer enrichment graphs; I've never understood what the Y-axis means in those plots. But, it looks possible that you have adapter contamination due to short insert sizes, although that didn't show up in the base frequency histogram. Regardless, I suggest mapping against an assembly to get an estimate of the actual insert size distribution. You can also generate that by running BBMerge like this:
...which will only take a few seconds and tell you what you need to know, namely whether you have a lot of inserts shorter than read length.
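Once you have an insert-size histogram from any tool, the number to look at can be computed with a few lines of Python. This is only a sketch: it assumes the histogram is a two-column text table (insert size, then count) with header lines starting with `#`. That layout is my assumption, so adjust the parsing to whatever your tool actually writes. The name `short_insert_fraction` is hypothetical too.

```python
def short_insert_fraction(hist_lines, read_length=150):
    """Fraction of pairs whose insert size is shorter than read_length.

    hist_lines: iterable of text lines, assumed to be "size<whitespace>count",
    with lines starting with '#' treated as comments or headers.
    """
    short = total = 0
    for line in hist_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        size_text, count_text = line.split()[:2]
        size, count = int(size_text), int(count_text)
        total += count
        if size < read_length:
            short += count
    return short / total if total else 0.0


if __name__ == "__main__":
    # Made-up histogram, for illustration only.
    demo = ["#InsertSize\tCount", "100\t10", "149\t10", "150\t30", "300\t50"]
    print(f"fraction shorter than read length: {short_insert_fraction(demo):.2f}")
```

If that fraction is large, the reads will run through the insert into the adapter, so adapter trimming becomes important before assembly.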