I used an Illumina GA II to sequence two paired-end DNA libraries:
one was sequenced in forward-reverse orientation, ~500 bp insert size (LIB1, FR);
the other was sequenced in reverse-forward orientation, ~2 kb insert size; this library was built after circularization (LIB2, RF).
When I assembled the reads, I got curious results. If I did not reverse complement the second library, I got a lower N50, but more reads could be assembled (N50 25k, 480M of scaffolds/contigs assembled).
When I reverse complemented the second library, I got a higher N50, but many reads could not be assembled (N50 46k, only 360M of scaffolds/contigs assembled).
I thought the second setting (reverse complementing the LIB2 reads) was right, but how could the wrong setting assemble more contigs/scaffolds?
The assembler I used is SOAPdenovo.
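For readers unfamiliar with the reverse-complement step discussed above, here is a minimal Python sketch of what it does to a single read. The function names are hypothetical and not part of SOAPdenovo; the only assumption is that reads are plain strings of A/C/G/T/N and that FASTQ quality strings must be reversed to stay aligned with the bases.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA read."""
    # Complement every base, then reverse the string.
    return seq.translate(_COMPLEMENT)[::-1]


def reverse_quality(qual: str) -> str:
    """Reverse a FASTQ quality string so it still lines up with the bases."""
    return qual[::-1]


if __name__ == "__main__":
    print(reverse_complement("AAGC"))
    print(reverse_quality("IIH#"))
```

Applying this to both mates of an RF (outward-facing) pair turns it into an FR (inward-facing) pair, which is the conversion being tested in the two assembly runs.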
How bad is the contamination? Do you have a histogram to post? Also, make sure that in the SOAPdenovo configuration file you set asm_flags=2 for the 2k library.
Hi, 480/360 actually means 480MB/360MB, the total length of contigs and scaffolds after assembly. Sorry for not clarifying.
The situation is that the wrong setting gives fewer contigs, while the right setting gives many more.
Setting FR: ~100k contigs and scaffolds in total, N50 25k, N90 4.5k.
Setting RF: ~290k contigs and scaffolds in total, N50 46k, N90 280 bp.
It seems likely that, under the right setting, a lot of the mate-pair information could not be used by the assembler, so many short contigs were never joined into scaffolds.
I have checked the data: the 2k library has two insert-size peaks (300 bp and 2.4 kb). If the 2 kb library is heavily contaminated, can the data still be used for assembly, or should I rebuild and resequence the 2 kb library?
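Since the comparison hinges on N50 and N90, here is a minimal Python sketch of how these statistics are usually computed from a list of sequence lengths. The function name `nxx` and the example lengths are made up for illustration; they are not taken from the assemblies in this thread.

```python
def nxx(lengths, fraction):
    """Return the Nxx value: the length L such that sequences of length >= L
    together cover at least `fraction` of the total assembled length."""
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0  # empty input


if __name__ == "__main__":
    # Hypothetical lengths, for illustration only.
    lengths = [100, 80, 50, 20, 10, 5, 5]
    print("N50:", nxx(lengths, 0.5))
    print("N90:", nxx(lengths, 0.9))
```

Because N50 only looks at the longest half of the assembled bases, a run can have a higher N50 and still leave a large tail of very short sequences, which is what a low N90 indicates.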
Can you give me the number of scaffolds only, in both settings?
Hi. Setting FR: 32319 scaffolds. Setting RF: 21301 scaffolds.
I don't know how to post histograms on this forum. ~25% of the mate pairs have an insert size of 1~400 bp.
And the assembled size of the scaffolds, in both settings? Your contamination doesn't sound like a huge problem, though. A few of my own libraries were 50% contaminated.
FR scaffolds: 32319 scaffolds assembling to 454 Mb. RF scaffolds: 21301 scaffolds assembling to 300 Mb. With such a high level of false RF mate pairs, did you filter out these reads?
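In place of a posted histogram, the contamination level can be summarized numerically. The sketch below assumes you already have a list of observed insert sizes (for example, from mapping the mate pairs back to contigs); the function names, the 400 bp cutoff default, and the example values are all illustrative assumptions, not data from this thread.

```python
from collections import Counter


def short_insert_fraction(insert_sizes, cutoff=400):
    """Fraction of pairs whose insert size is at or below `cutoff` bp."""
    if not insert_sizes:
        return 0.0
    short = sum(1 for size in insert_sizes if size <= cutoff)
    return short / len(insert_sizes)


def coarse_histogram(insert_sizes, bin_width=200):
    """Count pairs per bin; keys are the lower edge of each bin in bp."""
    counts = Counter((size // bin_width) * bin_width for size in insert_sizes)
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up insert sizes showing a short peak and a long peak.
    sizes = [300, 310, 2400, 2350, 2500, 290, 2450, 2380]
    print(short_insert_fraction(sizes))
    for lower_edge, count in coarse_histogram(sizes).items():
        print(f"{lower_edge:>5} bp  {'#' * count}")
```

The text histogram printed at the end can be pasted directly into a forum post.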
No, I did not use them, as they did more harm than good to my assembly. I would have filtered out the bad mates if I knew the truth from a reference or a close relative, but that is a luxury one does not always have in de novo assembly. Do the two sets of scaffolds contain a similar number of Ns? Also, if you kept the SOAPdenovo logs, what insert size does SOAPdenovo infer for the 2k lib in the scaffold step?
For scaffolds: FR has 23 Mb of gaps, 289850 gaps in total; RF has 38 Mb of gaps, 57517 gaps in total. I guess the problem was caused by the "GapCloser" process adopted by SOAP. With the FR setting, several contigs could be used to fill the gaps in the scaffolds. I have run logs for RF only. For the 2k libs:
2k_libA, insert size 1975: estimated PE size 56, insert_size estimated: 1180
2k_libB, insert size 2195: estimated PE size 45, insert_size estimated: 0
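The N and gap counts compared above can be obtained by scanning the scaffold sequences for runs of N. This is a minimal Python sketch assuming the scaffolds are already loaded as plain strings and that every run of consecutive N (or n) counts as one gap; `gap_stats` is a hypothetical helper, not a SOAPdenovo tool.

```python
import re

# One or more consecutive N bases, in either case, counts as a single gap.
_GAP_RUN = re.compile(r"[Nn]+")


def gap_stats(sequences):
    """Return (number of gaps, total N bases, mean gap length) over all sequences."""
    n_gaps = 0
    n_bases = 0
    for seq in sequences:
        for run in _GAP_RUN.finditer(seq):
            n_gaps += 1
            n_bases += run.end() - run.start()
    mean_gap = n_bases / n_gaps if n_gaps else 0.0
    return n_gaps, n_bases, mean_gap


if __name__ == "__main__":
    # Toy scaffolds, for illustration only.
    print(gap_stats(["ACGTNNNNACGT", "NNACGTNACGT"]))
```

Dividing the totals quoted in the thread the same way gives a mean gap of roughly 80 bp for the FR run (23 Mb over 289850 gaps) and roughly 660 bp for the RF run (38 Mb over 57517 gaps).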