Has anyone performed an assembly with Nextera mate pairs and seen the following problem?
We're doing mammalian assemblies using Nextera 8 kbp-insert mate pairs, and the results are abnormal. The total assembly size is much larger than expected because the scaffolds contain 1 Gbp of Ns. The total contig size, however, matches the genome size.
Some detailed information regarding the data and the assembly:
- mammalian genome
- Lib1: HiSeq 150 bp paired-end, 500 bp insert, 30x coverage
- Lib2: HiSeq 150 bp mate pairs, 3 kbp insert (not sure about protocol), 10x coverage
- Lib3: HiSeq 150 bp mate pairs, 8 kbp insert (Nextera), 30x coverage
- assembler: SOAPdenovo2 (latest version)
- Lib1 and Lib2 were adapter-trimmed using nesoni
- Lib3 was adapter-trimmed using nextclip, which had a positive impact on scaffold N50. For those familiar with nextclip: we kept only categories A, B and C.
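
For readers less familiar with SOAPdenovo2, the library setup above maps roughly onto a config file like the one the sketch below renders. This is not the actual config used: the file names, the rank order, and the reverse_seq values are assumptions for illustration, and reverse_seq for the mate-pair libraries in particular depends on the read orientation left by the trimming tool.

```python
# Sketch: render a SOAPdenovo2-style config for the three libraries described above.
# File names, rank order and reverse_seq values are illustrative assumptions,
# not the settings actually used for this assembly.

LIBS = [
    # asm_flags: 3 = use in contig and scaffold steps, 2 = scaffold step only
    {"name": "lib1_pe500", "avg_ins": 500, "reverse_seq": 0, "asm_flags": 3, "rank": 1},
    {"name": "lib2_mp3k", "avg_ins": 3000, "reverse_seq": 1, "asm_flags": 2, "rank": 2},
    {"name": "lib3_mp8k", "avg_ins": 8000, "reverse_seq": 1, "asm_flags": 2, "rank": 3},
]


def render_config(libs, max_rd_len=150):
    """Return the text of a config file with one [LIB] section per library."""
    lines = [f"max_rd_len={max_rd_len}"]
    for lib in libs:
        lines += [
            "[LIB]",
            f"avg_ins={lib['avg_ins']}",
            f"reverse_seq={lib['reverse_seq']}",
            f"asm_flags={lib['asm_flags']}",
            f"rank={lib['rank']}",
            # Hypothetical FASTQ file names derived from the library name.
            f"q1={lib['name']}_R1.fastq",
            f"q2={lib['name']}_R2.fastq",
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_config(LIBS))
```

Switching Lib3 between asm_flags=2 and asm_flags=3 in such a config is what decides whether it takes part in the contig step.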
Some extra steps we tried:
- When we assemble Lib1 and Lib2 together, the total scaffold size is what we expect (3 Gbp, 30 kbp scaffold N50). So all is fine here.
- When we assemble all libs together, the total scaffold size is too high (4 Gbp, 150 kbp scaffold N50).
- When Lib3 is untrimmed, the total scaffold size is far too high (6 Gbp) and the total contig size is also off (3.5 Gbp).
- Whether or not Lib3 is included in the contig step (asm_flags=3 vs. asm_flags=2) does not have a significant impact on the results.
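
In case anyone wants to compare numbers on their own data, here is a minimal sketch of how figures like the ones above (total scaffold size, number of N bases, ungapped size, scaffold N50) can be computed from a scaffold FASTA file. The function name and the example file path are hypothetical, and this is not necessarily how the figures in this post were obtained.

```python
# Sketch: gap and size statistics for a scaffold FASTA file.

def scaffold_stats(fasta_lines):
    """Compute total size, N count, ungapped size and N50 from FASTA lines."""
    lengths = []        # length of each scaffold, gaps included
    n_bases = 0         # total number of N/n characters
    cur_len = 0
    in_record = False
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if in_record:
                lengths.append(cur_len)
            cur_len = 0
            in_record = True
        elif in_record:
            cur_len += len(line)
            n_bases += line.upper().count("N")
    if in_record:
        lengths.append(cur_len)

    total = sum(lengths)
    # N50: length of the scaffold at which the running sum of lengths,
    # taken from longest to shortest, first reaches half the total size.
    n50 = 0
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total / 2:
            n50 = length
            break
    return {"total": total, "n_bases": n_bases,
            "ungapped": total - n_bases, "n50": n50}


if __name__ == "__main__":
    import sys
    # Usage (hypothetical path): python scaffold_stats.py asm.scafSeq
    with open(sys.argv[1]) as handle:
        print(scaffold_stats(handle))
```

If the ungapped size stays near 3 Gbp while the total climbs to 4 Gbp, the excess is entirely gap sequence, which is the pattern described above.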