Question: Larger-Than-Expected Total Scaffolds Size Using Nextera Mate Pairs
3
gravatar for Rayan Chikhi
3.7 years ago by
Rayan Chikhi1.2k
France, Lille, CNRS
Rayan Chikhi1.2k wrote:

Has anyone performed assembly with Nextera mate pairs and has seen the following problem?

We're doing mammalian assemblies using Nextera 8 kbp-insert mate-pairs, and the results are abnormal. The total assembly size is way larger than expected: because it has 1 Gbp of NNN's in scaffolds. The total contigs size matches the genome size.

Some detailed information regarding the data and the assembly:

  • mammalian genome
  • Lib1: HiSeq 150 bp paired-end 500 bp insert, 30x coverage
  • Lib2: HiSeq 150 bp mate-pairs 3kbp insert (not sure about protocol), 10x coverage
  • Lib3: HiSeq 150 bp mate-pairs 8kbp insert (Nextera), 30x coverage
  • assembler: SOAPdenovo2 latest
  • Lib1 and Lib2 were adapter-trimmed using nesoni
  • Lib3 was adapter-trimmed using nextclip (which had a positive impact on scaffold N50) and to those familiar with nextclip, we kept the A-B-C categories only.

Some extra steps we tried:

  • When we assemble Lib1 and Lib2 together, the total scaffolds size is what we expect (3 Gbp, 30 Kbp scaffold N50). So all is fine here.
  • When we assemble all libs together, the total scaffolds size is too high (4 Gbp, 150 Kbp scaffold N50).
  • When Lib3 is untrimmed, the total scaffolds size is terrible (6 Gbp) and contigs size is also odd (3.5 Gbp).
  • Whether Lib3 is included in the contigs step or not (asm_flags=2 or 3) does not have a significant impact on the results.
assembly • 3.5k views
ADD COMMENTlink modified 3.5 years ago • written 3.7 years ago by Rayan Chikhi1.2k
1

Did you check for identical duplicate reads - i.e. both ends are identical? Happens a lot in mate pair libraries.

ADD REPLYlink written 3.7 years ago by ugly.betty771.0k

Yes, Nextclip filtered those. For reference this is the duplicated reads report -- 60% were unique.

  n       Count   Percent
  1       149720460       57.95
  2       57356740        23.22
  3       25391361        10.28
  4       9071636 3.67
ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Rayan Chikhi1.2k
1

I'll add that Nextclip solved the Nextera artefact of very low insert mate-pairs.

Original library, mapped to the assembly with Lib1+Lib2, histogram created with bamstats.

After nextclip.

after nextclip

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Rayan Chikhi1.2k

Did you map the reads on the GapClosed assembly or the scaffolded one ?

ADD REPLYlink written 15 months ago by Picasa320
1

Rayan, would you please send me your SOAPdenovo2 configuration file, and the command line to my email addr. luoruibang@genomics.org.cn, thanks Manoj Samanta for directing me here.

ADD REPLYlink written 3.7 years ago by Aqua10

I've sent it to your email, but I actually see no reason not to make it public. Here it is:

cmdline:

SOAPdenovo-63mer all -s soapdenovo2.config -K 61 -o [output prefix] -R -a 400 -p 30

config file:

max_rd_len=200
[LIB]
avg_ins=200
reverse_seq=0
asm_flags=3
rank=1
map_len=32
q1=MD_NoIndex_R1.clipped.fastq
q2=MD_NoIndex_R2.clipped.fastq

[LIB]
avg_ins=2859
reverse_seq=1
asm_flags=3
rank=2
map_len=32
q1=M273_R1.clipped.fastq
q2=M273_R2.clipped.fastq

[LIB]
avg_ins=6455
reverse_seq=1
asm_flags=3
rank=3
map_len=32
q1=8kbp/nextclip/nextclipped_ABC_R1.fastq
q2=8kbp/nextclip/nextclipped_ABC_R2.fastq
ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Rayan Chikhi1.2k

SOAPdenovo2 authors requested it, here is the log for the map and scaff steps: http://pastebin.com/bJ2a7dUp

ADD REPLYlink written 3.7 years ago by Rayan Chikhi1.2k

A couple of people have recommended me to try another scaffolder -- that's a good idea, perhaps this is a SOAPdenovo2 specific problem, I'll explore this possibility.

ADD REPLYlink written 3.7 years ago by Rayan Chikhi1.2k

Lars Arvestad ‏requested the scaffolds distribution length (Lib1+Lib2, prior to scaffolding with Lib3), here it is:

http://s30.postimg.org/htqcz7581/output.png

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Rayan Chikhi1.2k
0
gravatar for Charles Warden
3.7 years ago by
Charles Warden4.9k
Duarte, CA
Charles Warden4.9k wrote:

Perhaps you can try out these pre-processing steps and see if that helps?

http://genomics-pubs.princeton.edu/prv/scripts.shtml#Part_I

You would skip step #3 unless you were worried about contamination (and that would be specific to your own experiment anyways)

ADD COMMENTlink written 3.7 years ago by Charles Warden4.9k

Thanks for answering. My understanding is that step #1 should not have any influence on the assembler (SOAPdenovo2 doesn't take into account qualities as far as I'm aware), and step #2 (adaptor trimming) is already taken care of by Nextclip. Host contaminant (step #3) is difficult to believe; if the extra 400 Mbp - 1 Gbp in scaffolds was contamination, it'd also be present in the contigs.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Rayan Chikhi1.2k
1

True - but the mononucleotide and quality trimming might be useful

ADD REPLYlink written 3.7 years ago by Charles Warden4.9k

Good point -- I realize that Nextclip didn't do quality trimming. I'll run nesoni on nextclip output and report the assembly results.

ADD REPLYlink written 3.7 years ago by Rayan Chikhi1.2k

That was a good idea apparently. Quality-trimming and adapter-trimming the reads using Nesoni, _after_ adapter-trimming them with Nextclip, further removed 20% of bases of low quality, or where the Nextera adapter was still present. I ran nextclip with recommended parameters from the manual (-min_length 25 --trim_ends 0 --remove_duplicates) and nesoni with default parameters.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Rayan Chikhi1.2k

To follow up: turns out that Nextclip-trimmed reads produced overall sightly more contiguous scaffolds (with SSPACE) than the Nextclip+Nesoni-trimmed reads (96 Kb vs 92 Kb, same assembly size). I suspect that further quality-trimming the reads reduces coverage.

ADD REPLYlink written 3.7 years ago by Rayan Chikhi1.2k
0
gravatar for Rayan Chikhi
3.7 years ago by
Rayan Chikhi1.2k
France, Lille, CNRS
Rayan Chikhi1.2k wrote:

Answering my own question because.. problem was solved by using a stand-alone scaffolder! instead of SOAPdenovo2's. Thanks to those who suggested it on Twitter.

I ran SSPACE, using the Nextera library (Lib3) trimmed by Nextclip, to scaffold the Lib1+Lib2 assembly. The results are now reasonable: 3.2 Gbp assembly, only 150 Mbp were added as NN's. The scaffold N50 is 95 Kb, which isn't as good as the 150 Kb produced by SOAPdenovo2, but at least the assembly doesn't have 1 Gbp of N's..

Other runs of SSPACE using different trimming parameters (nesoni only, nextclip + nesoni) did not produce significantly different results.

I'll report results with the BESST scaffolder once I have them -- for some reason Bowtie of SSPACE ran orders of magnitude faster than BWA-MEM of BESST on my node.

ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Rayan Chikhi1.2k
1

What do you mean by 'for some reason'? Bowtie is for exact or near exact search only, whereas BWA-MEM is more versatile (and hence the cost).

ADD REPLYlink written 3.7 years ago by ugly.betty771.0k

We still need to know why SOAPdenovo2 did not work. Maybe Ruibang will help.

ADD REPLYlink written 3.7 years ago by ugly.betty771.0k

Yes, I agree...

ADD REPLYlink written 3.7 years ago by Rayan Chikhi1.2k

I realize that Bowtie has less to do (it's in 0-errors mode in SSPACE), but I said "for some reason" because I suspect that my cluster node has other OS or hardware issues. Other benchmarks led me to believe so. BWA-MEM shouldn't take 24 hours to align 100M pairs using 15 threads, yet it did.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Rayan Chikhi1.2k
1

I suspect it is OS-related issues, but BWA-MEM can get slow in repetitive regions. Is that why SOAPdenovo failed?

http://www.homolog.us/blogs/blog/2014/03/31/bwa-mem-gets-very-slow-in-highly-repetitive-regions/

ADD REPLYlink written 3.7 years ago by ugly.betty771.0k

Those mate-pairs, despite being quality-filtered by Nextclip, indeed had significantly lower mapping-quality than other data that I typically map.

mapq mean       44.8779
mapq stdev      23.5473

Anecdotally, human tumor paired-end reads mapped to hg19 have the following stats:

mapq mean       52.3832
mapq stdev      17.3866

Stats computed using sam-stats from ea-tools.

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Rayan Chikhi1.2k

"BESST scaffolder" ? Could you enlighten me? Where could I find any details on this (new?) scaffolder?

ADD REPLYlink written 3.7 years ago by lutz.fr10

https://github.com/ksahlin/BESST

ADD REPLYlink written 3.7 years ago by Rayan Chikhi1.2k

Thanks a lot for the link !!

ADD REPLYlink written 3.7 years ago by lutz.fr10
0
gravatar for Rayan Chikhi
3.6 years ago by
Rayan Chikhi1.2k
France, Lille, CNRS
Rayan Chikhi1.2k wrote:

Also following up on a SOAPdenovo2-only assembly, I was able to get the assembly size down to 3.4 Gbp following advice from the SOAPdenovo team.

The stats are now: 600 Mbp of N's (instead of 1 Gbp before), 168 Kb scaf N50.

The changes were (in suspected order of importance, only a guess):

  1. Performed Nextclip on the 3kbp library as well, as it was also a Nextera lib
  2. Better insert sizes estimations in soapdenovo config: 250 for PE, 2500 for Lib2, 7500 for Lib3

Also noted that Lib2 had unusually large standard deviation (~1500), which may explain the bad scaffolding.

ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by Rayan Chikhi1.2k

By running SOAPdenovo2's GapCloser, the number of N's dropped from 600 Mbp to 160 Mbp and the assembly size dropped from 3.4 Gbp to 3.3 Gbp. While I cannot verify that the filled gaps are correct, I can only assume that SOAPdenovo2 over-estimated gap sizes during scaffolding, and gap-filling fixed that.

ADD REPLYlink written 3.5 years ago by Rayan Chikhi1.2k
0
gravatar for Rayan Chikhi
3.5 years ago by
Rayan Chikhi1.2k
France, Lille, CNRS
Rayan Chikhi1.2k wrote:

My final conclusion, whenever an assembly is larger than expected:

1) Use a different scaffolder

or

2) SOAPdenovo2 should produce an assembly of correct size, when parameters are tweaked in the following way:

  • Run Nextclip on all Nextera libs (if applicable)
  • Run SOAPdenovo2's GapCloser: it will fill gaps, which as a byproduct will correct bad gap size estimation for the filled gaps
  • Provide accurate insert size estimations to SOAPdenovo2. (What I typically do is a quick assembly using SOAPdenovo2 or Minia with inaccurate parameters, then extract contigs above 10 kbp using seqtk, then use this script https://gist.github.com/rchikhi/7281991.)

Thanks to everyone who responded to this thread, and to SOAPdenovo's team!

ADD COMMENTlink modified 3.5 years ago • written 3.5 years ago by Rayan Chikhi1.2k

For me Nextclip works well with the scaffolded output of SOAPdenovo, I mean my mate pairs are well oriented in RF sens. But once GapCloser done, I come up with a lot of FR or tandem mate pairs.

Did you have the same problem ?

I map my reads with bowtie2 and compute my insert size with Picard.

ADD REPLYlink written 15 months ago by Picasa320
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 540 users visited in the last hour