Recommendations For Heterozygous Genome Assembly Software
10
7
Entering edit mode
10.1 years ago
Leszek 4.2k

I'm trying to assemble small (20Mb), diploid fungal genome from MiSeq reads (~400bp after merging, 100x coverage).

The tricky thing: it's heterozygous. The divergence is ~4%, but there are hundreds (thousands?) loss of heterozygosity (LOH) regions, accounting in total to almost half of the genome...

Do you know of any assembler (methodology) capable of handling with such data?

I have tried many assemblers/tricks:

• de Bruijn graph: Velvet, ABySS, SOAPdenovo
• overlap-based: MIRA, Newbler
• clustering reads and then assembling with CAP3

but with rather bad effects. Every time, the assembly was very fragmented (N50 ~10kb), homozygous regions (LOH) were collapsed with ~200x, while heterozygous regions were separated with 100x.

I will be happy to hear about any ideas:)

EDIT

As no satisfactory solution exists, I'm developing Redundans (check below for more info).

assembly miseq illumina genome • 17k views
0
Entering edit mode

Hello Leszek,

I am attempting to run your program but am having issues. I first installed using the pre-compiled binaries via your link and then tried to manually install all of the programs. After confirming all the dependencies have been acquired I am still getting some errors in the redudans.py script. Any suggestions are greatly appreciated! I am running ubuntu 16.04 on an Intel Core i7.

>~/src/redundans\$ ./redundans.py -v -i test/*.fq.gz -f test/contigs.fa -o test/run1
Options: Namespace(fasta='test/contigs.fa', fastq=['test/5000_1.fq.gz', 'test/5000_2.fq.gz', 'test/600_1.fq.gz', 'test/600_2.fq.gz'], identity=0.51, iters=2, joins=5, limit=0.2, linkratio=0.7, log=<open file '<stderr>', mode 'w' at 0x7f003a9bc1e0>, mapq=10, minLength=200, nocleaning=True, nogapclosing=True, noreduction=True, noscaffolding=True, outdir='test/run1', overlap=0.66, resume=False, sspacebin='~/src/SSPACE/SSPACE_Standard_v3.0.pl', threads=4, verbose=True)

>Traceback (most recent call last):
File "./redundans.py", line 450, in <module>
main()

>  File "./redundans.py", line 445, in main
o.verbose, o.log)

>File "./redundans.py", line 271, in redundans
os.makedirs(outdir)

>File "/usr/lib/python2.7/os.py", line 157, in makedirs
mkdir(name, mode)

>OSError: [Errno 13] Permission denied: 'test/run1'

0
Entering edit mode

This is not an answer but rather another question. It would be better to open a new thread for your question if you need people to respond quickly.

Also, the error seems to be with test/run1, do you have a sub-directory named test in your directory already?

0
Entering edit mode

thanks, I'll ask a new question. But the directory does not. Using my own data set, the error is now failure to recognize GapCloser

> spades/CPBscaffolds.fasta -o redundansCPB


despite GapCloser in directory:

src/redundans
>CHANGELOG.md          fastq2insert_size.py   GapFiller_v1-10_linux-x86_64
docs                  fastq2insert_size.pyc  INSTALL.sh
fasta2homozygous.pyc  fastq2sspace.py        pyScaf.py
fasta_stats.pyc       GapCloser              test

0
Entering edit mode

you need to add the path to GapCloser to your system PATH. Have a look how to do it: https://github.com/lpryszcz/redundans/issues/23

5
Entering edit mode
7.9 years ago
Leszek 4.2k

As there were no good solutions, I've developed a tool that assists assembly of heterozygous/hybrid genomes (Redundans).

The program takes as input assembled contigs, paired-end and/or mate pairs sequencing libraries and returns scaffolded homozygous genome assembly. Resulting assembly should be less fragmented and with total size smaller than the input contigs. In addition, Redundans will automatically close the gaps resulting from genome assembly or scaffolding (more details).

Hopefully, someone else will benefit from that! I'll be really grateful for your feedback.

BTW: sorry for being necromancer;)

1
Entering edit mode

Sounds interesting. Two quick questions:

1. What is the practical upper limit for genome size to be used with your pipeline
2. Since your manual already mentions SPAdes, have you looked at how your tool performs in comparison to dipSPAdes?
1
Entering edit mode
1. To be honest, I don't know and I would love to check that. What is your genome of interest? I've been playing mostly with fungal genomes (10-50Mb) and recently with Arabidopsis (119Mb). I would like to try it on larger genomes, but don't have the right data.
2. Redundans works better than dipSPAdes especially if:
• you have some recombinations (loss of heterozygosity) in your genome
• tiny (1-3%) or large (>10%) divergence between heterozygous copies
• your genomic libraries are with different read length (then all de novo assemblers will have problems).

For more details, have a look in the manuscript.

0
Entering edit mode

Thanks a lot for the manuscript, it was quite informative. I have two quite different sets: a couple of 30-50 Mb diploid protists with asexual reproduction - quite heterozygous but w/o recombination, and a ~3Gb heterozygous und quite repetitive plant genome with a very fragmented draft assembly...

For the protists I use overlapping MiSeq 300 bp libraries and (corrected) PacBio reads. I'm pretty happy with the dipSPAdes hybrid assemblies, but I'd like to see what Redundans can do. Although, given my data I'll probably only be able to run reduction and than use sspace-longread/pbJelly instead the sspace/gapcloser combi for scaffolding.

0
Entering edit mode

Try running full Redundans pipeline for protists genomes - sometimes I got improvement with just one paired-end library. But of course the best would be to have some mate-pairs. If you feel like, you can split PacBio reads and create artificial mate-pair reads. Anyway, sspace-longread should work good.

What kind of libraries do you have for your plant genome?

Let me know if you have some problems, I'll be happy to help!

1
Entering edit mode

Yeah, I was thinking about using artificial MP libs from PacBio reads as well. I let you know, how it works.

For the plant genome I have 100bp overlapping HiSeq reads plus 500 and 800 bp insert size PEs, and also MP libs from 2kbp to 20kbp, however only with 49bp read length, which makes them hard to map uniquely. And there are a lot of repeats, which no doubt will interfere with redundancy reduction. I'll give it a try as well after it worked for the small genomes.

0
Entering edit mode

I think your plant dataset (3xPE + 2xMP) is perfect to try Redundans. I've been playing with 100bp PE (assembly+scaffolding) and 50bp MP (scaffolding) libraries with good results. You are right, many reads will fail to align uniquely, but those that will should improve assembly considerably!

Since you have overlapping PE, you may also try ALLPATH-LG in haploidify mode. But you will need some powerful machine to launch it.

1
Entering edit mode

I used ALLPATHS-LG with haploidify for the draft assembly - a 2TB RAM machine was necessary to run it without digital normalization :)

0
Entering edit mode

I know, it's crazy. I needed 128GB RAM for 20Mb genome! How it performed? In my case, so so, especially taking into account the time and resources needed for the run...

0
Entering edit mode

Well, it handled the data set (which not every assembler managed to do in the first place) and it produced the "best" assembly of all the things I tried (including SOAP-denovo, ABySS, Masurca, Platanus ...). But we are talking N50 < 20kb for scaffolds and given the computational requirements - it's not like I can just rerun it a few times to optimize configuration...

0
Entering edit mode

Hello thackl. I was interested about your experience with Platanus software, and I am starting to use Platanus and it gives me this error: K = 31, saving kmers from reads... Error(5): Kmer distribution exception!! Kmer distribution cannot be caluculated proper. Do you have any good suggestions for this problem? Thanks!

0
Entering edit mode

Hello thackl. I was wondering about your experience with dipSPAdes, did you try to estimate unique gene content prior and after using dipSPAdes? I found the algorithm to reduce the assembly size considerably and even found cases where tRNA synthatases were missing from the dipSPAdes assembly but not input.

1
Entering edit mode

Hello Leszek, Thanks for mentioning Redundans, I have tested it on a few recent assemblies and I am quite pleased. I have a question however, your default parameters are a minimum identity of 51% and minimum overlap of 66%, do you not find those criteria a little bit too low (i.e. that you will remove contigs that should not be removed because they contain unique information)? Typically assemblers remove bubbles ate 80-90% identity, no?

1
Entering edit mode

Good point, Adrian, the identity/overlap cut-off are rather relaxed because usually you don't know what is the level of divergence between your heterozygous regions up front. Users are free to use more restrictive thresholds (--identity/--overlap parameters).

What I've noticed in simulations is that using relaxed identity/overlap cut-off works fine, unless you expect some segmental duplications that may be removed by such relaxed cut-off. These will appear most likely as gaps in your assembly, anyway.

1
Entering edit mode

hi, as you said, there is no good solution for heterozyogous genomes. my question is : there are lots of repeat regions in heterozygous genomes which are included in redundancy. if you remove them, that means you will lose lots of genetic information, doesn't that?

0
Entering edit mode

Yes, reduction step will remove contigs representing highly similar repeats. but if you have mate pair libraries you will recover these missing regions during scaffolding step. From simulations with A. thaliana genome (~25% of which is repeats) I've observed reduction of the assembly to ~80-90% of expected size, but after scaffolding with mate pairs the assembly was completely fine (100% of expected size). The manuscript is in revision, so stay tuned. btw: post to Redundans mailing list for more details.

1
Entering edit mode

These are interesting results but A. thaliana is highly selfing and has a tiny genome. I'm not sure how/if these findings can be directly related to the question about repetitive DNA and heterozygosity issues, or more specifically, to typical plant genomes. For a genome like A. thaliana where you can assemble 90% of the expected size, I would say there is little to be gained by any k-mer reduction/normalization or further scaffolding. I know it is a demonstration of the method and there is the practical consideration that you can't assemble large genomes easily by any method. Though, it would be nice to see some kind of broader application of these methods for comparison.

0
Entering edit mode

good point. we are involved in very challenging genomes (2-3Gb plants), but unfortunately for these challenging genomes there is no good reference to compare with...

0
Entering edit mode

You are right of course, the current lack of genomes in that size range makes it very difficult to compare methods. Though, I think this will change very soon due to PacBio and a number of new hybrid sequencing and optical mapping approaches.

0
Entering edit mode

yes, finally, you might get expected size through scaffolding with mate pairs, but i suppose we would get more N's filling these repeats, isn't it?

best, Elva

0
Entering edit mode

the gaps can be filled ie using GapCloser, removing most of Ns from the assembly

0
Entering edit mode

I am trying to assemble an invertebrate genome and it is known to be highly heterozygous. From what I understand, assembly of a highly heterozygous genome should result in a larger than estimated genome size (whereby Redundans can work on reducing the redundancy). However, in my case, my resulting assembly is a lot smaller than the expected size with only 40% of total reads mapping back to the assembly.

From my total Illumina reads, I expect to have 50x coverage of the genome - and this coverage is further lowered according to homozygous (1C peak ~13x) and heterozygous (2C peak ~28x) regions. Based on your experience with polymorphic genomes, do you think lack of coverage is the reason for the incomplete smaller assembly size?

0
Entering edit mode

you can post your questions regarding redundans / heterozygous genome assembly in https://groups.google.com/forum/#!forum/redundans

give more details, what is the species, expected genome size, what kind of libraries / assembler you are usings etc

5
Entering edit mode
10.1 years ago

Erich Schwarz and some other nematode genomicists have had good luck applying digital normalization to the reads first; see the khmer handbook here: http://ivory.idyll.org/blog/an-assembly-handbook-for-khmer.html. You might try that. I'd be happy to correspond with you if you have any trouble. --titus

0
Entering edit mode

This approach looks very interesting. I would like to see how this and Hapsembler, or other approaches, compare.

0
Entering edit mode

I will definitely give it a try:) Thanks, will reply shortly.

3
Entering edit mode
10.1 years ago

Vinson et al. adapted ARACHNE to assemble the polymorphic Ciona savignyii genome a few years back, and their recipe is detailed in this paper: http://www.ncbi.nlm.nih.gov/pubmed/16077012

It looks like BROAD provide documentation for how to run ARACHNE on polymorphic genomes (and specifically mention fungal genomes) here: http://www.broadinstitute.org/crd/wiki/index.php/Polymorphism

0
Entering edit mode

the wiki is useful. but I was unable to run ARACHNE. is it possible to run it outside BROAD?

2
Entering edit mode
7.8 years ago
amvarani ▴ 20

I have been using a combination of platanus and redundans with notable results.

My genome is auto or allo tetraploid, heterozygous (~3%) and has 320-330Mb (160Mb 1C).

I have made 180M of reads 2x 100bp PE (350pb insert size) and 120M 2x100 Mate pair (3kb insert size). I have used platanus in a first stage (assemble, scaffold and gap_close) with the -u parameter set to 1.0, and starting the k with 21. (I have tested all the assemble combinations, and that ones gave to me the best results, considering largest scaffold, N50 and number of contigs).

My result in this first stage is a high fragmented assembly, with more than 1M contigs, with N50 less than 1kb.

However, 160Mb of the assembly is contained in contigs larger than 500p (largest scaffold with 950kb). If these contigs with less than 500bp are excluded, my N50 increase to 120kb. This means that platanus did a very good job merging the haplotypes (certainlly due the -u 1.0 parameter).

In the second step, I have used redundans with my fragmented assembly (all contigs) and the results are very interesting! My final assembly size is 150Mb with N50 of 380kb and largest scaffold of 3,4M !! This using default redundans parameters. Now I'm going to play with --identity, --overlap, --minLength, --limit and --iters parameters.

I'm really pleased with that!

0
Entering edit mode

Great to hear that! Thanks for testing & sharing. Let me know if you experience any difficulties.

0
Entering edit mode

I recommend looking at the stats (largest contig/n50) after redundans gets rid of contigs that are redundant, just to have an idea what percent was extra. Then the rest of the job is done with SSPACE and GapCloser.

0
Entering edit mode

That's right, Adrian, but you may also see improvement in Redundans from using BWA MEM instead of bowtie for scaffolding and of course multiple iterations of scaffolding/gap closing.

0
Entering edit mode

Agreed for sure!

0
Entering edit mode

Hi amvarani,

I am using platanus to assemble two pair end libraries and one mate pair library for scaffolding. Do I need to reverse complement the mate pair library before assemble. Thank you!

Hui

0
Entering edit mode

Hi Hui,

In general the mate-pair libraries made with illumina nextera are in outie orientations (i.e <------ ------>). If you have that kind of libraries you don't need to do anything.

If you have a combined fastq (interleaved) file you must use the -op , -ip option (lowercase), if you have both paired files you should use -OP -IP (uppercase) . I have noted that depending of file name platanus has a strange bug which results in a wrong recognition of the files.

For instance, your both fastq (for -OP option mate-pairs) files must have exactly the same name AND containing letters and numbers (except to distinguish the pairs). I would suggest something like Library003_MP1.fastq and Library003_MP2.fastq.

0
Entering edit mode

Hi amvarani,

Thank you so much for helping! I have two pair end libraries and one mate pair library.

I used my normalized fastq files to assemble, but it alway said 'Read file exception!!'

Below is my script and the error. How to fix it? Thanks a lot!

/diag/home/hzz0036/software/platanus assemble -k 21 -u 1.0 -f /diag/home/hzz0036/goosegrass_genome/clc_trimmed_genome/AU_normalized/DNA-1_CGATGT_L002_R1_001_paired_trimmed_paired_1.fastq.normalized_K25_C30_pctSD200.fq /diag/home/hzz0036/goosegrass_genome/clc_trimmed_genome/AU_normalized/DNA-1_CGATGT_L002_R1_001_paired_trimmed_paired_2.fastq.normalized_K25_C30_pctSD200.fq /diag/home/hzz0036/goosegrass_genome/clc_trimmed_genome/PBU_normalized/SM01-PBU1_GTCCGC_L005_R1_001_paired_trimmed_paired_1.fastq.normalized_K25_C30_pctSD200.fq /diag/home/hzz0036/goosegrass_genome/clc_trimmed_genome/PBU_normalized/SM01-PBU1_GTCCGC_L005_R1_001_paired_trimmed_paired_2.fastq.normalized_K25_C30_pctSD200.fq /diag/home/hzz0036/goosegrass_genome/PBU_7k_normalized/SM01-PBU1-7k_CCGTCC_L005_R1_001_paired_trimmed_paired_1.fastq.normalized_K25_C30_pctSD200.fq /diag/home/hzz0036/goosegrass_genome/PBU_7k_normalized/SM01-PBU1-7k_CCGTCC_L005_R1_001_paired_trimmed_paired_2.fastq.normalized_K25_C30_pctSD200.fq -o /diag/home/hzz0036/goosegrass_genome/platanus

Error(2): Read file exception!!
Read file is not FASTA/FASTQ format.


Best regards,
Hui

0
Entering edit mode

Hi Hui,

Try a shorter name for your files, like as K25_C30_pctSD200.fastq

0
Entering edit mode

Hi amvarani,

I was interested about your experience with Platanus software, and I am starting to use Platanus and it gives me this error:

platanus assemble -f AS_R1.clean.cor.fastq AS_R2.clean.cor.fastq \
AS285_S40_L007_R1.clean.cor.fastq \
AS285_S40_L007_R2.clean.cor.fastq \
AS485_R1.clean.cor.fastq AS485_R2.clean.cor.fastq \
AS675_R1.clean.cor.fastq AS675_R2.clean.cor.fastq \
-t 20 -o /public/home/zpxu/AS/results/platanus_quake/AS \
-k 31 -m 1128 -u 0.3 -d 0.3 -a 5

K = 31, saving kmers from reads...
Error(5): Kmer distribution exception!!
Kmer distribution cannot be caluculated proper.


I have tried many times for different k and more big m value, but it also gives me the same error. So, I really don't know why and how to solve this problem. Could you help me ? Thanks!

Best regards,
Zhongping Xu

0
Entering edit mode

With Platanus, did you also set these options for scaffold and gapclose ?

Platanus scaffold -s 21 -v 21 -u 1
Platanus gap_close -s 21 -k 21 -vo 21 -vd 21

1
Entering edit mode
10.1 years ago
Rayan Chikhi ★ 1.5k

You might want to take a look at Hapsembler (From the web page: Hapsembler is a haplotype-specific genome assembly toolkit that is designed for genomes that are rich in SNPs and other types of polymorphism.).

1
Entering edit mode

tried it, played with parameters, corrected reads, but with no success... the assembly is quite similar to Newbler one

1
Entering edit mode
10.1 years ago
Rayan Chikhi ★ 1.5k

BGI reported that the scaffolding module of SOAPdenovo2 has been improved to deal with heterozygosity. Did you try to run the full Soapdenovo2 pipeline?

1
Entering edit mode

SOAPdenovo2 seem to work really good! It uses multiple k in one run and is very fast. It return separate contig for each haplotype. I have to investigate the quality further.

1
Entering edit mode

the assembly is still quite fragmented. but got some ideas to improve it further :)

1
Entering edit mode
1. Have you tried Planatus? I used this software for my 6% heterozygous genome and it improve scaffold continuity against SOAPdenovo2

2. Have you published your manuscript yet? If you did, will you put a link here? I think people will be interested in playing with the data.

0
Entering edit mode

I did use the platanus assemble,which help us a lot for the assemble. But could you publish the manuscript? cannot wait!

0
Entering edit mode

@kevinchjp

1. yes, platanus works much better with my data. although I still see room for improvement;)
2. the manuscript is in the production... hopefully early next year. I'll keep you posted!
1
Entering edit mode
7.9 years ago
rtliu ★ 2.2k

Platanus Genome Assembler was designed for genomes of higher heterozygosity.

The paper was published on Genome Biology in 2014.

0
Entering edit mode

In fact I've tried Platanus. It works well, especially for divergence between heterozygous regions <10%. But I found Redundans returning less fragmented assemblies and it recognise heterozygous regions with divergence up to 45%.

1
Entering edit mode
7.8 years ago
mikael.dahl ▴ 20

Hi, I do also struggle with a heterozygous genome. I found your post and your recommendation to use the -u 1.0 flag in platanus. Did you look into how it affects your assembly?

In the readme file for platanus is says:

-u FLOAT: Maximum difference for bubble crush (identity, default 0.1) Larger FLOAT increases the number of bubbles merged. If heterozygosity of the sample is high, large FLOAT may be suitable (Ex. -u 0.2).

So I guess that using 1.0 means that all bubbles in the graph will be collapsed, but how is the choise between the two haplotypes made? quality on reads? Since the genome is heterozygous in those positions both paths of the bubble is true and if we randomly choose to keep only one there will be tons of reads (if we do an RNA seq experiment and use the genome as a reference) that not will match because we collapsed the allele being expressed.

I mean if the authors of an assembler that is specialized on heterozygous genomes recommend to increase the -u value to 0.2 in case of high heterozygosity it feel kind of extreme to use 1.0.

0
Entering edit mode

This should be added as a comment to the answer it relates to. You raise good points for discussion but it is confusing if out of place.

0
Entering edit mode

Good point Mikael, I agree with you, however I have tested -u from 0.1 to 2.0. For each assembly, were evaluated: Assembly size, scaffold N50, Contig N50, largest Contig and number of contigs. I have found better results using the -u 1.0.

For instance using -u 0.2, also give good results, but with -u 1.0 retrieved the largest scaffolds and also N50. I also have RNAseq (single end 200bp) data and about of 94% of all RNAseq reads map in this assembly, which is pretty high rate!

Regarding the bubbles, Platanus generate a bubble fasta file. Looking this file (BLASting, in fact) I have noticed that most of the bubbles are mobile genetic elements, like as LTR retrotranspons and also microsatelaties or telomer and centromer derived . Then I decided to collapse LTR retros all and try to link scaffolds with several platanus scaffold iterations (using the same -u 1.0 parameter and also -s 16 and -v 16), and then redundans. After redundans step, I did again several scaffolds iterations with platanus. I have also disabled the GapCloser and SSPACE step on redundans (In fact I have used only the reductions step)

After these steps, I have mapped all my reads back to the assembly (about of 90% mapped). I have now assembled the rest of the reads (10%) without coverage cufoff disabled with SPAdes 3.5.0, which gave to me some contigs/scaffolds which I have added to the main assembly, and then some more scaffold iterations with platanus, and then another redundads step (only reduction).

In my final file I have used GapFiller with 10 iterations.

Doing that, at the moment I got a N50 of 3,5Mb and Contig N50 of about 1Mb.

This receipt worked for my genome, maybe be useful for the colleagues here.

Some features which can be implemented in redundans:

1. For the output a fasta file with the redundans contigs/scaffold (very important!!! with this file we can figure out the repeats type found in your genome);

2. Option to choose SSPACE or Platanus Scaffold;

3. Option to choose GapCloser or GapFiller (or maybe both!). I would also recommend the platanus gap_closer as a option;

0
Entering edit mode
7.8 years ago
amvarani ▴ 20

Hi there

Other good scaffolding option is the scafmatch software.

0
Entering edit mode

ScaffMatch requires SAM-formatted input independently from F and R reads, which is a bit problematic, but I'll try it. Thanks for hint!

0
Entering edit mode

Hi, ScaffMatch supports sam and fastq