4.2 years ago
mks002 ▴ 190

I have 160 Gb of R1 reads and 160 Gb of R2 reads from small-insert and long-insert Illumina sequencing (2 × 151 bp). I have tried several assemblers on this medicinal plant genome.

First I tried ABySS, then Platanus, and finally SOAP, but none of them gave the desired result.

Does anyone have an idea how to proceed with this data? I have not tried IDBA, since it takes a lot of time and is suited to genomes of around 500 Mb.

I have around 1.2 TB of disk space and 750 GB of RAM.

Please help me in this regard.

What is your desired result? Did you try any gap-closing software after assembly? What is the average coverage?

The finished genome assembly should be about 1.5 Gb.

The assembly is not satisfactory, so there is no point in going further and doing gap closing.

With SOAP I am getting 650 Mb of scaffold sequence (656,762,102 bp) with an N50 of 191 and a total of 19,888,112 non-ATGC characters. The average coverage is 100X.
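
For readers comparing the numbers above, here is a minimal Python sketch of how N50 is usually computed from scaffold lengths. The function name and the toy lengths are made up for illustration and are not taken from this assembly.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


# Toy scaffold lengths (hypothetical, for illustration only).
toy_lengths = [100, 80, 60, 40, 20]
print(n50(toy_lengths))  # prints 80
```

An N50 of 191 bp therefore means that half of the assembled bases sit in scaffolds of 191 bp or shorter, which is barely longer than a single 151 bp read.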

Coverage? By long insert, do you mean jumping libraries (mate pairs)?

Coverage is ~100X. No, it is not mate-pair.

You have very good coverage. But, as you can see from the other answers, mate pairs play a crucial role in detecting structural variations and rearrangements, and also in forming contigs.

Did you try any of the other tools suggested in the answers?

I am assuming (for obvious reasons) that by "desired results" you mean the number of scaffolds, the average scaffold size, and gaps (maybe!). Do you have any idea about the genome of "the" medicinal plant? If not, how are you comparing or judging your assembly?

You may try using Minia or ALLPATHS-LG.

Yes, correct: the number of scaffolds is too high and the N50 is very low across the different assembly tools I used. The genome we are expecting is 1.5 Gb.

You realize that having plenty of sequence data (if that is what you are basing your question on) is no guarantee of a successful or useful assembly. Plant genomes are notoriously difficult to work with because of ploidy issues and the like (do you expect a simple diploid genome?), so this result may not be unexpected.

I am trying ALLPATHS-LG, but it is giving an error.

I have given the ploidy as 1.

Is it necessary to have a mate-pair library?

1. How do you know your genome size?
2. Have you tried to estimate a genome size from reads? (for example using https://github.com/gmarcais/Jellyfish).

The genome size was determined by flow cytometry.

At the read level, I ran KmerGenie; the results are below:

Predicted best k: 101

Predicted assembly size: 1625708994 bp
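
As a rough cross-check on estimates like the one above, genome size can also be approximated from a k-mer histogram (for example the two-column "depth count" output of `jellyfish histo`). The sketch below is a minimal Python version of the usual back-of-the-envelope method: total k-mers divided by the depth of the main peak. The function name, the `min_depth` error cutoff, and the toy histogram are assumptions for illustration; tools that fit a model to the histogram will be more reliable on heterozygous or repeat-rich genomes.

```python
def estimate_genome_size(histo_lines, min_depth=5):
    """Estimate genome size from k-mer histogram lines of the form
    'depth count'. K-mers below min_depth are treated as sequencing
    errors and ignored (a crude, assumed cutoff)."""
    hist = []
    for line in histo_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        hist.append((int(parts[0]), int(parts[1])))

    solid = [(depth, count) for depth, count in hist if depth >= min_depth]
    if not solid:
        raise ValueError("no k-mers at or above min_depth")

    # Depth of the tallest bin among the retained k-mers.
    peak_depth = max(solid, key=lambda dc: dc[1])[0]
    # Total number of k-mer occurrences in the retained bins.
    total_kmers = sum(depth * count for depth, count in solid)
    return total_kmers // peak_depth, peak_depth


# Toy histogram (hypothetical values, not from this dataset).
toy_histo = ["1 1000", "2 200", "9 50", "10 100", "11 50", "20 10"]
size, peak = estimate_genome_size(toy_histo)
print(size, peak)  # prints 220 10
```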

Hi. First, you should find out what percentage of your genome is repetitive and whether your sample is an inbred line. Several parameters affect the assembly result, so you should try different settings. Pre-processing, such as trimming reads, removing contamination, and error correction, also affects the result. You can try ALLPATHS-LG or MaSuRCA; they usually give good results, but they will probably need more than 1.2 TB of space. You can also try Platanus with the parameter u 1.0, then scaffold with SSPACE and run gap-closer. The scaffolding and gap-closing steps can be run for multiple iterations.

4.2 years ago
arnstrm ★ 1.8k

Well, to begin with, it looks like you don't have a mate-pair library. Without one, don't expect good assembly stats, especially for a plant genome; it is the bare minimum for a plant genome assembly. For a decent publishable genome (in a good journal), you may even need additional data such as optical maps, Hi-C and/or long reads (PacBio). That said, I recommend first running GenomeScope on your data. It will tell you right away what you're dealing with: heterozygosity, error rate, repeat content, ploidy and estimated coverage. From this you can decide whether to proceed with the assembly or wait for more data. In general, for a highly heterozygous genome, I recommend the Platanus assembler, followed by scaffolding (with mate pairs) and Redundans for gap filling. This gave us a great assembly for a 2.5 Gb highly heterozygous plant genome (with ~80% repeat content). MaSuRCA does really well for inbred lines (but you need mate pairs). I recently came across w2rap-contigger, but I don't know how it compares. I hope this helps!

Keep in mind that the optimal coverage depth for Platanus is roughly 80X or more, and the coverage here was not clear.

A side question: what was the coverage in your case?

Coverage is ~100X. Now I have mate-pair data too (25 million reads).
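
To put read counts like these in terms of depth, sequence coverage is simply the number of reads times the read length divided by the genome size. The Python sketch below applies that to the mate-pair library; the 151 bp read length and the 1.5 Gb genome size are assumptions borrowed from earlier in the thread, since the mate-pair read length is not stated here.

```python
def sequence_coverage(n_reads, read_length_bp, genome_size_bp):
    """Return sequence coverage (fold) = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


# Assumed inputs: 25 million reads, 151 bp each, 1.5 Gb genome.
mate_pair_cov = sequence_coverage(25_000_000, 151, 1_500_000_000)
print(f"{mate_pair_cov:.1f}X")  # prints 2.5X
```

Note that this is base-level coverage only; for scaffolding, the span covered by the inserts matters more, and that depends on the insert size, which is not given.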

I have tried MaSuRCA, but it is giving an error.

Error > ./assemble.sh: line 94: 22731 Aborted (core dumped) quorum_create_database -t 46 -s $JF_SIZE -b 7 -m 24 -q$((MIN_Q_CHAR + 5)) -o quorum_mer_db.jf p1.renamed.fastq p2.renamed.fastq p3.renamed.fastq p4.renamed.fastq

I had 1.2 TB of disk space and 774014 MB of RAM during the run. In the config file I set "GRAPH_KMER_SIZE = auto" and left the other options at their defaults.

Can you suggest what is causing the error and which other options I should look into?

You can edit the assemble.sh script and decrease $JF_SIZE; at its current value it needs much more than you have. But to be honest, MaSuRCA will take more than 1.2 TB.

You need much more space than 1.2 TB. If I remember correctly, by the time it creates the work1 directory it will already be close to 1 TB. Run du -sh masurca_run_dir to see how much space it is using, and df to see how much space is available on your machine. I doubt MaSuRCA will do any better without mate-pair libraries.
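
If you want to watch this from a script rather than running the commands by hand, here is a minimal Python sketch of the same two checks. The function names and the idea of a `required_bytes` threshold are assumptions for illustration. Also note that `os.path.getsize` reports apparent file size, so the total can differ somewhat from what `du` reports for allocated blocks.

```python
import os
import shutil


def dir_size_bytes(path):
    """Sum the apparent sizes of all regular files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # do not follow symlinks
                total += os.path.getsize(full)
    return total


def free_bytes(path):
    """Return free space on the filesystem that contains path."""
    return shutil.disk_usage(path).free


def has_headroom(run_dir, required_bytes):
    """True if the filesystem holding run_dir has at least required_bytes free."""
    return free_bytes(run_dir) >= required_bytes


# Example: report usage and free space for the current directory.
print(dir_size_bytes("."), free_bytes("."))
```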

