Question: plant genome assembler?!
2.1 years ago by
Prasad1.5k
India
Prasad1.5k wrote:

Hi all,

This might be a repeat question, but I couldn't find a better solution. I am working on an aromatic rice genome (~500 Mb) and have Illumina HiSeq data (~300 M reads, 2 × 150 bp). So far I have tried ABySS, IDBA-UD, Platanus, SOAP and MaSuRCA. Of these, IDBA-UD has given the best result (1.29 M scaffolds, N50 of 1,857 bp, 598 Mb total). MaSuRCA, which performed well in its paper, is not working well in my case (not that it necessarily has to). My question: are there any other tools I could use? I have already tried a few from Assemblathon 2.

Any suggestions are appreciated.

Thanks
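
For readers comparing the numbers quoted above, here is a minimal sketch of how N50 and expected read coverage are usually computed. The function names and the toy contig lengths are made up for illustration. The coverage example assumes that "~300 M reads" counts individual 150 bp reads; if it counts read pairs, the figure doubles.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths given")


def expected_coverage(n_reads, read_length, genome_size):
    """Average depth = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


if __name__ == "__main__":
    # Toy contig lengths (hypothetical), not taken from any real assembly.
    toy_lengths = [8, 8, 4, 3, 3, 2, 2, 2]
    print(n50(toy_lengths))  # 8

    # Assumption: 300 M single reads of 150 bp on a ~500 Mb genome.
    print(expected_coverage(300_000_000, 150, 500_000_000))  # 90.0
```

In practice the lengths would be read from the assembler's FASTA output. An N50 under 2 kb at this depth points to repeats or library problems rather than to a lack of data.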

assembly plant • 1.3k views
ADD COMMENTlink modified 2.1 years ago by colindaven840 • written 2.1 years ago by Prasad1.5k

Not an answer to your question, just an idea about assembling "small" genomes: why don't you get yourself a MinION (Oxford Nanopore) and get a better assembly with some nice long reads? The initial investment is quite small, and one sequencing run (about 600 dollars) will give you about 15-20x coverage of this genome. Whether it is worthwhile obviously depends on how often you would need to do this and on the assembly quality you require.

[Disclaimer: I'm a customer of Oxford Nanopore sequencing but have no other links to the company]

ADD REPLYlink written 2.1 years ago by WouterDeCoster35k

I think maybe the cost and/or error rate?!

ADD REPLYlink written 2.1 years ago by Medhat8.0k

Scaffolding a genome with long reads combined with high-quality short reads is not an uncommon approach. Besides, read accuracy is ~95%, which is quite okay.

ADD REPLYlink written 2.1 years ago by WouterDeCoster35k

I know. Especially for highly repetitive or complex genomes, combining long reads (PacBio or Nanopore) with short reads gives the best result, but again it all depends on the funding and the project scope (as you said, with enough money I could get 20x coverage from any long-read platform and everything would be fine).

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Medhat8.0k

@WouterDeCoster - thanks for the suggestion. In my current situation Nanopore is not an option. I have also read that the error rate is a bit high.

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by Prasad1.5k

Mira and w2rap-contigger

ADD REPLYlink written 2.1 years ago by Medhat8.0k

If you need a decent assembly, you would need other sequencing strategies as Wouter mentioned. I have had some experience with PacBio and it gave some pretty decent assemblies. The actual problem would be in cleaning and finishing the final assembly. Usually you need mate-pair sequences with different insert sizes along with paired-end data for a start.

ADD REPLYlink written 2.1 years ago by Rohit1.3k

I do have 60 M of mate-pair data (5-7 kb, NextSeq). I was hoping for a better result at the contig level, given the fairly high short-read coverage.

ADD REPLYlink written 2.1 years ago by Prasad1.5k

The DISCOVAR assembler works well if you have overlapping paired-end libraries. Give it a try. Some more pre-processing steps would also help to achieve a better N50, though probably not more than about 1 kb above where you already are.

Did you try error-correcting the reads and merging the overlapping paired-end data? This gives the assembler much more information for building contigs.

ADD REPLYlink written 2.1 years ago by Rohit1.3k

Thanks, I will try DISCOVAR.

ADD REPLYlink written 2.1 years ago by Prasad1.5k

If possible you should try MIRA too. It performs really well for genome sizes up to 400 Mb, though I have to say its memory consumption is high.

ADD REPLYlink written 2.1 years ago by Rohit1.3k

Memory was the reason I did not try it. I will give it a shot and hope it works.

ADD REPLYlink written 2.1 years ago by Prasad1.5k
2.1 years ago by
colindaven840
Hannover Medical School
colindaven840 wrote:

SOAPdenovo2 should work quite well if you have long range information, for example LJD or Mate pair libraries. If not, assembly will always be a major struggle with plant genomes.

Long reads are quite challenging to use for scaffolding plant genomes (e.g. SSPACE-LongRead is decent but very slow), but they have a lot of potential. Hybrid assembly approaches are also challenging; one of the best in my experience is DBG2OLC + Racon.

I can't see you getting very much better assemblies with paired end data alone.

Best of luck, Colin

ADD COMMENTlink written 2.1 years ago by colindaven840

I do have 60 M of mate-pair data (5-7 kb, NextSeq). I was hoping for a better result at the contig level, given the fairly high short-read coverage.

ADD REPLYlink written 2.1 years ago by Prasad1.5k

Ah, great. Well, make sure your insert sizes are ok on the mate-paired data and you are configuring the algorithms with the correct orientation. This gets messed up a lot with mate pairs.

You can check the mate-pair insert size distribution by aligning the reads (e.g. with bwa or bowtie) to the reference genome, then checking the resulting BAM with bamtools:

    bamtools stats -in x.bam -insert
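
The same check can be sketched in plain Python on SAM text, which also shows how read orientation can be inferred from the alignment flags. This is only an illustration, not a replacement for bamtools: `summarize_pairs` and the three example records are hypothetical, and the orientation labels (FR for inward-facing, RF for outward-facing) follow a common naming convention. Which orientation to expect depends on the library protocol and on any preprocessing of the reads.

```python
from collections import Counter
from statistics import median


def summarize_pairs(sam_lines):
    """Summarize insert sizes and orientations from SAM records.

    Only first-in-pair primary alignments with both mates mapped to the
    same reference are counted, so each pair contributes once.
    """
    sizes = []
    orientations = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:              # not a full alignment record
            continue
        flag = int(fields[1])
        pos, rnext, pnext, tlen = int(fields[3]), fields[6], int(fields[7]), int(fields[8])
        if not flag & 0x1 or not flag & 0x40:       # paired, first in pair
            continue
        if flag & (0x4 | 0x8 | 0x100 | 0x800):      # unmapped/secondary/supplementary
            continue
        if rnext != "=":                            # mates on different references
            continue
        read_rev, mate_rev = bool(flag & 0x10), bool(flag & 0x20)
        # Strand of the leftmost and rightmost mate on the reference.
        left_rev, right_rev = (read_rev, mate_rev) if pos <= pnext else (mate_rev, read_rev)
        if not left_rev and right_rev:
            orientations["FR"] += 1
        elif left_rev and not right_rev:
            orientations["RF"] += 1
        else:
            orientations["same-strand"] += 1
        sizes.append(abs(tlen))
    return {
        "pairs": len(sizes),
        "median_insert": median(sizes) if sizes else None,
        "orientations": dict(orientations),
    }


if __name__ == "__main__":
    # Hypothetical records; real input would come from e.g. `samtools view`.
    example = [
        "@SQ\tSN:chr1\tLN:100000",
        "p1\t97\tchr1\t100\t60\t50M\t=\t5100\t5050\t*\t*",
        "p2\t81\tchr1\t100\t60\t50M\t=\t6000\t5950\t*\t*",
        "p2\t161\tchr1\t6000\t60\t50M\t=\t100\t-5950\t*\t*",
    ]
    print(summarize_pairs(example))
```

If most pairs fall far outside the expected 5-7 kb range, or the dominant orientation is not the one the assembler was configured for, the library settings are worth revisiting before blaming the assembler.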

It makes sense to carefully curate your reads with read trimming (using your favourite tool) and duplicate removal (e.g. with BBMap's dedupe.sh). If the mate-pair library was poor, you will have many (>80%) duplicates.
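
To get a rough feel for the duplicate rate before running a dedicated tool, the fraction of exact duplicate pairs can be estimated as below. This is a deliberately simplified sketch: it only counts pairs whose two sequences are byte-for-byte identical, which is much cruder than what dedupe.sh does, and the function names are invented for this example. It also keeps every distinct pair in memory, so on a 60 M library it is best run on a subsample.

```python
import io
from collections import Counter


def read_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip()


def duplicate_fraction(r1_handle, r2_handle):
    """Fraction of read pairs that exactly repeat an earlier pair."""
    counts = Counter(zip(read_sequences(r1_handle), read_sequences(r2_handle)))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (total - len(counts)) / total


if __name__ == "__main__":
    # Tiny in-memory FASTQ files (hypothetical reads) standing in for R1/R2.
    r1 = io.StringIO("@a\nACGT\n+\nIIII\n@b\nACGT\n+\nIIII\n@c\nACGT\n+\nIIII\n@d\nGGGG\n+\nIIII\n")
    r2 = io.StringIO("@a\nTTGA\n+\nIIII\n@b\nTTGA\n+\nIIII\n@c\nCCCC\n+\nIIII\n@d\nTTGA\n+\nIIII\n")
    print(duplicate_fraction(r1, r2))  # 0.25
```

Real files would be opened with `open()` (or `gzip.open(..., "rt")` for compressed FASTQ) instead of `io.StringIO`.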

ADD REPLYlink written 2.1 years ago by colindaven840