Question: Tool for PacBio de novo assembly of a diploid plant genome (size ~500 Mb) with around 52x coverage
Sudhir Jadhao (India) wrote, 3.8 years ago:

Hello everyone,

I have a query regarding tools for de novo assembly of PacBio data. I have plant genome data at 52x coverage. The genome size is around 500 Mb.

I used the HGAP tools (RS_HGAP_assembly2, RS_HGAP_assembly3, RS_preassembly 2) through SMRT Portal, but they give me the error "ERROR! Reading fasta files greater than 4Gbytes is not supported". It seems HGAP does not support a genome this large.

Then I used Falcon. It ran successfully, but in the assembly folder all files are empty except preads.ovl and 2-asm-falcon/run_falcon_asm.sh.log.

Can you please suggest a tool for assembling data with 52x coverage and a predicted genome size of 500 Mb?

Thank you

rhall (United States) wrote, 3.8 years ago:

Falcon is probably the best option. Rather than trying other assemblers, I would try to diagnose what went wrong with the Falcon run; it's likely just a parameter problem.

What is the output of this command: DBstats ./1-preads_ovl/preads.db

With 52x of PacBio data, a hybrid assembly with something like DBG2OLC would not add anything.


Thank you for the kind reply.

I reran Falcon on the whole data set. It ran successfully, but the output files are only kilobytes in size.

DBstats output for preads.db:

files =       106
       3749 out.00001 prolog
       7455 out.00002 prolog
      12044 out.00003 prolog
      18311 out.00004 prolog
      24536 out.00005 prolog
      30657 out.00006 prolog
      36772 out.00007 prolog
      42550 out.00008 prolog
      46543 out.00009 prolog
      50505 out.00010 prolog
      55975 out.00011 prolog
      63720 out.00012 prolog
      71217 out.00013 prolog
      75404 out.00014 prolog
      79358 out.00015 prolog
      83326 out.00016 prolog
      87104 out.00017 prolog
      90888 out.00018 prolog
      94756 out.00019 prolog
      98699 out.00020 prolog
     102804 out.00021 prolog
     106537 out.00022 prolog
     110015 out.00023 prolog
     113750 out.00024 prolog
     117753 out.00025 prolog
     122021 out.00026 prolog
     126277 out.00027 prolog
     129746 out.00028 prolog
     132839 out.00029 prolog
     136150 out.00030 prolog
     140462 out.00031 prolog
     144600 out.00032 prolog
     148781 out.00033 prolog
     152037 out.00034 prolog
     155257 out.00035 prolog
     158580 out.00036 prolog
     162379 out.00037 prolog
     166088 out.00038 prolog
     169529 out.00039 prolog
     172859 out.00040 prolog
     176340 out.00041 prolog
     180329 out.00042 prolog
     184182 out.00043 prolog
     187650 out.00044 prolog
     191011 out.00045 prolog
     194399 out.00046 prolog
     197757 out.00047 prolog
     201247 out.00048 prolog
     204729 out.00049 prolog
     208079 out.00050 prolog
     211316 out.00051 prolog
     214869 out.00052 prolog
     218730 out.00053 prolog
     222538 out.00054 prolog
     226689 out.00055 prolog
     230930 out.00056 prolog
     235133 out.00057 prolog
     239246 out.00058 prolog
     243418 out.00059 prolog
     248062 out.00060 prolog
     252833 out.00061 prolog
     257530 out.00062 prolog
     262077 out.00063 prolog
     266686 out.00064 prolog
     271244 out.00065 prolog
     275749 out.00066 prolog
     280553 out.00067 prolog
     285516 out.00068 prolog
     290487 out.00069 prolog
     295389 out.00070 prolog
     300290 out.00071 prolog
     304943 out.00072 prolog
     309542 out.00073 prolog
     314121 out.00074 prolog
     319207 out.00075 prolog
     324500 out.00076 prolog
     328657 out.00077 prolog
     332975 out.00078 prolog
     337067 out.00079 prolog
     341947 out.00080 prolog
     346923 out.00081 prolog
     351301 out.00082 prolog
     355614 out.00083 prolog
     360051 out.00084 prolog
     364202 out.00085 prolog
     368858 out.00086 prolog
     373229 out.00087 prolog
     377188 out.00088 prolog
     381736 out.00089 prolog
     386442 out.00090 prolog
     390553 out.00091 prolog
     395058 out.00092 prolog
     399367 out.00093 prolog
     403795 out.00094 prolog
     408248 out.00095 prolog
     412911 out.00096 prolog
     417611 out.00097 prolog
     422363 out.00098 prolog
     427270 out.00099 prolog
     432073 out.00100 prolog
     436938 out.00101 prolog
     441737 out.00102 prolog
     446443 out.00103 prolog
     451264 out.00104 prolog
     456158 out.00105 prolog
     458034 out.00106 prolog
blocks =        13
size =       200 cutoff =       500 all = 0
         0         0
     29554     29554
     59672     59672
     89818     89818
    128917    128917
    166272    166272
    202887    202887
    241204    241204
    281728    281728
    322513    322513
    363304    363304
    404378    404378
    444177    444177
    458034    458034

 

Config file

[General]
# list of files of the initial bas.h5 files
input_fofn = input.fofn
#input_fofn = preads.fofn

input_type = raw
#input_type = preads
job_type= local

# The length cutoff used for seed reads used for initial mapping
length_cutoff = 12000

# The length cutoff used for seed reads used for pre-assembly
length_cutoff_pr = 12000


jobqueue = your_queue
sge_option_da = -pe smp 8 -q %(jobqueue)s
sge_option_la = -pe smp 2 -q %(jobqueue)s
sge_option_pda = -pe smp 8 -q %(jobqueue)s
sge_option_pla = -pe smp 2 -q %(jobqueue)s
sge_option_fc = -pe smp 24 -q %(jobqueue)s
sge_option_cns = -pe smp 8 -q %(jobqueue)s

pa_concurrent_jobs = 32
ovlp_concurrent_jobs = 32

pa_HPCdaligner_option =  -v -dal24 -t16 -e.70 -l1000 -s1000
ovlp_HPCdaligner_option = -v -dal24 -t32 -h60 -e.96 -l500 -s1000

pa_DBsplit_option = -x500 -s200
ovlp_DBsplit_option = -x500 -s200

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --local_match_count_threshold 2 --max_n_read 200 --n_core 6

overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 20 --bestn 10 --n_core 40

 

Reply written 3.8 years ago (modified 3.8 years ago) by Sudhir Jadhao

How many bases are in the 1-preads_ovl/preads4falcon.fasta file? Two things stand out as needing a change. First, if the preads4falcon.fasta file does not contain more than 15x of the expected genome size, the length_cutoff and length_cutoff_pr parameters should be decreased; by how much depends on your library quality and subread size. Second, --min_cov in overlap_filtering_setting needs to be changed; I would set it to 2.
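The base count asked about above can be checked with a few lines of Python. This is only a minimal sketch: the function names, the toy records, and the 500 Mb genome size constant (taken from the question) are illustrative assumptions, not part of Falcon.

```python
GENOME_SIZE = 500_000_000  # expected genome size from the question (~500 Mb)
MIN_FOLD = 15              # threshold suggested above for preads4falcon.fasta


def count_fasta_bases(lines):
    """Return (number of records, total bases) for an iterable of FASTA lines."""
    n_records = 0
    n_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_records += 1  # header line starts a new record
        else:
            n_bases += len(line)  # sequence line
    return n_records, n_bases


def fold_coverage(total_bases, genome_size=GENOME_SIZE):
    """Total bases expressed as a multiple of the expected genome size."""
    return total_bases / genome_size


# Toy input so the sketch runs on its own. For real data, pass an open file:
#   with open("1-preads_ovl/preads4falcon.fasta") as handle:
#       records, bases = count_fasta_bases(handle)
toy = [">pread_1", "ACGTACGT", "ACGT", ">pread_2", "GGCC"]
records, bases = count_fasta_bases(toy)
fold = fold_coverage(bases)
print(f"{records} records, {bases} bases, {fold:.2e}x of the expected genome")
if fold < MIN_FOLD:
    print("below 15x: consider lowering length_cutoff and length_cutoff_pr")
```

If the reported fold is well under 15, the length cutoffs are the first thing to revisit.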

 

Reply written 3.8 years ago by rhall

Hey, thanks, I got your point. I will now rerun the process with the following parameters; just tell me whether they are good to go.

length_cutoff = 500

length_cutoff_pr = 2500

--min_cov 20

After the process completes with the above parameters, I will get back to you.

One more thing: if you have any good reading material on Falcon parameters for diploid genomes, please let me know.

Thank you

Reply written 3.8 years ago (modified 3.8 years ago) by Sudhir Jadhao

I tried the new parameters. Now I get the errors below:

1) raise TaskFailureError("Counted %d failures." %failedJobCount)
TaskFailureError: 'Counted 1 failures.'
 No target specified, assuming "assembly" as target 

 

2) File "/home/bionivid-server/.local/lib/python2.7/site-packages/falcon_kit-0.4.0-py2.7-linux-x86_64.egg/falcon_kit/mains/run.py", line 447, in main1
    wf.refreshTargets(updateFreq = wait_time) # larger number better for more jobs, need to call to run jobs here or the # of concurrency is changed
  File "/usr/local/lib/python2.7/dist-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/controller.py", line 531, in refreshTargets
    rtn = self._refreshTargets(task2thread, objs = objs, callback = callback, updateFreq = updateFreq, exitOnFailure = exitOnFailure)
  File "/usr/local/lib/python2.7/dist-packages/pypeflow-0.1.1-py2.7.egg/pypeflow/controller.py", line 706, in _refreshTargets
    raise TaskFailureError("Counted %d failures." %failedJobCount)
pypeflow.controller.TaskFailureError: 'Counted 1 failures.'

Reply written 3.8 years ago by Sudhir Jadhao

500 and 2500 are too low for the cutoffs. Each cutoff should be calculated as the sequence length above which the reads together cover ~30x of your expected genome size.
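One way to turn that rule into a number is sketched below. It assumes you already have the read lengths as a list of integers (how they are extracted from the data is left out), and the function name and toy values are hypothetical; this is not code from Falcon itself.

```python
def length_cutoff_for_coverage(read_lengths, genome_size, target_fold=30):
    """Return the largest length L such that all reads with length >= L
    together contain at least target_fold * genome_size bases.

    Returns None when the whole data set is smaller than the target.
    """
    target_bases = target_fold * genome_size
    running_total = 0
    # Walk from the longest read down, accumulating bases until the target is met.
    for length in sorted(read_lengths, reverse=True):
        running_total += length
        if running_total >= target_bases:
            return length
    return None


# Toy example: a "genome" of 1 base and a 20x target, purely for illustration.
toy_lengths = [10, 8, 6, 4, 2]
print(length_cutoff_for_coverage(toy_lengths, genome_size=1, target_fold=20))

# For the data in this thread the call would look like (read_lengths not shown):
#   length_cutoff_for_coverage(read_lengths, genome_size=500_000_000)
```

With only 52x of raw data, the length that leaves ~30x may well come out lower than 12000, but it should be derived from the actual read length distribution rather than guessed.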

 

Reply written 3.7 years ago by rhall
lexnederbragt (Oslo, Norway) wrote, 3.8 years ago:

See my answer to a similar question here: A: Install HGAP for de novo PacBio Assembly.


My genome is a diploid plant genome. Will PBcR work on it?

Reply written 3.7 years ago by Sudhir Jadhao
thackl (MIT) wrote, 3.8 years ago:

Have a look at DBG2OLC as an alternative.

If you also have Illumina data alongside the PacBio data, that would allow you to look into hybrid correction and assembly...
