Question: Pacbio genome assembly and polishing
0
gravatar for ambarishnag
4 months ago by
ambarishnag0 wrote:

Hi, I am new to pacbio sequencing analysis and I would really appreciate an answer to question I have about the polishing of an assembled genome assembly using QUIVER as mentioned in the Genome Assembly with Falcon in the Methods Section of the paper Paajanen et al., Scientific Data, 4:170149, 2017.

In the subsection titled "Genome assembly with FALCON" in the Methods Section, the genome polishing is described as below:

“We used QUIVER to polish the PacBio assembly using part of the SMRTANALYSIS 2.3.0p5 (pacb. com) pipeline using default settings and the combined PacBio data from 7 SMRT cells in native h5 format of PacBio. In preparation for polishing, we had to rename the scaffolds from the FALCON assembly, by concatenating the fasta names, using a custom script, called rename_falcon_github.py (github. com/paajanen/FC_scientific_data/). We uploaded both the renamed primary contigs and the renamed alternate contigs separately as a reference the polishing using referenceUploader --skipIndexUpdate -c -n "reference" -p /path/to/ working_directory -f /path/to/reference.fasta --saw = "sawriter -blt 8 -welter" --samIdx = "samtools faidx" --jobId = "Anonymous" --verbose. The polishing step was submitted with command smrtpipe.py --params=polishing_params.xml and the parameter file polishing_par- ams.xml is available at github.com/paajanen/FC_scientific_data/ “

I have used the FALCON software for genome sequencing from PacBio reads. The documentation for fc_unzip mentions that it runs quiver if the bam files of the subreads are provided.

Which is the preferred workflow - i) running fc_run followed by fc_unzip or ii) running fc_run followed by the quiver based polishing as described by Paajanen et al. ?

Also I installed smrtlinks-ttols using conda but this set of tools does not contain the referenceUploader mentioned by Paajanen et al. Where can I find this tool ?

Thanks in advance.

ADD COMMENTlink modified 4 months ago • written 4 months ago by ambarishnag0
1

In case you had not seen this nice tutorial: Polish PacBio assembly with latest PacBio tools : an affordable solution for everyone

ADD REPLYlink written 4 months ago by genomax65k

Hi gconcepcion, thank you very much, your help is much appreciated ! ambarishnag

ADD REPLYlink written 4 months ago by ambarishnag0

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Upvote|Bookmark|Accept

ADD REPLYlink written 4 months ago by WouterDeCoster38k
2
gravatar for gconcepcion
4 months ago by
gconcepcion60
Menlo Park, CA
gconcepcion60 wrote:

Hi ambarishnag:

The paper you mentioned performed a collapsed diploid assembly which was then subsequently polished with Quiver after running FALCON with fc_run. This is an acceptable pipeline if you are not planning on phasing your assembly. If you are planning on phasing your assembly however, then you should run fc_run, and subsequently fc_unzip.py to phase your genome. After phasing is complete, then your primary contigs & haplotigs can be polished.

You should install the latest version of FALCON / FALCON_unzip from bioconda like this:

$ conda install pb-assembly

Please see the bioconda instructions for pb-assembly

https://github.com/PacificBiosciences/pb-assembly

https://github.com/PacificBiosciences/pbbioconda

ADD COMMENTlink modified 4 months ago • written 4 months ago by gconcepcion60

Hi gconception, that's exactly what I have done. Thank you very much !

I would really appreciate your help with a related question:

I have subread data in fasta files from two cells, each file being about 5 to 6 G.

I used the following fc_run.cfg config file:

[General]

input_fofn = input.fofn

input_type = raw

\# The length cutoff used for seed reads used for initial mapping

length_cutoff = 1000

\# The length cutoff used for seed reads usef for pre-assembly

length_cutoff_pr = 1000


pa_HPCdaligner_option =  -v -B128 -M32

pa_daligner_option =  -e.70 -l1200 -s100 -k14 -h256 -w8

ovlp_HPCdaligner_option = -v -B128 -M32

ovlp_daligner_option = -h600 -e.96 -l1800 -s100 -k24

pa_DBsplit_option = -x500 -s400

ovlp_DBsplit_option = -x500 -s400

\# error correction consensus optione

falcon_sense_option = --output_multi --min_idt 0.70 --min_cov 4 --max_n_read 200 --n_core 24

\# overlap filtering options

overlap_filtering_setting = --max_diff 100 --max_cov 100 --min_cov 1 --bestn 10 --n_core 24

[job.defaults]

job_type = local

\#use_tmpdir = /scratch

pwatcher_type = blocking

job_type = string

submit = bash -C ${CMD} >| ${STDOUT_FILE} 2>| ${STDERR_FILE}

NPROC=4

\#njobs=1

[job.step.da]

[job.step.pda]

[job.step.la]

\#NPROC=1

[job.step.pla]

\#NPROC=1

[job.step.cns]

\#njobs=1

NPROC=24

[job.step.asm]

NPROC=32

However, I noticed that the a-contig file (a_ctg.fa) is much smaller than the p-contig file (p_ctg.fa) in the FALCON output (349K vs 112M) and consequently the consensus haplotig file (cns_h_ctg.fasta) is much smaller than the consensus primary contain file (cns_p_ctg.fasta) in the FALCON-Unzip output (4.6M vs 113M). Is that a valid result or is that indicative of improper parameter values or some other artifact ?

Thanks a lot in advance. ambarishnag.

ADD REPLYlink modified 4 months ago by genomax65k • written 4 months ago by ambarishnag0
1

This is typical and more common in diploid species with low heterozygosity. You haven't mentioned the the ploidy, sequencing coverage, or anything about the biology of your species of interest so I can't be sure exactly what's going on here.

Assuming a diploid species, 113Mb primary contig file with a 4.6Mb haplotig tells me: 1) You have a highly inbred species/individual with a genome size of ~113Mb or so OR 2) You have a diploid species with low sequencing coverage, insufficient for phasing. (low coverage; significantly less than 50X)

ADD REPLYlink written 4 months ago by gconcepcion60

Hi gconcepcion, thank you very much. I an running into issues with running fc_unzip when I am using a different set of parameters length_cutoff = 2000 and length_cutoff_pr = 2000. Falcon runs fine but fc_unzip crashes with the following error: How do I avoid this error ? Your help is much appreciated. Thanks in advance, ambarishnag

[INFO 2018-12-02 11:32:46] Creating a haplotig graph. [INFO 2018-12-02 11:32:46] - Adding nodes. [INFO 2018-12-02 11:32:46] - region_id = 0, region_type = simple, region_pos_start = 0, region_pos_end = 20647 [INFO 2018-12-02 11:32:46] - [haplotig graph, adding node] key = 000412F_simple_0_base [INFO 2018-12-02 11:32:46] - [haplotig graph, adding node] key = 000412F-001-01 [INFO 2018-12-02 11:32:46] - Adding edges. [INFO 2018-12-02 11:32:46] - Hashing haplotigs. [INFO 2018-12-02 11:32:46] - Writing the haplotig graph in the gexf format. [INFO 2018-12-02 11:32:46] Writing the haplotig_graph.gfa. [INFO 2018-12-02 11:32:46] - Writing all the haplotigs to disk in haplotigs.fasta. [INFO 2018-12-02 11:32:46] Beginning to extract all p_ctg and h_ctg. [INFO 2018-12-02 11:32:46] Extracting primary contig: p_ctg_id = 000412F [INFO 2018-12-02 11:32:46] Making the haplotig segment coordinate relation lookup. [INFO 2018-12-02 11:32:46] haplotig_segment_coords = {0: 0, 20647: 20647}

[INFO 2018-12-02 11:32:46] Extracting the associate haplotigs for p_ctg_id = 000412F [ERROR 2018-12-02 11:32:46] Failure in generate_haplotigs_for_ctg((u'000412F', u'../../0-phasing/000412F/uow-00/proto', './uow-000412F', u'../../..', False)) Traceback (most recent call last): File "/rufous/anag/.conda-envs/denovo_asm/lib/python2.7/site-packages/falcon_unzip/mains/graphs_to_h_tigs_2.py", line 74, in run_generate_haplotigs_for_ctg return generate_haplotigs_for_ctg(ctg_id, allow_multiple_primaries, out_dir, unzip_dir, proto_dir, logger) File "/rufous/anag/.conda-envs/denovo_asm/lib/python2.7/site-packages/falcon_unzip/mains/graphs_to_h_tigs_2.py", line 256, in generate_haplotigs_for_ctg extract_and_write_all_ctg(ctg_id, haplotig_graph, all_haplotig_dict, phase_alias_map, out_dir, allow_multiple_primaries, fp_proto_log) File "/rufous/anag/.conda-envs/denovo_asm/lib/python2.7/site-packages/falcon_unzip/mains/graphs_to_h_tigs_2.py", line 959, in extract_and_write_all_ctg raise Exception(msg) Exception: Skipping additional subgraphs of the primary contig: 000412F. The graph has multiple primary components. Traceback (most recent call last): File "/rufous/anag/.conda-envs/denovo_asm/lib/python2.7/runpy.py", line 174, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/rufous/anag/.conda-envs/denovo_asm/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/rufous/anag/.conda-envs/denovo_asm/lib/python2.7/site-packages/falcon_unzip/mains/graphs_to_h_tigs_2.py", line 1495, in <module> main() File "/rufous/anag/.conda-envs/denovo_asm/lib/python2.7/site-packages/falcon_unzip/mains/graphs_to_h_tigs_2.py", line 1491, in main args.func(args) File "/rufous/anag/.conda-envs/denovo_asm/lib/python2.7/site-packages/falcon_unzip/mains/graphs_to_h_tigs_2.py", line 1253, in cmd_apply result = run_generate_haplotigs_for_ctg(exe) File "/rufous/anag/.conda-envs/denovo_asm/lib/python2.7/site-packages/falcon_unzip/mains/graphs_to_h_tigs_2.py", line 74, in run_generate_haplotigs_for_ctg return generate_haplotigs_for_ctg(ctg_id, allow_multiple_primaries, out_dir, unzip_dir, proto_dir, logger) File "/rufous/anag/.conda-envs/denovo_asm/lib/python2.7/site-packages/falcon_unzip/mains/graphs_to_h_tigs_2.py", line 256, in generate_haplotigs_for_ctg extract_and_write_all_ctg(ctg_id, haplotig_graph, all_haplotig_dict, phase_alias_map, out_dir, allow_multiple_primaries, fp_proto_log) File "/rufous/anag/.conda-envs/denovo_asm/lib/python2.7/site-packages/falcon_unzip/mains/graphs_to_h_tigs_2.py", line 959, in extract_and_write_all_ctg raise Exception(msg) Exception: Skipping additional subgraphs of the primary contig: 000412F. The graph has multiple primary components. 2018-12-02 11:32:46,914 - root - WARNING - Call '/bin/bash user_script.sh' returned 256. 2018-12-02 11:32:46,915 - root - WARNING - CD: '/rufous/anag/pacbio/FALCON-examples/run/eric_data/3-unzip/2-htigs/chunk_000412F' -> '/rufous/anag/pacbio/FALCON-examples/run/eric_data/3-unzip/2-htigs/chunk_000412F' 2018-12-02 11:32:46,915 - root - WARNING - CD: '/rufous/anag/pacbio/FALCON-examples/run/eric_data/3-unzip/2-htigs/chunk_000412F' -> '/rufous/anag/pacbio/FALCON-examples/run/eric_data/3-unzip/2-htigs/chunk_000412F' 2018-12-02 11:32:46,916 - root - CRITICAL - Error in /rufous/anag/.conda-envs/denovo_asm/lib/python2.7/site-packages/pypeflow/do_task.py with args="{'json_fn': '/rufous/anag/pacbio/FALCON-examples/run/eric_data/3-unzip/2-htigs/chunk_000412F/task.json',\n 'timeout': 30,\n 'tmpdir': None}"

ADD REPLYlink written 4 months ago by ambarishnag0
1

You may want to post long logs like this at pastebin.com and then post the relevant link here. There is a 5K character limit on biostars posts and having jumbled text like this (if I try to format the post it is going over the 5K limit) makes it hard to figure out what is going on.

ADD REPLYlink written 4 months ago by genomax65k

Hi @genomax, here is the link the error message: fc_unzip error

Please let me know if you cannot access the link. Your help with troubleshooting this issue is much appreciated. Thanks, ambarishnag

Also, I wanted to add the fc_unzip.cfg file:

[Unzip]

input_fofn=input.fofn

input_bam_fofn=input_bam.fofn

[job.defaults]

NPROC=4

njobs=7

#job_type = SGE

job_type = local

#use_tmpdir = /scratch

pwatcher_type = blocking

job_type = string

submit = bash -C ${CMD} >| ${STDOUT_FILE} 2>| ${STDERR_FILE}

#njobs=120

njobs=8

NPROC=4

[job.step.unzip.track_reads]

njobs=1

NPROC=48

[job.step.unzip.blasr_aln]

njobs=2

NPROC=16

[job.step.unzip.phasing]

njobs=16

NPROC=2

[job.step.unzip.hasm]

njobs=1

NPROC=48

[job.step.unzip.quiver]

njobs=2

NPROC=12

ADD REPLYlink modified 4 months ago • written 4 months ago by ambarishnag0
1

This issue has already been addressed here: https://github.com/PacificBiosciences/pbbioconda/issues/45

In the future, for troubleshooting FALCON pipeline assembly problems, please file an issue at the PacBio pbbioconda software issue tracker here:

https://github.com/PacificBiosciences/pbbioconda/issues

ADD REPLYlink modified 4 months ago • written 4 months ago by gconcepcion60

Hi gconcepion, thanks very much. I updated my pb-falcon to 0.2.4 and running fc_unzip right now, hopefully fc_unzip will not crash. Thanks so much ! ambarishnag.

ADD REPLYlink written 4 months ago by ambarishnag0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1194 users visited in the last hour