I have a draft genome assembly constructed from a single Illumina library. I have some 10x linked reads for the same genome, too. I want to use the 10x reads to scaffold the Illumina assembly.
I've identified two pieces of software - ARCS and Scaff10x - which should be able to do the job. I've got them both installed but I've had no luck with either of them.
What I have been searching unsuccessfully for, and what I would dearly love to get as a response to this post, is a step-by-step guide to getting either of these programs to do what I want. Given that that's obviously a big ask, I'll detail the specific problems I've been having below in the hope that someone can spot a solution.
Going by the ARCS documentation, I can't even get the process off the ground. The GitHub repo says I should read a makefile which will show me how the whole pipeline should be run, but I can make neither head nor tail of it (I've only ever used makefiles for compiling software). There is also an example bash script to run the pipeline, but it is uncommented and equally mystifying to me.
I have had slightly more luck with Scaff10x but am nowhere near understanding how to fix the errors I'm getting. I've made three attempts at using Scaff10x, each of which failed in a different way. First I tried giving the main scaff10x binary my Illumina assembly and a file listing where my 10x reads are (input.dat):
    full/path/to/scaff10x -nodes 8 -longread 0 -data input.dat \
        path/to/illumina_assembly.fasta \
        path/to/output/scaffolded/assembly.fasta
This ran fine but produced a "scaffolded" fasta file with the same number of sequences as the unscaffolded draft assembly (i.e. no scaffolding).
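For reference, the comparison behind that statement can be sketched as below. This is a minimal sketch that assumes plain, uncompressed FASTA files; the function names count_fasta_records and compare_assemblies are made up for illustration and are not part of Scaff10x.

```shell
# Count FASTA records by counting header lines (lines starting with '>').
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so ignore its exit status.
    grep -c '^>' "$1" || true
}

# Compare the number of sequences before and after scaffolding.
# Hypothetical helper: $1 is the draft assembly, $2 is the scaffolder output.
compare_assemblies() {
    before=$(count_fasta_records "$1")
    after=$(count_fasta_records "$2")
    echo "before: $before  after: $after"
    if [ "$after" -lt "$before" ]; then
        echo "scaffolded: yes"
    else
        echo "scaffolded: no"
    fi
}
```

Called as compare_assemblies path/to/illumina_assembly.fasta path/to/output/scaffolded/assembly.fasta, equal counts mean that no contigs were joined.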
I then tried running the scaff_reads binary on the 10x reads file. This program is supposed to be run first by the scaff10x binary, to format the 10x fastq files correctly. I did this mostly as an experiment to see what would happen:
    full/path/to/scaff_reads input.dat \
        path/to/output/reads-BC_1.fastq path/to/output/reads-BC_2.fastq > try.out
(try.out is just a log file.) This process ran without issue, but the read counts in the reads-BC_n.fastq files don't match those of the input fastq files:
    Input fastq files:
    Total R1: 392639440
    Total R2: 392639440
    Output fastq files:
    Total R1: 38385481
    Total R2: 45072386
Several hundred million reads have gone missing and the numbers of reads in the R1 and R2 output fastq files are different.
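Read totals like these can be obtained along the following lines. This is a sketch only: it assumes standard four-line FASTQ records (no wrapped sequence lines), and the names count_fastq_reads and check_pairs are hypothetical helpers, not part of Scaff10x.

```shell
# Count reads in a FASTQ file, assuming exactly four lines per record.
# Files whose names end in .gz are decompressed on the fly.
count_fastq_reads() {
    case "$1" in
        *.gz) gzip -cd "$1" ;;
        *)    cat "$1" ;;
    esac | awk 'END { printf "%d\n", NR / 4 }'
}

# Print both totals and succeed only if R1 and R2 hold the same number of reads.
check_pairs() {
    r1=$(count_fastq_reads "$1")
    r2=$(count_fastq_reads "$2")
    echo "Total R1: $r1"
    echo "Total R2: $r2"
    [ "$r1" -eq "$r2" ]
}
```

Running check_pairs on path/to/output/reads-BC_1.fastq and path/to/output/reads-BC_2.fastq returns a non-zero status when the two files are out of step, as they are here.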
Finally, I tried giving the scaff10x binary a bam file produced by 10x's LongRanger, which aligned my 10x reads to my draft Illumina assembly (and sorted and indexed the alignments). LongRanger makes use of the barcoding on the 10x reads, and providing a bam file directly means that I circumvent whatever is going so wrong with the scaff_reads binary:
    /full/path/to/scaff10x \
        -bam /full/path/to/possorted_bam.bam \
        -nodes 8 -longread 0 \
        path/to/illumina_assembly.fasta \
        path/to/output/scaffolded/assembly.fasta
This ran and returned exit status 0. Inspecting the results shows that it should have returned some other exit status! From stderr:
    sh: line 1: 15523 Segmentation fault
    /data/genomicsocorg/mwhj1/Programs/Scaff10X/src/scaff-bin/scaff_bwa-barcode tarseq.tag align0.dat align.dat > try.out
    /data/genomicsocorg/mwhj1/Programs/Scaff10X/src/scaff-bin/scaff_fastq: : Unknown error -1417898056
    sh: line 1: 23607 Segmentation fault
    /data/genomicsocorg/mwhj1/Programs/Scaff10X/src/scaff-bin/scaff_output -longread 0 -gap 100 genome.fastq contig.dat2 genome2.fastq genome2.agp > try.out
    /data/genomicsocorg/mwhj1/Programs/Scaff10X/src/scaff-bin/scaff_rename: : Unknown error -1417898056
    mv: cannot stat 'genome.fasta': No such file or directory
    mv: cannot stat 'genome-all.agp': No such file or directory
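Since the exit status can't be trusted here, a run can be checked from its stderr log and its expected output files instead. The following is a rough sketch under that assumption: check_run is a hypothetical helper, the log is assumed to have been captured with something like 2> scaff10x.stderr, and the only pattern it looks for is the "Segmentation fault" message seen above.

```shell
# Usage: check_run LOGFILE EXPECTED_OUTPUT...
# Fails if the captured stderr log reports a segmentation fault,
# or if any expected output file is missing or empty.
check_run() {
    log=$1
    shift
    status=0
    if grep -q 'Segmentation fault' "$log"; then
        echo "segmentation fault reported in $log" >&2
        status=1
    fi
    for f in "$@"; do
        # -s is true only for files that exist and are non-empty.
        if [ ! -s "$f" ]; then
            echo "missing or empty output: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, check_run scaff10x.stderr path/to/output/scaffolded/assembly.fasta would flag the run above as failed even though scaff10x itself exited with status 0.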
So that's where I'm at, and I would be deeply grateful to anyone who could lend me any advice or wisdom on how to proceed. If you've read this far, thanks.