Question

Help scaffolding an Illumina genome assembly with 10x reads

2

5.4 years ago

maxwhjohn1988 ▴ 130

Hi everyone

I have a draft genome assembly constructed from a single Illumina library. I have some 10x linked reads for the same genome, too. I want to use the 10x reads to scaffold the Illumina assembly.

I've identified two pieces of software - ARCS and Scaff10x - which should be able to do the job. I've got them both installed but I've had no luck with either of them.

What I have been searching unsuccessfully for, and what I would dearly love to get as a response to this post, is a step-by-step guide to getting either of these programs to do what I want. Given that that's obviously a big ask, I'll detail the specific problems I've been having below in the hope that someone can spot a solution.

ARCS

Going by the ARCS documentation, I can't even get the process off the ground. The GitHub repo says I should read a makefile, which will show me how the whole pipeline should be run. I can make neither head nor tail of it (I've only ever used makefiles for compiling software). There is also an example bash script for running the pipeline, but it is uncommented and equally mystifying to me.

Scaff10x

I have had slightly more luck with Scaff10x but am nowhere near understanding how to fix the errors I'm getting. I've made three attempts at using Scaff10x, each of which failed in a different way. First I tried giving the main scaff10x binary my Illumina assembly and a file listing where my 10x reads are (input.dat):

full/path/to/scaff10x -nodes 8 -longread 0 -data input.dat \
path/to/illumina_assembly.fasta \
path/to/output/scaffolded/assembly.fasta

This ran fine but produced a "scaffolded" fasta file with the same number of sequences as the unscaffolded draft assembly (i.e. no scaffolding).
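A quick way to confirm whether any scaffolding happened is to count the FASTA header lines in both files. The sketch below assumes uncompressed FASTA input; the function names are made up for illustration and are not part of Scaff10x.

```shell
#!/usr/bin/env bash
# Count FASTA records, i.e. lines that start with '>'.
# Assumes an uncompressed FASTA file.
count_fasta_records() {
    # grep -c exits non-zero when the count is 0, so mask that status.
    grep -c '^>' "$1" || true
}

# Compare sequence counts before and after scaffolding.
# Fewer sequences in the second file means some contigs were joined.
compare_assemblies() {
    local before after
    before=$(count_fasta_records "$1")
    after=$(count_fasta_records "$2")
    echo "draft: $before sequences, scaffolded: $after sequences"
    if [ "$after" -lt "$before" ]; then
        echo "sequences were joined"
    else
        echo "no sequences were joined"
    fi
}
```

For the run above this would be called as `compare_assemblies path/to/illumina_assembly.fasta path/to/output/scaffolded/assembly.fasta`.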

I then tried running the scaff_reads binary on the 10x reads file. This program is supposed to be run first by the scaff10x binary, to format the 10x fastq files correctly. I did this mostly as an experiment to see what would happen:

full/path/to/scaff_reads input.dat \
path/to/output/reads-BC_1.fastq path/to/output/reads-BC_2.fastq > try.out

(try.out is just a log file). This process ran without issue but the reads-BC_n.fastq files don't match up with the input fastq files:

Input fastq files: 
Total R1: 392639440 Total R2: 392639440

Output fastq files: 
Total R1: 38385481 Total R2: 45072386

Several hundred million reads have gone missing and the numbers of reads in the R1 and R2 output fastq files are different.
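Counts like these can be reproduced with a short check. This is a minimal sketch that assumes standard four-line FASTQ records (no wrapped sequence lines) and a gzip that passes uncompressed input through unchanged with `-cdf`, as GNU gzip does; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Count reads in a FASTQ file, plain or gzipped.
# Assumes exactly four lines per record.
count_fastq_reads() {
    local lines
    lines=$(gzip -cdf "$1" | wc -l)
    echo $(( lines / 4 ))
}

# Print the totals for an R1/R2 pair and return non-zero
# if the two files do not hold the same number of reads.
check_pair() {
    local r1 r2
    r1=$(count_fastq_reads "$1")
    r2=$(count_fastq_reads "$2")
    echo "Total R1: $r1 Total R2: $r2"
    [ "$r1" -eq "$r2" ]
}
```

For example, `check_pair path/to/output/reads-BC_1.fastq path/to/output/reads-BC_2.fastq || echo "R1 and R2 are out of sync"`.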

Finally I tried providing a bam file to the scaff10x binary which was produced by 10x's LongRanger, aligning (and sorting and indexing) my 10x reads to my draft Illumina assembly. LongRanger makes use of the barcoding on the 10x reads and providing a bam file directly means that I circumvent whatever is going so wrong with the scaff_reads binary:

/full/path/to/scaff10x \
-bam /full/path/to/possorted_bam.bam \
-nodes 8 -longread 0 \
path/to/illumina_assembly.fasta \
path/to/output/scaffolded/assembly.fasta

This ran and returned exit status 0. Inspecting the results shows that it should have returned some other exit status! From stderr:

sh: line 1: 15523 Segmentation fault /data/genomicsocorg/mwhj1/Programs/Scaff10X/src/scaff-bin/scaff_bwa-barcode tarseq.tag align0.dat align.dat > try.out
/data/genomicsocorg/mwhj1/Programs/Scaff10X/src/scaff-bin/scaff_fastq: : Unknown error -1417898056
sh: line 1: 23607 Segmentation fault /data/genomicsocorg/mwhj1/Programs/Scaff10X/src/scaff-bin/scaff_output -longread 0 -gap 100 genome.fastq contig.dat2 genome2.fastq genome2.agp > try.out
/data/genomicsocorg/mwhj1/Programs/Scaff10X/src/scaff-bin/scaff_rename: : Unknown error -1417898056
mv: cannot stat 'genome.fasta': No such file or directory
mv: cannot stat 'genome-all.agp': No such file or directory
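Because the wrapper reported exit status 0 even though its sub-steps crashed, the exit status alone cannot be trusted here. A rough safeguard is to save stderr to a file and then check it, along with the expected outputs. In the sketch below the function name and the log file name are hypothetical, and the patterns are simply the messages seen in the stderr above.

```shell
#!/usr/bin/env bash
# Decide whether a pipeline run really succeeded.
# Usage: run_ok STDERR_LOG OUTPUT_FILE [OUTPUT_FILE ...]
# Returns non-zero if the log contains crash messages or if any
# expected output file is missing or empty.
run_ok() {
    local stderr_log=$1
    shift
    local rc=0 f
    if grep -q -E 'Segmentation fault|Unknown error|cannot stat' "$stderr_log"; then
        echo "crash messages found in $stderr_log" >&2
        rc=1
    fi
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty output: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}
```

With stderr from the scaff10x run redirected to, say, `scaff10x.stderr`, the check would be `run_ok scaff10x.stderr path/to/output/scaffolded/assembly.fasta`.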

So that's where I'm at, and I would be deeply grateful to anyone who could lend me any advice or wisdom on how to proceed. If you've read this far, thanks.

Assembly genome next-gen • 2.6k views

updated 4.3 years ago by adc0032 ▴ 20 • written 5.4 years ago by maxwhjohn1988 ▴ 130

0

I've found a paper that goes into considerable detail, with examples, on how the authors ran Scaff10x (see Section 3.2). I'm going to give this a try and will report back.

https://przeworskilab.com/wp-content/uploads/acropora-millepora-assembly.pdf

5.4 years ago by maxwhjohn1988 ▴ 130

0

Did you gather any useful knowledge about the best way to use 10X reads? All the software I've tried seems to act up.

4.8 years ago by predeus ★ 2.0k

score 0 · Answer 1 · 2020-05-13

Have you found a solution to your problem?

Have you processed your 10X reads with longranger basic?

I have been using ARCS to scaffold contigs from a Supernova assembly, using the same 10X reads the assembly was created from. I ran ARCS within Tigmint, which has a makefile that can be run with pretty simple commands. Brian Faircloth also has information on getting those programs compiled and running (although I think he uses ARKS).
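As a rough illustration of what "pretty simple commands" looks like: as far as I recall from the Tigmint README, the wrapper is invoked as `tigmint-make arcs draft=<name> reads=<name>`, with the assembly stored as `<name>.fa` and the interleaved, barcoded reads (for example the output of longranger basic) as `<name>.fq.gz` in the working directory. Treat that invocation and the naming convention as assumptions to verify against your installed version. The helper below is hypothetical; it only checks the inputs and prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Print (do not run) the Tigmint + ARCS command for a draft and a read set.
# Assumes DRAFT.fa and READS.fq.gz exist in the current directory, and that
# tigmint-make takes the names without their file extensions.
tigmint_arcs_command() {
    local draft=$1 reads=$2
    [ -s "$draft.fa" ] || { echo "missing $draft.fa" >&2; return 1; }
    [ -s "$reads.fq.gz" ] || { echo "missing $reads.fq.gz" >&2; return 1; }
    echo "tigmint-make arcs draft=$draft reads=$reads"
}
```

Printing first makes it easy to inspect the command before launching a long job; make's own `-n` dry-run flag should serve a similar purpose once the command is run for real.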