How long to read files

I have been using ABySS for many years, and it seems to be taking longer and longer to read in the sequencing files and produce the first coverage.hist file. Or perhaps I am just getting bigger projects. My current project is 160 Gbase of 250 bp reads from a mammalian genome (~3 Gbase, so ~50x coverage). I am using a fairly new Intel-based 20-CPU, 256 GB machine, but I also have access to older and slower machines. I am running ABySS v1.9 with the v=-vv option (for logging purposes). Other parameters include s=202 n=10 l=40 plus k=XXX.
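
To make the setup concrete, here is a minimal sketch of the kind of abyss-pe invocation being described. The file names, the name= label, and the np/j process/thread counts are assumptions for illustration only, and k=96 is just an example value standing in for the post's k=XXX:

abyss-pe name=mammal k=96 s=202 n=10 l=40 v=-vv \
    np=20 j=20 \
    in='reads_R1.fastq reads_R2.fastq'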

Extrapolating from the log file, it looks like it will take about 10 days just to read in the paired-end files, and then comes all of the processing time needed to produce scaffolds. A different mammalian project I did a couple of months ago, with 100 Gbase of 100 bp reads, took about 5 to 6 days to read in the files as I recall, so the 10-day estimate from the log file appears to be valid.

I can run either with a single R1 and R2 file as input, in which case two CPUs are active while the others sit idle, or I can split the R1/R2 files into multiple parts and get all CPUs reading files. Splitting up the files doesn't seem to help the speed very much. The files are on a fast NFS "scratch" space. Tests with a subset of the data on a RAMdisk also don't improve the speed much. Different compile options don't seem to make much difference either.
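
For what it's worth, a minimal sketch of the kind of file splitting described above, assuming uncompressed FASTQ and placeholder file names; the line count must be a multiple of 4 so FASTQ records stay intact, and R1/R2 must use the same count so pairs stay in sync:

LINES=64000000                      # 16 million reads per chunk; adjust to taste
split -l $LINES -d reads_R1.fastq R1.part_
split -l $LINES -d reads_R2.fastq R2.part_

The resulting chunks are then listed pairwise (R1 chunk, matching R2 chunk, and so on) in abyss-pe's in='...' parameter.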

I usually do parameter sweeps over different k-values (along the lines of the sketch after the questions below); however, it is hard to do so when a single ABySS run takes weeks. I am getting irritated by this. Anyway, my questions:

  1. Is a 10-day read time reasonable for 160 Gbase?
  2. Are there any parameters I can try tweaking in order to speed up the loading process?
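
(For reference, the kind of k-sweep mentioned above might look like the following, assuming one output directory per k; the k range and directory names are placeholders only.)

for K in 80 96 112 128; do          # example k range only
    mkdir -p k$K
    (cd k$K && abyss-pe name=mammal k=$K s=202 n=10 l=40 v=-vv \
        in='../reads_R1.fastq ../reads_R2.fastq')
done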

From limited timing tests (done by hacking the code), it appears that the time is mostly spent adding kmers to the initial graph (seqCollection->add) rather than in reading the files themselves. Occasionally I also see spikes in the seqCollection->pumpNetwork() call, which I find strange. However, given my limited C++ hacking skills, those tests are open to question.


Hi Rick,

No, 10 days to read in 160 Gbp of sequencing data is not at all normal in my experience. I can understand why you would be annoyed. Roughly speaking, I might expect it to take ~12 hours to load the data, and to be much faster than that if you split up the files.

Notwithstanding your experiment with the RAMFS, I would be highly suspicious of your filesystem. If I were in your position, I would find some way to benchmark your file I/O and compare the results to what is expected for a typical high-performance computing cluster.
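
One rough way to do that, sketched here with placeholder paths and sizes, is a sequential dd test run once on the NFS scratch space and again on a local disk, comparing the MB/s figures that dd reports:

cd /path/to/nfs/scratch
dd if=/dev/zero of=ddtest.bin bs=1M count=10240    # ~10 GB write test
dd if=ddtest.bin of=/dev/null bs=1M                # read test (may be served from cache)
rm ddtest.bin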

Besides file splitting, I'm not aware of anything else you could do to speed things up.


I think I have traced the problem to how Google sparsehash performs under high kmer loads. When it has to resize and rehash, performance drops. If it has to do this multiple times during the read process, ABySS's performance suffers a lot.

250 bp reads have a lot of errors at the 3' end. This creates a lot of unique kmers, which appear to overflow the default 200M rehash size (found in Assembly/DBG.h). Increasing this value to 800M or 1G allows me to read in 1/6th of the data in slightly over 2 hours, which extrapolates to a reasonable 12 hours for the full data set. Unfortunately, the full data set is still giving me problems, but that seems to stem from a simple lack of memory.

With the 1/6 data set I have 3.6G unique kmers (count=1) and another ~2.1G kmers with count > 1. Other projects I have done do not show this pattern. For example, a bird had 1.1G count=1 and 1.4G count > 1 kmers, and a different mammal had 2.3G count=1 and 2.5G count > 1. That last project was difficult as well, as I recall.

I am thinking that since I have large-memory machines, my best bet -- aside from cleaning up the data before ABySS -- is to compile ABySS with a larger rehash size. It would be nice to have this as a command-line parameter if at all possible.
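
For anyone wanting to try the same thing, a rough sketch of the rebuild step, assuming a from-source build with the sparsehash headers under /usr/local/include and a placeholder install prefix (the edit to the constant in Assembly/DBG.h itself is not shown):

./configure CPPFLAGS=-I/usr/local/include --prefix=$HOME/abyss-bigrehash
make -j 20
make install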


If the problem is error kmers, you could try using BBNorm to throw away high-error-rate reads, or Tadpole to error-correct them. Either one should be pretty fast...

If the library was overlapping, you can also error-correct the overlapping bases with BBMerge, which can reduce the error rate somewhat.

For example -

To discard error-prone reads with an apparent depth of under 3, using fixed memory:

bbnorm.sh in=reads.fq out=highpass.fq target=9999 min=3 bits=16 prefilter passes=1

Note that 2-pass normalization with BBNorm would reduce the error rate by selectively discarding low-quality reads, but it won't work in this case since you only have ~50x coverage; it requires at least triple your target coverage.

To error-correct by overlap (very fast and low-memory):

bbmerge.sh in=reads.fq out=corrected.fq ecco

To error-correct by kmer-counting (some of these flags require version 35.37+), using ~30 bytes/kmer with count over 2:

tadpole.sh in=reads.fq out=corrected.fq prefilter=2 mode=correct k=50 prealloc prepasses=auto shave rinse

Of course, the resulting assemblies will no longer really be pure ABySS assemblies... but at least they'd (potentially) finish quickly.


Brian:

Thanks for the advice. I was trying out your ecc.sh (a.k.a. bbnorm) for error correction, but having just now read your post on SeqAnswers about Tadpole, I am anxious to try it out, both for error correction and for contig assembly itself.


By the way, this data set is 2x250 bp reads generated from one flow cell of the new v2 Illumina rapid chemistry, using an Illumina PCR-free library made from mammalian DNA. Although the quality is high, the error rate is expected to increase along the read, and it does.

[Image: ErrorFreqCycle -- error frequency by cycle, from Illumina SAV]

This is an image from Illumina's "Sequence Analysis Viewer" (SAV) for the run used to generate this data. The x-axis is the cycle (base) number, and "R1" and "R2" refer to the respective read numbers. Since Q30 denotes 10X more errors than Q40, one could see the second 125 bases of each read generating 10X the number of entries in the hash table compared to the first 125 bases.

Rick also ran the data through the Broad's "DISCOVAR de novo" and the result looked quite good (107 kb N50 with just the PE data, and 1.9 Mbp N50 after scaffolding with 2 lanes of an Illumina TruSeq "no gel" mate-pair library), although it peaked at 420 GB of RAM. We have used ABySS for years, though, and have a certain amount of trust in it, so we would prefer to validate the Discovar assembly with ABySS.

--

Phillip
