Question: 10x Supernova de novo assembly
igor wrote (2.8 years ago):

I am working on my first 10x Genomics de novo assembly. For those who may not be familiar with it, 10x adds barcodes to Illumina short reads that come from the same long DNA fragment; those shared barcodes are then used to reconstruct long "linked" reads. Since these "linked" reads are non-standard and the technology is fairly new, I don't think there are any tools available for processing the data other than the official Supernova software. Has anyone had luck with alternate assemblers for 10x data?
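To make the "linked" structure concrete, here is a minimal sketch of how per-GEM read pools could be built. It assumes the Chromium layout in which the first 16 bp of R1 is the GEM barcode; the exact read layout and the barcode whitelist are chemistry-dependent, so treat this as illustrative only:

```python
from collections import defaultdict

BARCODE_LEN = 16  # assumed Chromium GEM barcode length

def pool_by_barcode(read_pairs):
    """Group (R1, R2) sequence pairs by the barcode at the start of R1."""
    pools = defaultdict(list)
    for r1, r2 in read_pairs:
        barcode = r1[:BARCODE_LEN]
        pools[barcode].append((r1[BARCODE_LEN:], r2))  # barcode trimmed off R1
    return pools

# Toy read pairs: the first two share a GEM barcode, the third does not.
pairs = [
    ("AAAACCCCGGGGTTTT" + "ACGTACGT", "TTTTAAAA"),
    ("AAAACCCCGGGGTTTT" + "CCCCGGGG", "GGGGCCCC"),
    ("TTTTGGGGCCCCAAAA" + "ACGTACGT", "TTTTAAAA"),
]
pools = pool_by_barcode(pairs)
print(len(pools))                       # 2 distinct GEM barcodes
print(len(pools["AAAACCCCGGGGTTTT"]))   # 2 pairs share the first GEM
```

Each pool corresponds to reads from the handful of long molecules loaded into one GEM, which is the structure a linked-read-aware assembler exploits.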

The reason I am asking is that Supernova is essentially failing for me. I used one HiSeq lane of input, which is fairly reasonable. Based on my discussions with 10x, that amount of data should take about a week to process. I had it going for 4 weeks and it did not finish. Then the server had to be restarted, so the job was killed. Even half a lane did not finish after several weeks. If I use a small subset of data (like 10M or 100M reads), the process finishes. The results are terrible, but at least it shows there aren't any problems with the dependencies or the environment. I was wondering if anyone here had run into similar problems and found any solutions. Normally, if a certain tool has issues, you can try a different one, but that does not seem to be the case here.

Tags: assembly, 10x
-- modified 2.2 years ago by from the mountains • written 2.8 years ago by igor

Have you tried other long-read assemblers, like Celera, or tools intended for PacBio?

-- written 2.8 years ago by Brian Bushnell

I think the issue is getting hold of GEM-specific pools of reads. I am still familiarizing myself with some 10x data.

-- written 2.8 years ago by genomax

Exactly. If I could get to long reads, that would be fantastic. There are basically two steps: converting short reads to "long" reads and then assembling those reads. Unfortunately, those two steps are combined in their software and there is no way to just do one or the other.

-- written 2.8 years ago by igor

Oh, I see. Well, good luck :) If I had access to some 10x data, I might be able to write something that converted it to long reads via Tadpole, but I've never seen any 10x data.

-- written 2.8 years ago by Brian Bushnell

They post some examples here if you want to check: http://support.10xgenomics.com/de-novo-assembly/datasets

I think the biggest problem is that the whole process is not described in detail (for example, the barcode sequences aren't published as far as I know), so it would probably require some reverse engineering.

-- written 2.8 years ago by igor

Have you tried longranger mkfastq followed by longranger basic? That yields a single interleaved FASTQ file labeled with barcodes. The R1/R2 order is not guaranteed, and there are no R1/R2 markers in the headers. It may be better to work with the files from longranger mkfastq.
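Assuming longranger basic attaches barcodes as a SAM-style BX:Z: field in the FASTQ header comment (the header format should be verified against the Long Ranger documentation for your version), a hypothetical sketch of bucketing those reads per barcode:

```python
from collections import defaultdict

def barcode_of(header):
    """Return the BX barcode from a FASTQ header comment, or None."""
    for field in header.split()[1:]:
        if field.startswith("BX:Z:"):
            return field[5:]
    return None

# Toy headers in the assumed "@name BX:Z:<barcode>-1" format.
headers = [
    "@read1 BX:Z:ACGTACGTACGTACGT-1",
    "@read2 BX:Z:ACGTACGTACGTACGT-1",
    "@read3",  # no BX tag: an unbarcoded read
]
pools = defaultdict(list)
for h in headers:
    pools[barcode_of(h)].append(h)
print(sorted(k for k in pools if k))  # ['ACGTACGTACGTACGT-1']
print(len(pools[None]))               # 1 unbarcoded read
```

Since the R1/R2 order is not guaranteed in the output, any real re-pairing step would still need to match mates by read name rather than by position.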

-- written 2.8 years ago by genomax

That's an interesting idea. I never checked the longranger documentation since it's a different workflow. longranger mkfastq is basically a bcl2fastq wrapper and just complicates things if you are already familiar with bcl2fastq. longranger basic sounds promising, though. Even if it works as expected, assembling the linked reads yourself is not a trivial undertaking.

-- written 2.8 years ago by igor

Did the server you were using meet their required specifications? Performing assembly on a full lane will take much, much more RAM and CPU than a fifth (100M reads). I've had good luck with their assembly software but it needs ample resources to do its job properly.

-- written 2.8 years ago by Dan D

I ran it with 16 threads and 512 GB RAM, which is more than they require. Also, there would probably be a memory-related error if that were the problem.

Nice to hear that someone else got it to work, though.
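For reference, a sketch of how those resource caps are passed at launch; --localcores and --localmem are the flags Supernova exposes for local-mode runs (check `supernova run --help` for your version), and the ID and paths here are placeholders:

```shell
# Hypothetical Supernova launch with explicit CPU and memory caps.
supernova run \
    --id=sample_asm \
    --fastqs=/path/to/fastqs \
    --localcores=16 \
    --localmem=512
```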

-- modified 2.8 years ago • written 2.8 years ago by igor

"16 threads and 512MB RAM"

That must be 512GB :)

-- written 2.8 years ago by genomax

You're right. It's been a long week.

-- written 2.8 years ago by igor
genomax wrote (2.6 years ago):

I was able to complete a Supernova run in less than 2 days for a human sample (from a cell line). I had about 650M reads from a HiSeq 4000. I used 12 cores on a node that had a lot of RAM. I did not watch the process continuously, but memory usage stayed below ~1 TB.

-- modified 2.6 years ago • written 2.6 years ago by genomax

Sounds reasonable. Happy to hear it's working as expected for someone.

The run I started in December with 16 threads and 512 GB memory with one full HiSeq 4000 lane of data is still going. I am sure there is something odd about the library.

-- modified 2.6 years ago • written 2.6 years ago by igor

It probably needed more than 512 GB, so perhaps that is the problem you are facing. Is the job doing anything at all (1+ month is odd)?

-- written 2.6 years ago by genomax

Still going. Only 200% CPU, but many processes.

I doubt it's the memory. When I tried it with lower limits, it complained fairly quickly.

-- written 2.6 years ago by igor

If it is of any help, I used the longranger wgs pipeline on high-memory nodes (LSF) with 200 GB RAM. The data was human WGS (40x coverage) sequenced on a HiSeq X Ten, and the run time was 15 days. Their Lariat aligner, which I guess is written in Python, is the most time-consuming step. And you are right: I tried using normal nodes and it died on me because of memory issues. In any case, the run time can be even longer for de novo assembly.

-- written 2.2 years ago by badribio

I just ran the mkoutput. Took ~10 mins. We are going to collect more data for this sample so I will post an update to see if the assembly improves.
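For anyone following along, a sketch of the mkoutput step, assuming the documented --style options (raw, megabubbles, pseudohap, pseudohap2; verify against your Supernova version). Paths are placeholders:

```shell
# Hypothetical FASTA export from a finished Supernova assembly directory.
# --style=pseudohap collapses each bubble to a single haplotype.
supernova mkoutput \
    --asmdir=sample_asm/outs/assembly \
    --outprefix=sample_asm_pseudohap \
    --style=pseudohap
```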

-- modified 2.6 years ago • written 2.6 years ago by genomax

In case anyone is wondering, my Supernova run finally finished. It took 6 weeks. Obviously something is wrong since the output is essentially useless (0.02 Gb assembly size), but at least I don't have to wonder anymore. This is the summary report:

--------------------------------------------------------------------------------
INPUT
-  450.81 M   = READS          = number of reads; ideal 800-1200 for human
-  139.00 b   = MEAN READ LEN  = mean read length after trimming; ideal 140
-   13.19 x   = EFFECTIVE COV  = effective read coverage; ideal ~42 for nominal 56x cov
-   58.48 %   = READ TWO Q30   = fraction of Q30 bases in read 2; ideal 75-85
-    0.35 kb  = MEDIAN INSERT  = median insert size; ideal 0.35-0.40
-   81.86 %   = PROPER PAIRS   = fraction of proper read pairs; ideal >=75
-   18.09 kb  = MOLECULE LEN   = weighted mean molecule size; ideal 50-100
-    0.00 kb  = HETDIST        = mean distance between heterozygous SNPs
-   11.94 %   = UNBAR          = fraction of reads that are not barcoded
-  360.00     = BARCODE N50    = N50 reads per barcode
-   31.63 %   = DUPS           = fraction of reads that are duplicates
-   21.96 %   = PHASED         = nonduplicate and phased reads; ideal 45-50
--------------------------------------------------------------------------------
OUTPUT
-    1.55 K   = LONG SCAFFOLDS = number of scaffolds >= 10 kb
-    0.50 kb  = EDGE N50       = N50 edge size
-   10.22 kb  = CONTIG N50     = N50 contig size
-    0.00 Mb  = PHASEBLOCK N50 = N50 phase block size
-    0.01 Mb  = SCAFFOLD N50   = N50 scaffold size
-    0.01 Mb  = SCAFFOLD N60   = N60 scaffold size
-    0.02 Gb  = ASSEMBLY SIZE  = assembly size (only scaffolds >= 10 kb)
--------------------------------------------------------------------------------
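As a rough sanity check on those numbers (the ~0.8 Gb genome size is the figure given for this species elsewhere in the thread, and is an assumption here): raw coverage from the read count and length is far above the reported 13.19x effective coverage, so most of the data is being discounted as duplicates, unbarcoded reads, or low-quality bases.

```python
# Back-of-the-envelope coverage check using the report values above.
reads = 450.81e6        # READS
read_len = 139          # MEAN READ LEN, bases (post-trim)
genome = 0.8e9          # assumed genome size in bases
raw_cov = reads * read_len / genome
print(round(raw_cov, 1))  # 78.3
# Reported effective coverage is only 13.19x, consistent with the
# high DUPS (31.63%) and UNBAR (11.94%) fractions in the report.
```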
-- modified 2.6 years ago • written 2.6 years ago by igor

Was this a human sample? The one I ran ended up at ~1.6 Gb.

-- written 2.6 years ago by genomax

No. It was a different species. Should be around 0.8 Gb.

So does it mean half the genome was missing for you? Is that normal?

-- modified 2.6 years ago • written 2.6 years ago by igor

My guess is that your heterozygosity level is above 1%, and we have seen that Supernova really doesn't like that. What does GenomeScope say about your reads? (Please use unbarcoded reads; otherwise GenomeScope will take the barcodes into the calculation.)
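A sketch of preparing a GenomeScope input along those lines, with the 16 bp GEM barcode trimmed off R1 first so it doesn't inflate the k-mer spectrum. The barcode length is chemistry-dependent and the filenames are placeholders, so verify both for your data:

```shell
# Trim the assumed 16 bp barcode from the start of R1 (seqtk trimfq -b).
seqtk trimfq -b 16 R1.fastq.gz > R1.trimmed.fastq

# Count canonical 21-mers and build the histogram GenomeScope expects.
jellyfish count -C -m 21 -s 1G -t 8 -o reads.jf R1.trimmed.fastq R2.fastq
jellyfish histo -t 8 reads.jf > reads.histo   # upload reads.histo to GenomeScope
```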

-- written 2.3 years ago by jvhaarst
from the mountains wrote (2.2 years ago):

They've just come out with a technical note on smaller assemblies: https://support.10xgenomics.com/de-novo-assembly/sample-prep/doc/technical-note-guidelines-for-de-novo-assembly-of-genomes-smaller-than-~3-gb-using-10x-genomics-supernova-v12

-- written 2.2 years ago by from the mountains

More importantly, they came out with a new version of the reagents (v2). We had multiple failed assemblies with v1, but all worked with v2.

-- written 2.2 years ago by igor
Powered by Biostar version 2.3.0