Question: 10x Supernova de novo assembly
2 votes
igor (United States) wrote 3.7 years ago:

I am working on my first 10x Genomics de novo assembly. For those who may not be familiar with it, 10x adds barcodes to Illumina short reads coming from the same long fragments that are then used to reconstruct long "linked" reads. Since these are non-standard "linked" reads and the technology is fairly new, I don't think there are any tools available for processing the data other than the official Supernova software. Has anyone had luck with alternate assemblers for 10x data?

The reason I am asking is because Supernova is essentially failing for me. I used one HiSeq lane of input, which is fairly reasonable. Based on my discussions with 10x, that amount of data should take about a week to process. I had it going for 4 weeks and it did not finish. Then, the server had to be restarted, so the job was killed. Even half a lane did not finish after several weeks. If I use a small subset of data (like 10M or 100M reads), the process finishes. The results are terrible, but at least it shows there aren't any problems with the dependencies or the environment. I was wondering if anyone here had run into any similar problems and if they had any solutions. Normally, if a certain tool has issues, you can try a different one, but that does not seem to be the case here.

assembly 10x • 4.7k views
modified 9 months ago by samnioue • written 3.7 years ago by igor
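The subset test described above (running on, say, 10M or 100M reads instead of a full lane) can be done by copying the first N read pairs out of an interleaved FASTQ. A minimal sketch, with hypothetical file names:

```python
import gzip
from itertools import islice

def head_pairs(in_path, out_path, n_pairs):
    """Copy the first n_pairs read pairs (8 FASTQ lines per pair) from an
    interleaved FASTQ, plain or gzipped, into a new file."""
    opener = gzip.open if in_path.endswith(".gz") else open
    with opener(in_path, "rt") as fin, open(out_path, "wt") as fout:
        fout.writelines(islice(fin, 8 * n_pairs))

# e.g. head_pairs("lane.fastq.gz", "subset_10M.fastq", 5_000_000)
# would keep 10M reads (5M pairs); file names here are illustrative.
```

Note that a head-of-file subset is not a random sample, so barcode composition may be skewed; a proper subsampler (e.g. seqtk) is preferable for anything beyond a smoke test.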

Have you tried other long-read assemblers, like Celera, or tools intended for PacBio?

written 3.7 years ago by Brian Bushnell

I think the issue is getting hold of GEM-specific pools of reads. I am still familiarizing myself with 10x data.

written 3.7 years ago by genomax

Exactly. If I could get to long reads, that would be fantastic. There are basically two steps: converting short reads to "long" reads and then assembling those reads. Unfortunately, those two steps are combined in their software and there is no way to just do one or the other.

written 3.7 years ago by igor
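The first of those two steps, grouping reads that share a barcode back into "read clouds", can be sketched in a few lines, assuming (as in Chromium libraries) that the barcode is the first 16 bp of read 1. The function name and data layout here are illustrative:

```python
from collections import defaultdict

BARCODE_LEN = 16  # Chromium gel-bead barcode length (assumption)

def cloud_index(pairs):
    """Group (r1_seq, r2_seq) tuples into read clouds keyed by the putative
    barcode, taken as the first BARCODE_LEN bases of read 1."""
    clouds = defaultdict(list)
    for r1, r2 in pairs:
        barcode = r1[:BARCODE_LEN]
        # Trim the barcode off R1; what remains plus R2 is the genomic pair.
        clouds[barcode].append((r1[BARCODE_LEN:], r2))
    return clouds
```

A real implementation would also need barcode error correction against a whitelist, which is the part that is not published.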

Oh, I see. Well, good luck :) If I had access to some 10x data, I might be able to write something that converted it to long reads via Tadpole, but I've never seen any 10x data.

written 3.7 years ago by Brian Bushnell

They post some examples here if you want to check: http://support.10xgenomics.com/de-novo-assembly/datasets

I think the biggest problem is that the whole process is not described in detail (for example, the barcode sequences aren't published as far as I know), so it would probably require some reverse engineering.

written 3.7 years ago by igor

Have you tried longranger mkfastq followed by longranger basic? That yields a single file of interleaved FASTQs labeled with barcodes. The order of R1/R2 is not guaranteed, and R1/R2 markers are not present in the headers. It may be better to do something with the files from longranger mkfastq.

written 3.7 years ago by genomax
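A minimal sketch of pulling the BX:Z: barcode tag that longranger basic writes into the read header comments, and tallying reads per barcode (function names are hypothetical):

```python
from collections import Counter

def barcode_of(header):
    """Return the corrected barcode from a longranger basic read header
    (the BX:Z: field in the comment), or None for unbarcoded reads."""
    for field in header.split()[1:]:
        if field.startswith("BX:Z:"):
            return field[len("BX:Z:"):]
    return None

def reads_per_barcode(headers):
    """Tally reads per barcode, skipping unbarcoded reads."""
    counts = Counter()
    for header in headers:
        barcode = barcode_of(header)
        if barcode is not None:
            counts[barcode] += 1
    return counts
```

Since R1/R2 order is not guaranteed, grouping by barcode like this is safer than assuming strict pair interleaving.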

That's an interesting idea. I never checked longranger documentation since it's a different workflow. longranger mkfastq is basically a bcl2fastq wrapper and just complicates things if you are already familiar with bcl2fastq. longranger basic sounds promising, though. Even if it works as expected, trying to assemble the linked reads yourself is not a trivial undertaking.

written 3.7 years ago by igor

Did the server you were using meet their required specifications? Performing assembly on a full lane will take much, much more RAM and CPU than a fifth (100M reads). I've had good luck with their assembly software but it needs ample resources to do its job properly.

written 3.7 years ago by Dan D

I ran it with 16 threads and 512GB RAM, which is more than they require. Also, there would probably be a memory-related error if that was a problem.

Nice to hear that someone else got it to work, though.

modified 3.7 years ago • written 3.7 years ago by igor
> 16 threads and 512MB RAM

That must be 512GB :)

written 3.7 years ago by genomax

You're right. It's been a long week.

written 3.7 years ago by igor

Hi Igor and genomax,

I hope you can still see this message. I am just about to start using Supernova and am having a hard time with it. Can you please suggest articles, websites, or anything else that can help me use it? Or could you tell me the steps I should follow, since you seem to know how to work with it? Thank you so much in advance. Looking forward to your reply.

modified 9 months ago • written 9 months ago by samnioue

The official resource is here: https://support.10xgenomics.com/de-novo-assembly/software/pipelines/latest/using/running

written 9 months ago by Juke34

samnioue: Supernova can be tricky to get going on a cluster. Depending on the type of job scheduler your cluster uses, you will need to adjust a settings file. You may want to get help from your systems administrator for all of this, since you may not have the necessary rights to install software or change settings. Once the software is installed and configured, do the test run as suggested on the page I linked.

Take a look at the hardware requirements as well. You will need a node with a large amount of RAM; 512 GB would be preferable.

written 9 months ago by genomax

Thank you, genomax. Do you think it is better to do it without a cluster? Is there any other way to do it?

written 9 months ago by samnioue

I don't know. If you have a high-memory server available, you could try running on it. You can also try contacting 10x tech support; they are pretty responsive.

modified 9 months ago • written 9 months ago by genomax
2 votes
genomax (United States) wrote 3.5 years ago:

I was able to complete a Supernova run in less than 2 days for a human sample (from a cell line). I had about 650M reads from a HiSeq 4000. I used 12 cores on a node that had a lot of RAM. I did not watch the process continuously, but memory usage stayed below ~1 TB.

modified 3.5 years ago • written 3.5 years ago by genomax

Sounds reasonable. Happy to hear it's working as expected for someone.

The run I started in December with 16 threads and 512 GB memory with one full HiSeq 4000 lane of data is still going. I am sure there is something odd about the library.

modified 3.5 years ago • written 3.5 years ago by igor

My run needed more than 512 GB, so perhaps that is the problem you are facing. Is the job doing anything at all? (1+ month is odd.)

written 3.5 years ago by genomax

Still going. Only 200% CPU, but many processes.

I doubt it's the memory. When I tried it with lower limits, it complained fairly quickly.

written 3.5 years ago by igor

If it is of any help, I used the longranger wgs pipeline on high-memory nodes (LSF) with 200 GB RAM. The data was human WGS (40x coverage) sequenced on a HiSeq X Ten; run time was 15 days. Their Lariat aligner, which I guess is written in Python, is the most time-consuming step. And you are right: I tried using normal nodes, and it would die on me because of memory issues. In any case, the run time can be longer for de novo assembly.

written 3.1 years ago by badribio

I just ran the mkoutput. Took ~10 mins. We are going to collect more data for this sample so I will post an update to see if the assembly improves.

modified 3.5 years ago • written 3.5 years ago by genomax

In case anyone is wondering, my Supernova run finally finished. It took 6 weeks. Obviously something is wrong since the output is essentially useless (0.02 Gb assembly size), but at least I don't have to wonder anymore. This is the summary report:

--------------------------------------------------------------------------------
INPUT
-  450.81 M   = READS          = number of reads; ideal 800-1200 for human
-  139.00 b   = MEAN READ LEN  = mean read length after trimming; ideal 140
-   13.19 x   = EFFECTIVE COV  = effective read coverage; ideal ~42 for nominal 56x cov
-   58.48 %   = READ TWO Q30   = fraction of Q30 bases in read 2; ideal 75-85
-    0.35 kb  = MEDIAN INSERT  = median insert size; ideal 0.35-0.40
-   81.86 %   = PROPER PAIRS   = fraction of proper read pairs; ideal >=75
-   18.09 kb  = MOLECULE LEN   = weighted mean molecule size; ideal 50-100
-    0.00 kb  = HETDIST        = mean distance between heterozygous SNPs
-   11.94 %   = UNBAR          = fraction of reads that are not barcoded
-  360.00     = BARCODE N50    = N50 reads per barcode
-   31.63 %   = DUPS           = fraction of reads that are duplicates
-   21.96 %   = PHASED         = nonduplicate and phased reads; ideal 45-50
--------------------------------------------------------------------------------
OUTPUT
-    1.55 K   = LONG SCAFFOLDS = number of scaffolds >= 10 kb
-    0.50 kb  = EDGE N50       = N50 edge size
-   10.22 kb  = CONTIG N50     = N50 contig size
-    0.00 Mb  = PHASEBLOCK N50 = N50 phase block size
-    0.01 Mb  = SCAFFOLD N50   = N50 scaffold size
-    0.01 Mb  = SCAFFOLD N60   = N60 scaffold size
-    0.02 Gb  = ASSEMBLY SIZE  = assembly size (only scaffolds >= 10 kb)
--------------------------------------------------------------------------------
modified 3.5 years ago • written 3.5 years ago by igor
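For reference, the scaffold-level numbers in a report like the one above (ASSEMBLY SIZE and SCAFFOLD N50, both restricted to scaffolds >= 10 kb) can be recomputed from an output FASTA with a short sketch; Supernova's exact definitions may differ in detail:

```python
def assembly_stats(fasta_lines, min_len=10_000):
    """Return (assembly_size, n50) over scaffolds >= min_len, mirroring the
    ASSEMBLY SIZE and SCAFFOLD N50 lines of the summary report."""
    lengths, current = [], 0
    for line in fasta_lines:
        if line.startswith(">"):
            if current:
                lengths.append(current)
            current = 0
        else:
            current += len(line.strip())
    if current:
        lengths.append(current)
    lengths = sorted((n for n in lengths if n >= min_len), reverse=True)
    total = sum(lengths)
    running, n50 = 0, 0
    for n in lengths:
        running += n
        if running * 2 >= total:  # N50: length at which half the total is covered
            n50 = n
            break
    return total, n50
```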

Was this a human sample? One I had ended up at ~1.6 Gb.

written 3.5 years ago by genomax

No. It was a different species. Should be around 0.8 Gb.

So does it mean half the genome was missing for you? Is that normal?

modified 3.5 years ago • written 3.5 years ago by igor

My guess is that your heterozygosity level is above 1%, and we have seen that Supernova really doesn't like that. What does GenomeScope say about your reads? (Please use unbarcoded reads; otherwise GenomeScope will take the barcodes into the calculation.)

written 3.2 years ago by jvhaarst
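GenomeScope takes a k-mer multiplicity histogram as input (e.g. the output of jellyfish histo: one "multiplicity count" pair per line). A toy sketch of building one, for illustration only; real data needs a proper k-mer counter such as jellyfish or KMC:

```python
from collections import Counter

def kmer_histogram(seqs, k=21):
    """Count canonical k-mers and return {multiplicity: number of distinct
    k-mers}, i.e. the histogram GenomeScope reads. Toy implementation,
    suitable for small inputs only."""
    comp = str.maketrans("ACGT", "TGCA")
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            rc = kmer.translate(comp)[::-1]  # reverse complement
            counts[min(kmer, rc)] += 1      # canonical form
    return Counter(counts.values())
```

Writing the result as sorted "multiplicity count" lines produces a file GenomeScope can plot; as noted above, the barcode bases should be trimmed off first.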

Hello,

I just got the output recently, and I want to evaluate the quality of the assembly. I am working on a reptile genome, not human. Can you suggest some software or anything else?

Thank you!

written 7 months ago by samnioue
1 vote
from the mountains (United States) wrote 3.1 years ago:

They've just come out with a technical note on smaller assemblies: https://support.10xgenomics.com/de-novo-assembly/sample-prep/doc/technical-note-guidelines-for-de-novo-assembly-of-genomes-smaller-than-~3-gb-using-10x-genomics-supernova-v12

written 3.1 years ago by from the mountains

More importantly, they came out with a new version of the reagents (v2). We had multiple failed assemblies with v1, but all worked with v2.

written 3.1 years ago by igor

Powered by Biostar version 2.3.0