10x Supernova de novo assembly
2
2
Entering edit mode
5.5 years ago
igor 12k

I am working on my first 10x Genomics de novo assembly. For those who may not be familiar with it, 10x adds barcodes to Illumina short reads derived from the same long fragments; the barcodes are then used to reconstruct long "linked" reads. Since these are non-standard "linked" reads and the technology is fairly new, I don't think there are any tools available for processing the data other than the official Supernova software. Has anyone had luck with alternate assemblers for 10x data?

The reason I am asking is because Supernova is essentially failing for me. I used one HiSeq lane of input, which is fairly reasonable. Based on my discussions with 10x, that amount of data should take about a week to process. I had it going for 4 weeks and it did not finish. Then, the server had to be restarted, so the job was killed. Even half a lane did not finish after several weeks. If I use a small subset of data (like 10M or 100M reads), the process finishes. The results are terrible, but at least it shows there aren't any problems with the dependencies or the environment. I was wondering if anyone here had run into any similar problems and if they had any solutions. Normally, if a certain tool has issues, you can try a different one, but that does not seem to be the case here.
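For anyone wanting to reproduce that kind of subset test, here is a minimal sketch using only coreutils (file names are hypothetical). A FASTQ record is 4 lines, so the first N reads of each file are N*4 lines:

```shell
# Minimal sketch -- file names are placeholders.
# One FASTQ record = 4 lines, so N reads = N*4 lines per file.
# Taking the same number of lines from R1 and R2 keeps pairs in sync.
READS=10000000   # 10M read pairs
zcat sample_R1.fastq.gz | head -n $((READS * 4)) | gzip > subset_R1.fastq.gz
zcat sample_R2.fastq.gz | head -n $((READS * 4)) | gzip > subset_R2.fastq.gz
```

Note that head-based subsetting is positional (it keeps flowcell order), which is fine for a smoke test; random subsampling (e.g. seqtk sample with a fixed seed on both files) is better if you care about the resulting assembly quality. Supernova also has a --maxreads option (per the 10x docs) that subsamples internally, which avoids manual subsetting.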

Assembly 10x • 6.1k views
0
Entering edit mode

Have you tried other long-read assemblers, like Celera, or tools intended for PacBio?

0
Entering edit mode

I think the issue is getting hold of the GEM-specific pools of reads. I am still familiarizing myself with 10x data.

0
Entering edit mode

Exactly. If I could get to long reads, that would be fantastic. There are basically two steps: converting short reads to "long" reads and then assembling those reads. Unfortunately, those two steps are combined in their software and there is no way to just do one or the other.

0
Entering edit mode

Oh, I see. Well, good luck :) If I had access to some 10x data, I might be able to write something that converted it to long reads via Tadpole, but I've never seen any 10x data.

0
Entering edit mode

They post some examples here if you want to check: http://support.10xgenomics.com/de-novo-assembly/datasets

I think the biggest problem is that the whole process is not described in detail (for example, the barcode sequences aren't published as far as I know), so it would probably require some reverse engineering.

0
Entering edit mode

Have you tried longranger mkfastq followed by longranger basic? That yields a single interleaved FASTQ file labeled with barcodes. The order of R1/R2 is not guaranteed, and the R1/R2 markers are not present in the headers. It may be better to work with the files from longranger mkfastq instead.
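As a sketch of that workflow (run folder, samplesheet, IDs, and output paths are placeholders; the barcoded.fastq.gz output name and BX:Z: header tag are as I recall from the longranger docs, so verify against your version): the barcode ends up in the read header, from which a per-GEM pool of reads can be pulled.

```shell
# Sketch only -- run folder, samplesheet, and IDs are placeholders.
longranger mkfastq --run=/path/to/bcl_run --csv=samplesheet.csv
longranger basic --id=sample --fastqs=/path/to/fastq_dir

# longranger basic emits interleaved reads with the barcode in the
# header, e.g.:  @READNAME BX:Z:ACGTACGTACGTACGT-1
# One GEM's pool of reads can then be extracted by its barcode
# (assumes the BX tag string only ever occurs in header lines):
zcat sample/outs/barcoded.fastq.gz \
  | grep -A3 --no-group-separator 'BX:Z:ACGTACGTACGTACGT-1' \
  > one_gem.fastq
```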

0
Entering edit mode

That's an interesting idea. I never checked the longranger documentation since it's a different workflow. longranger mkfastq is basically a bcl2fastq wrapper and just complicates things if you are already familiar with bcl2fastq. longranger basic sounds promising, though. Even if it works as expected, assembling the linked reads yourself is not a trivial undertaking.

0
Entering edit mode

Did the server you were using meet their required specifications? Performing assembly on a full lane will take much, much more RAM and CPU than a fifth (100M reads). I've had good luck with their assembly software but it needs ample resources to do its job properly.

0
Entering edit mode

I ran it with 16 threads and 512MB RAM, which is more than they require. Also, there would probably be a memory-related error if that was a problem.

Nice to hear that someone else got it to work, though.

1
Entering edit mode

That must be 512GB :)

0
Entering edit mode

You're right. It's been a long week.

0
Entering edit mode

Hi Igor and genomax,

I hope you can still see this message. I am just starting to use Supernova and am having a hard time with it. Can you please suggest articles, websites, or anything else that can help? Or, if you can, tell me the steps I should follow, since you seem to know how to work with it. Thank you so much in advance. Looking forward to your reply.

0
Entering edit mode

samnioue : Supernova can be tricky to get going on a cluster. Depending on the type of job scheduler your cluster uses, you will need to adjust a settings file. You may want to get help from your systems administrator for all of this, since you may not have the necessary rights to install things or change settings. Once the software is installed and configured, do the test run as suggested on the page I linked.

Take a look at the hardware requirements as well. You will need a node with a large amount of RAM; 512GB would be preferable.
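For orientation, a hedged sketch of the basic invocation (IDs and paths are placeholders; verify the flags against your Supernova version's documentation):

```shell
# Sketch -- IDs/paths are placeholders; check the current Supernova
# docs for your version's exact flags.
# 1) Smoke-test the installation on the bundled tiny dataset:
supernova testrun --id=tiny
# 2) Real assembly on a high-memory node; --localmem is in GB:
supernova run \
    --id=my_assembly \
    --fastqs=/path/to/fastqs \
    --localcores=16 \
    --localmem=512
```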

0
Entering edit mode

Thank you, genomax. Do you think it is better to do it without a cluster? Are there any other ways to do it?

0
Entering edit mode

I don't know. If you have a high-memory server available, you could try running it there. You can also try contacting 10x tech support; they are pretty responsive.

2
Entering edit mode
5.3 years ago
GenoMax 115k

I was able to complete a Supernova run in less than 2 days for a human sample (from a cell line). I had about 650M reads from a HiSeq 4000. I used 12 cores on a node that had a lot of RAM. I did not watch the process continuously, but memory usage stayed below ~1TB.

0
Entering edit mode

Sounds reasonable. Happy to hear it's working as expected for someone.

The run I started in December, with 16 threads and 512GB memory on one full HiSeq 4000 lane of data, is still going. I am sure there is something odd about the library.

0
Entering edit mode

It needed more than 512GB, so perhaps that is the problem you are facing. Is the job doing anything at all (1+ month is odd)?

0
Entering edit mode

Still going. Only 200% CPU, but many processes.

I doubt it's the memory. When I tried it with lower limits, it complained fairly quickly.

1
Entering edit mode

If it is of any help: I used the longranger wgs pipeline on high-memory nodes (LSF) with 200GB RAM. The data was human WGS (40x coverage) from a HiSeq X Ten; the run time was 15 days. Their Lariat aligner is the most time-consuming step. And you are right: I tried using normal nodes, and the job would die on me because of memory issues. In any case, the run time can be even longer for de novo assembly.

0
Entering edit mode

I just ran mkoutput. It took ~10 mins. We are going to collect more data for this sample, so I will post an update to see if the assembly improves.
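For reference, a sketch of that step (the asmdir path and output prefix are placeholders; check the Supernova docs for the --style options in your version):

```shell
# Sketch -- asmdir path and outprefix are placeholders.
# --style controls the FASTA representation: raw, megabubbles,
# pseudohap, or pseudohap2 (two phased pseudohaplotypes).
supernova mkoutput \
    --style=pseudohap \
    --asmdir=my_assembly/outs/assembly \
    --outprefix=my_assembly_pseudohap
```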

0
Entering edit mode

In case anyone is wondering, my Supernova run finally finished. It took 6 weeks. Obviously something is wrong since the output is essentially useless (0.02 Gb assembly size), but at least I don't have to wonder anymore. This is the summary report:

--------------------------------------------------------------------------------
INPUT
-  450.81 M   = READS          = number of reads; ideal 800-1200M for human
-  139.00 b   = MEAN READ LEN  = mean read length after trimming; ideal 140
-   13.19 x   = EFFECTIVE COV  = effective read coverage; ideal ~42 for nominal 56x cov
-   58.48 %   = READ TWO Q30   = fraction of Q30 bases in read 2; ideal 75-85
-    0.35 kb  = MEDIAN INSERT  = median insert size; ideal 0.35-0.40
-   81.86 %   = PROPER PAIRS   = fraction of proper read pairs; ideal >=75
-   18.09 kb  = MOLECULE LEN   = weighted mean molecule size; ideal 50-100
-    0.00 kb  = HETDIST        = mean distance between heterozygous SNPs
-   11.94 %   = UNBAR          = fraction of reads that are not barcoded
-  360.00     = BARCODE N50    = N50 reads per barcode
-   31.63 %   = DUPS           = fraction of reads that are duplicates
-   21.96 %   = PHASED         = nonduplicate and phased reads; ideal 45-50
--------------------------------------------------------------------------------
OUTPUT
-    1.55 K   = LONG SCAFFOLDS = number of scaffolds >= 10 kb
-    0.50 kb  = EDGE N50       = N50 edge size
-   10.22 kb  = CONTIG N50     = N50 contig size
-    0.00 Mb  = PHASEBLOCK N50 = N50 phase block size
-    0.01 Mb  = SCAFFOLD N50   = N50 scaffold size
-    0.01 Mb  = SCAFFOLD N60   = N60 scaffold size
-    0.02 Gb  = ASSEMBLY SIZE  = assembly size (only scaffolds >= 10 kb)
--------------------------------------------------------------------------------

0
Entering edit mode

No. It was a different species. Should be around 0.8 Gb.

So does it mean half the genome was missing for you? Is that normal?

0
Entering edit mode

My guess is that your heterozygosity level is above 1%, and we have seen that Supernova really doesn't like that. What does GenomeScope say about your reads? (Please use unbarcoded reads; otherwise GenomeScope will include the barcodes in the calculation.)
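A sketch of that check, assuming raw Chromium reads where the first 16 bp of R1 are the barcode (file names, thread counts, and k=21 are placeholders): trim the barcode off R1, build a k-mer histogram with jellyfish, and feed the histogram to GenomeScope.

```shell
# Sketch -- file names are placeholders; the 16 bp barcode sits at
# the start of R1 in raw Chromium data and must not be counted.
zcat sample_R1.fastq.gz \
  | awk 'NR % 2 == 1 { print }                 # header and "+" lines as-is
         NR % 2 == 0 { print substr($0, 17) }  # trim first 16 bp of seq/qual' \
  > R1_trimmed.fastq

# k-mer histogram (k=21 is a common choice for GenomeScope):
jellyfish count -C -m 21 -s 1G -t 8 -o reads.jf \
    R1_trimmed.fastq <(zcat sample_R2.fastq.gz)
jellyfish histo -t 8 reads.jf > reads.histo
# Upload reads.histo to the GenomeScope web app (or run genomescope.R).
```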

0
Entering edit mode

Hello,

I just got the output recently and want to evaluate the quality of the assembly. I am working on a reptile genome, not human. Can you suggest some software or anything else?

Thank you!

2
Entering edit mode

More importantly, they came out with a new version of the reagents (v2). We had multiple failed assemblies with v1, but all worked with v2.