Question

Merging 10X Genomics assemblies

6.3 years ago

rob234king ▴ 610

I have several 10X Genomics assemblies made with Supernova 2.01 from different subsets of reads. One of them should be the best, at the ideal x42 effective coverage, but when I look at individual genes, one of the subset-read assemblies sometimes places a contig for a gene that the best assembly missed. Likewise, different parts of the mitochondrial genome are scaffolded in different assemblies. I can't manually curate every place across the genome where I find this. It would be good if I could merge two assemblies, for example:

assembly1:
TTTTGAGAGAGANNNNNNAGAGTGAGNNNNNGGGAGAGAGAGNNNNNNNNNNNNNNN
assembly2:
TTTTGAGAGAGANNNNNNNNNNNNNNNNNNNGGGAGAGAGAGNNNNNNGGGAGAGAG

merge
TTTTGAGAGAGANNNNNNAGAGTGAGNNNNNGGGAGAGAGAGNNNNNNGGGAGAGAG
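
The merge shown above amounts to filling N positions in one assembly with bases from the other. A minimal Python sketch of that idea follows. It assumes the two sequences are already in the same coordinate system (equal length, position by position), which real scaffolds from separate Supernova runs will not be without an alignment step first. The function name `merge_gapped` and the toy sequences are made up for illustration.

```python
def merge_gapped(primary: str, secondary: str, gap: str = "N"):
    """Fill gap characters in `primary` with bases from `secondary`.

    Assumes both sequences are already aligned to the same coordinates
    (equal length). Returns the merged sequence plus the positions where
    the two sequences disagree on a non-gap base; the primary base is
    kept at those positions.
    """
    if len(primary) != len(secondary):
        raise ValueError("sequences must be aligned to the same length")
    merged = []
    conflicts = []
    for i, (a, b) in enumerate(zip(primary.upper(), secondary.upper())):
        if a == gap:
            # Gap in the primary: take whatever the secondary has
            # (this is still a gap if both are N).
            merged.append(b)
        else:
            merged.append(a)
            if b != gap and b != a:
                conflicts.append(i)
    return "".join(merged), conflicts


if __name__ == "__main__":
    # Toy sequences, not taken from any real assembly.
    seq, conflicts = merge_gapped("ACGTNNNNGGCCNNNN", "ACGTNNTTGGNNNNAA")
    print(seq, conflicts)
```

The conflict list matters because the two assemblies can disagree on real bases as well as on gaps, and a merge that silently picks one side hides that.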

I was going to try quickmerge, but I'm not sure it will work without overlaps, since this is more about contigs being scaffolded differently and merging those differences. Any ideas, or is it just a case of picking the best assembly, even though different parts of genes are scaffolded in different assemblies?

Example of exon presence for curated genes, per assembly (columns are millions of reads used). 1 means all exons are on one contig, 2 means they are split across two contigs, and the (e...) part gives the exons that ended up on different contigs:

Gene / Reads (M)  705.01     627.03     560.03     Total exons  Note
NA                1          1          1          27
270520            1          1          1          10
151               2(e4/5)    2(e4/5)    3(e6/7)    9
254               1          1          1          8
271317            3(e2/3)    1          1          5
270256            2(e4)      2(e1)      1          10
269873            1          1          1          9            scaffolded together on 705 but not on 627 or 560
269776            1          1          1          10           scaffolded together on 705 but not on 627 or 560
1041              1          1          1          4
936               3(e2/5-8)  3(e2/5-8)  4(e1-2/8)  8
259               2(e5)      2(e1)      1          9
239               2(e1)      2(e1)      1          5
176               1          1          1          9
168               1          2(e2-4)    2(e7-8)    8
256               2(e2)      1          2(e1)      9
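
To compare assemblies gene by gene without reading the table by eye, the entries can be parsed and the least fragmented assembly picked per gene. This is only a sketch: the helper names `parse_entry` and `least_fragmented` are hypothetical, and it assumes every entry follows the `N` or `N(e...)` notation described above.

```python
import re

# An entry is a contig count, optionally followed by "(e...)" naming the
# exons that landed on different contigs, e.g. "1" or "3(e2/5-8)".
ENTRY = re.compile(r"^(\d+)(?:\((e[^)]*)\))?$")


def parse_entry(entry: str):
    """Return (contig_count, split_exons or None) for one table cell."""
    match = ENTRY.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognised entry: {entry!r}")
    return int(match.group(1)), match.group(2)


def least_fragmented(row: dict):
    """Given {assembly_label: entry}, return the labels with fewest contigs."""
    counts = {label: parse_entry(cell)[0] for label, cell in row.items()}
    best = min(counts.values())
    return [label for label, n in counts.items() if n == best]


if __name__ == "__main__":
    # Two rows copied from the table above (genes 271317 and 151).
    gene_271317 = {"705.01": "3(e2/3)", "627.03": "1", "560.03": "1"}
    gene_151 = {"705.01": "2(e4/5)", "627.03": "2(e4/5)", "560.03": "3(e6/7)"}
    print(least_fragmented(gene_271317))
    print(least_fragmented(gene_151))
```

Counting how often each assembly comes out on top gives a rough idea of whether one assembly is consistently best or whether the wins are spread out, which is the situation where merging looks tempting.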

10X • 2.3k views

updated 5.6 years ago by harish ▴ 470 • written 6.3 years ago by rob234king ▴ 610

What happens if you use all the reads instead of different subsets?

6.3 years ago by igor 13k

The 10X stats don't match those from a Perl script I have (genome size etc.), but ignoring that, the all-reads assembly is better in that it assembles more sequence, although the N50 goes down as a result. Looking beyond the numbers at individual genes, though, the scaffolding varies: the assemblies overlap a lot, but there are differences where combining two assemblies would give better scaffolding. With all the reads, the few genes I am looking at get more fragmented. My longest scaffold for the 627M-read assembly is 31 Mbp, so I don't understand why 10X reports it differently.

My assemblies

Reads (M)                 742.96      705.01      627.03      560.03      493.03
Raw cov (ideal 56)        83.73       79.32       70.82       62.83       55.63
Effective cov (ideal 42)  45.06       43.23       39.64       36.02       32.7
Est genome size           1.33        1.33        1.33        1.34        1.33
Repetitive                22.13       22.17       22.23       22.35       22.44
AT (%)                    0.38        0.38        0.39        0.39        0.39
Mol length                60.71       60.71       61.93       61.32       63.81
P10                       186.03      180.56      177.5       176.06      174.24
Hetdist                   121         123         120         129         130
DUPs (%)                  38.57       37.81       36.16       34.63       32.99
Phased (%)                38.28       38.86       39.92       40.52       40.7
Longest scaffold          16.93       16.4        15.05       14.28       13.58
Contig N50                33.53       33.88       34.95       35.23       34.15
Phase block (Mbp)         1.49        1.5         1.69        1.56        1.33
Scaffold N50 (Kb)         703.79      717.72      933.61      852.17      708.68
Missing (%)               19.28       19.44       20          21.49       25.23
Assembly size             1.02 Gbp    1.03 Gbp    994.12 Mbp  959.02 Mbp  890.42 Mbp
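
When two sets of assembly stats disagree, recomputing N50 and the longest scaffold directly from the FASTA with an explicit length cutoff can help narrow down why. The sketch below is plain Python with hypothetical function names. The `min_length` parameter is there because differing minimum-scaffold-length cutoffs are one common reason for reports to disagree; whether that explains the mismatch in this thread is an assumption, not something established here.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths, min_length=0):
    """N50 over sequences of at least `min_length` bases (0 if none remain).

    N50 is the length L such that sequences of length >= L together
    cover at least half of the total bases considered.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        return 0
    half = sum(kept) / 2
    running = 0
    for n in kept:
        running += n
        if running >= half:
            return n


if __name__ == "__main__":
    # Toy lengths: one long scaffold plus many short ones.
    toy = [10] + [2] * 10
    print(n50(toy), n50(toy, min_length=5), max(toy))
```

With the toy lengths, the N50 changes depending on whether the short sequences are counted, which is the kind of effect worth ruling out before comparing numbers from two different tools.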

6.3 years ago by rob234king ▴ 610

I didn't try it in the end, as I got some long reads to do a hybrid assembly anyway. I was surprised that some SNPs in the 10X assemblies were mistakes too.

6.0 years ago by rob234king ▴ 610

score 0 · Answer 1 · 2018-07-18

6.3 years ago

rob234king ▴ 610

I will give CAMSA a go, as it looks like it might do what I want, and I will post an update. If you have any other ideas, please let me know. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1919-y

6.3 years ago by rob234king ▴ 610

How did you get on with CAMSA? I'm in the same boat and looking around for tools written to do this job (I could knock something up myself but why re-invent the wheel, right?)

6.0 years ago by maxwhjohn1988 ▴ 130

score 0 · Answer 2 · 2019-03-20

5.6 years ago

harish ▴ 470

Why not try something like Meta-assembler/GARM etc?

But generally speaking, it's not advisable to merge scaffold-level assemblies or assemblies that contain ambiguities. The ambiguities get propagated in a compounding fashion and may cause issues downstream.

5.6 years ago by harish ▴ 470