I have several 10X Genomics assemblies made with Supernova 2.01 from different subsets of reads. We have what should be the best assembly, with the ideal 42x coverage, but when I look at some genes, one of the read-subset assemblies has placed a contig for a gene where that contig is missing from the best assembly. Likewise for the mitochondrion, the assemblies have scaffolded different parts of it. I can't manually curate every place in the genome where I find this. It would be good if I could merge the two assemblies, for example if I had the below:
assembly1:
TTTTGAGAGAGANNNNNNAGAGTGAGNNNNNGGGAGAGAGAGNNNNNNNNNNNNNNN
assembly2:
TTTTGAGAGAGANNNNNNNNNNNNNNNNNNNGGGAGAGAGAGNNNNNNGGGAGAGAG
merge:
TTTTGAGAGAGANNNNNNAGAGTGAGNNNNNGGGAGAGAGAGNNNNNNGGGAGAGAG
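The toy merge above can be sketched in a few lines of Python. This is only an illustration under a strong assumption: the two scaffolds are already aligned base for base (same length, same coordinates), which real assemblies are not, so a real merge would first need an alignment step. The function name and the toy sequences are hypothetical.

```python
def merge_gap_filled(seq_a: str, seq_b: str, gap: str = "N") -> str:
    """Merge two coordinate-aligned sequences, filling gaps in one from the other."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (already aligned)")
    merged = []
    for pos, (a, b) in enumerate(zip(seq_a, seq_b)):
        if a == b or b == gap:
            # Same base in both, or only seq_a has sequence here.
            merged.append(a)
        elif a == gap:
            # Only seq_b has sequence here.
            merged.append(b)
        else:
            # Both have sequence but disagree: do not guess.
            raise ValueError(f"conflicting bases at position {pos}: {a!r} vs {b!r}")
    return "".join(merged)


# Hypothetical toy scaffolds, not real data.
scaffold_1 = "ACGTNNNNGGCCNNNN"
scaffold_2 = "ACGTTTAANNNNNCAT"
print(merge_gap_filled(scaffold_1, scaffold_2))  # prints ACGTTTAAGGCCNCAT
```

Raising on conflicting bases rather than picking one is deliberate: where two assemblies disagree on actual sequence, something other than gap filling is needed to decide.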
I was going to try quickmerge, but I'm not sure it will work without overlaps, as my case is more about contigs being scaffolded differently and merging those differences. Any ideas, or is it just a case of picking the best one, even though different parts of the genes are scaffolded in different assemblies?
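Example of exon presence for curated genes in each assembly (columns are the number of reads used, in millions). 1 means all exons are present on one contig, 2 means they are split across two contigs, and so on; the part in brackets, e.g. (e4/5), shows which exons were on different contigs.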
gene      705.01M     627.03M     560.03M     total exons  note
NA        1           1           1           27
270520    1           1           1           10
151       2(e4/5)     2(e4/5)     3(e6/7)     9
254       1           1           1           8
271317    3(e2/3)     1           1           5
270256    2(e4)       2(e1)       1           10
269873    1           1           1           9            scaffolded together on 705 but not on 627 or 560
269776    1           1           1           10           scaffolded together on 705 but not on 627 or 560
1041      1           1           1           4
936       3(e2/5-8)   3(e2/5-8)   4(e1-2/8)   8
259       2(e5)       2(e1)       1           9
239       2(e1)       2(e1)       1           5
176       1           1           1           9
168       1           2(e2-4)     2(e7-8)     8
256       2(e2)       1           2(e1)       9
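To make the table easier to compare, here is a small sketch that tallies how many of these curated genes sit on a single contig in each assembly, and how many would be on a single contig if the best assembly could be picked per gene. The contig counts are copied from the table; the variable and function names are made up for illustration, and this says nothing about the rest of the genome.

```python
# Contig counts per gene, copied from the table above.
# Column order: assemblies built from 705.01M, 627.03M and 560.03M reads.
ASSEMBLIES = ("705.01", "627.03", "560.03")
CONTIGS_PER_GENE = {
    "NA": (1, 1, 1),
    "270520": (1, 1, 1),
    "151": (2, 2, 3),
    "254": (1, 1, 1),
    "271317": (3, 1, 1),
    "270256": (2, 2, 1),
    "269873": (1, 1, 1),
    "269776": (1, 1, 1),
    "1041": (1, 1, 1),
    "936": (3, 3, 4),
    "259": (2, 2, 1),
    "239": (2, 2, 1),
    "176": (1, 1, 1),
    "168": (1, 2, 2),
    "256": (2, 1, 2),
}


def single_contig_counts(table):
    """Number of genes with all exons on one contig, per assembly."""
    return {
        name: sum(1 for counts in table.values() if counts[i] == 1)
        for i, name in enumerate(ASSEMBLIES)
    }


def single_in_any(table):
    """Genes that are on a single contig in at least one assembly."""
    return [gene for gene, counts in table.items() if min(counts) == 1]


def never_single(table):
    """Genes that are split across contigs in every assembly."""
    return [gene for gene, counts in table.items() if min(counts) > 1]


print(single_contig_counts(CONTIGS_PER_GENE))
print(len(single_in_any(CONTIGS_PER_GENE)), "of", len(CONTIGS_PER_GENE))
print(never_single(CONTIGS_PER_GENE))
```

On this table the tally is 8, 9 and 11 single-contig genes for the 705M, 627M and 560M assemblies, and 13 of 15 when the best assembly is taken per gene; only genes 151 and 936 are split in all three.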
What happens if you use all the reads instead of different subsets?
Going by the 10X stats (which don't match a Perl script I have in terms of genome size etc., but ignoring that), the assembly with all the reads is better in that it assembles more sequence, but consequently the N50 goes down. However, when I go beyond the numbers and look at genes, there is some variation in scaffolding: there is a lot of overlap between assemblies, but there are differences where combining the two assemblies would give better scaffolding. Using all the reads, the few genes I am looking at get more fragmented. My longest scaffold for the 627M-read assembly is 31 Mbp, so I don't understand why 10X reports differently.
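For reference, a minimal N50 sketch in Python. One common reason two tools report different assembly statistics is that they filter scaffolds by a minimum length before computing them; whether that explains the mismatch here is an assumption, since neither the Supernova convention nor the Perl script's is stated in this thread. The min_length parameter and the toy lengths are hypothetical.

```python
def n50(lengths, min_length=0):
    """Return the N50 of the scaffolds that are at least min_length long.

    N50 is the length L such that scaffolds of length >= L together
    contain at least half of the total bases considered.
    """
    ordered = sorted((n for n in lengths if n >= min_length), reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no scaffold lengths left after filtering")


# Hypothetical toy assembly: three long scaffolds plus many short ones.
toy_lengths = [100, 60, 40] + [10] * 20
print(n50(toy_lengths))                 # prints 40
print(n50(toy_lengths, min_length=20))  # prints 100
```

The same set of scaffolds gives a very different N50 depending on the cutoff, so it is worth checking that both tools are summarising the same set of sequences before comparing numbers.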
My assemblies
I didn't try it in the end, as I got some long reads to do a hybrid assembly anyway. I was surprised that there were some SNPs in the 10X assemblies that were mistakes too.