Large Variation In Base Space Vs Color Space Assembly
11.5 years ago
lin.barnum ▴ 230

I used the pipeline described in "Solid Software Tools: DeNovo Assembly/XSQ Tools pipeline mirrored at BioStar" to perform a SOLiD assembly. It ran successfully. However, the statistics of the nucleotide-space assembly and the double-encoded colorspace assembly differ significantly. What is the reason?

contigs$ cat n50.stats.txt 

perc A               : 31
perc C               : 22
perc G               : 20
perc T               : 24
perc N               :  0
Sum contig length    : 182066280
Num contigs          : 1204729
Mean contig length   : 151
Median contig length : 128
N50 value            : 154
Max                  : 5517

nt_contigs$ cat n50.stats.txt 

perc A               : 55
perc C               :  0
perc G               :  0
perc T               : 44
perc N               :  0
Sum contig length    : 199569293
Num contigs          : 1204729
Mean contig length   : 165
Median contig length : 140
N50 value            : 166
Max                  : 5013

scaffolds$ cat n50.stats.txt 

perc A               : 10
perc C               :  7
perc G               :  6
perc T               :  7
perc N               : 67
Sum contig length    : 563660388
Num contigs          : 855887
Mean contig length   : 658
Median contig length : 140
N50 value            : 3997
Max                  : 74154

nt_scaffolds$ cat n50.stats.txt 

perc A               : 55
perc C               :  0
perc G               :  0
perc T               : 44
perc N               :  0
Sum contig length    : 200084049
Num contigs          : 855887
Mean contig length   : 233
Median contig length : 146
N50 value            : 242
Max                  : 18952

The N50 value for the scaffolds is far off (3997 in scaffolds versus 242 in nt_scaffolds). Also, the C and G percentages in nt_contigs and nt_scaffolds are both zero, which is odd.
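One thing that may be worth checking, given 67% N in scaffolds versus 0% in nt_scaffolds, is whether gap padding is counted in the lengths. Below is a minimal sketch, assuming the usual N50 definition (the script that wrote n50.stats.txt is not shown, so it may compute things differently). The function name `n50` and the toy scaffolds are hypothetical and only illustrate how runs of N can inflate the value.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty input


# Hypothetical toy scaffolds; the run of N stands for a gap between two contigs.
scaffolds = [
    "ACGT" * 5 + "N" * 60 + "ACGT" * 5,
    "ACGT" * 6,
    "ACGT" * 4,
    "ACGT" * 3,
]

with_gaps = n50([len(s) for s in scaffolds])
without_gaps = n50([len(s.replace("N", "")) for s in scaffolds])
print(with_gaps, without_gaps)  # 100 24
```

In this toy case the N50 drops from 100 to 24 once the N padding is removed, so the two scaffold files may not be directly comparable if only one of them keeps its gaps.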

solid assembly velvet • 2.6k views
11.5 years ago

Remember that the double-encoded colorspace is a redundant representation.

Two entirely different-looking double-encoded sequences could represent identical base space sequences.

That being said, getting a zero percentage for G and C bases does look like something went wrong, unless you have reason to expect that.


While that is true, why are the max length and the other statistics different? Shouldn't the colorspace sequence be just 1 shorter than the basespace sequence?


Once you convert to colorspace, the sequences change altogether; two different-looking sequences may convert to the same sequence. See also: Transforming and manipulating color space reads


Another question here. How could two different colorspace sequences represent the same basespace sequence? I can see that each colorspace sequence could represent 4 different basespace sequences. A toy example would be useful here. Thanks.
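A toy example along the lines requested here, as a minimal sketch rather than what the pipeline itself runs. It assumes the standard SOLiD two-base encoding (each color is the XOR of the 2-bit codes of adjacent bases, with A=0, C=1, G=2, T=3) and the usual double encoding of colors 0/1/2/3 as A/C/G/T. The function names are made up for illustration.

```python
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = "ACGT"


def encode(bases):
    """Base space -> list of colors; one color per adjacent base pair."""
    codes = [BASE_TO_CODE[b] for b in bases]
    return [x ^ y for x, y in zip(codes, codes[1:])]


def double_encode(colors):
    """Write colors 0,1,2,3 as the pseudo-bases A,C,G,T."""
    return "".join(CODE_TO_BASE[c] for c in colors)


def decode(colors, first_base):
    """Colors plus an assumed first base -> base space."""
    code = BASE_TO_CODE[first_base]
    out = [first_base]
    for color in colors:
        code ^= color  # the next base is the previous base XOR the color
        out.append(CODE_TO_BASE[code])
    return "".join(out)


seq = "ATGGCA"
colors = encode(seq)
print(colors, double_encode(colors))  # [3, 1, 0, 3, 1] TCATC

# The same colors decoded with each possible first base.
for start in "ACGT":
    print(start, decode(colors, start))
# A ATGGCA
# C CGTTAC
# G GCAATG
# T TACCGT
```

In this sketch the 6-base sequence gives 5 colors, the double-encoded string TCATC is not a real nucleotide sequence (so its base composition says nothing about GC content), and one color string decodes to four base-space sequences depending on the assumed first base. Encoding in the other direction is deterministic: a given base sequence always produces the same colors.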
