Convert SOLiD color space to base space
8.0 years ago
hellbio ▴ 490

Hi all,

I have short reads from the SOLiD 5500xl sequencing platform. The reads are in '.xsq' format. I used the XSQ Tools from Life Technologies (http://www.lifetechnologies.com/fi/en/home/technical-resources/software-downloads/xsq-software.html) to convert the .xsq file to .csfasta and .qual files, as shown below:

xsqconvert -c FRAG_BC_01_Can19.xsq

which results in

FRAG_BC_01_Can19_F3.csfasta and FRAG_BC_01_Can19_F3.qual files.

Then I used the 'qualfa2fq.pl' script from bwa to convert them to FASTQ format, as shown below:

qualfa2fq.pl FRAG_BC_06_Can19_F3.csfasta FRAG_BC_06_Can19_F3.QV.qual

The FASTQ file still has the reads in color-space format. My goal is to detect SNPs by aligning to a reference genome, and for that I need the data in base-space format. Could someone help me do this?

Any help is highly valuable.

8.0 years ago

You cannot accurately convert color space to base space without alignment. Each base is decoded from the previous base and the current color, so a single color error makes all subsequent bases incorrect, and SOLiD reads have lots of errors. So you need to use a color-space-capable aligner to align to a color-space-indexed genome, and then convert the aligned reads to base space before calling variants. Bowtie 1 can do color-space alignment with the -C flag. BBMap and bwa both used to support color-space mapping, but no longer do.
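To illustrate why naive decoding is fragile, here is a minimal Python sketch of the standard two-base color code (with A, C, G, T numbered 0 to 3, the color is the XOR of adjacent bases). The function name and the toy reads are made up for illustration, and missing colors ('.') are not handled.

```python
BASES = "ACGT"  # A=0, C=1, G=2, T=3; color = XOR of the two adjacent base codes


def decode_colorspace(read: str) -> str:
    """Naively decode a csfasta-style read: one primer base followed by colors 0-3."""
    prev = BASES.index(read[0])   # start from the known primer base
    decoded = []
    for color in read[1:]:
        prev ^= int(color)        # next base = previous base XOR color
        decoded.append(BASES[prev])
    return "".join(decoded)


clean = "T3130121"   # toy read with no errors
noisy = "T3100121"   # same read with the third color miscalled (3 -> 0)

print(decode_colorspace(clean))  # ACGGTCA
print(decode_colorspace(noisy))  # ACCCAGT
```

The two decoded reads agree only up to the miscalled color; every base after it is wrong. An aligner working in color space sees just one mismatching color instead.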

Thank you. Quality control to filter low-quality bases before alignment seems like a sensible step for SNP detection. Could you suggest thresholds and tools for this? Also, could you suggest tools to convert aligned reads from color space to base space?

I wrote a tool to convert mapped colorspace reads to base-space, but I'm not sure if it works anymore. I'll look into it. Bioscope should be able to do it, of course, but it's really crappy.

I DO NOT recommend quality-trimming SOLiD reads because, unlike Illumina reads, the quality profile varies with the position modulo 5 rather than with the raw position. Thus, low-quality bases are scattered throughout the read, and trimming the ends is not effective.
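If you want to check this pattern on your own data, a rough Python sketch like the one below averages quality values by position modulo 5. The function name and the toy quality values are hypothetical; in practice you would feed it the integer lists parsed from your .qual file.

```python
def mean_quality_by_phase(reads_quals, period=5):
    """Average quality for each value of (position mod period) across all reads."""
    sums = [0] * period
    counts = [0] * period
    for quals in reads_quals:
        for pos, q in enumerate(quals):
            sums[pos % period] += q
            counts[pos % period] += 1
    # NaN marks a phase that never occurred (e.g. reads shorter than the period)
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Toy example: one made-up read whose quality drops at every fifth position
toy = [[30, 28, 25, 20, 10, 30, 28, 25, 20, 10]]
print(mean_quality_by_phase(toy))
```

If the averages differ strongly between phases, low-quality calls are spread along the whole read, which is why end-trimming helps little.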

You can use the solid-trimmer.py tool from brentp: https://github.com/brentp/bio-playground/blob/master/solidstuff/solid-trimmer.py. I have used it a lot for my research.

You can also use SHRiMP2, a color-space read aligner. I have used it extensively for aligning csfastq or csfasta/qual files. Once you have the aligned SAM/BAM file, you can use any variant caller that takes BAM files.

Thank you. Is it necessary to filter low-quality bases before aligning with SHRiMP2? If so, what threshold would you suggest? I normally use Q20 for Illumina; is it the same for SOLiD reads? Finally, is the BAM from SHRiMP2 in color space or base space?

You can use Q15 or Q20 as a threshold. The BAM file will contain both base-space and color-space sequences. The base-space sequence is in the 10th column (SEQ field), and the color-space sequence is stored among the optional tags in the last columns. This BAM file can then be used with almost all of the tools that work with Illumina BAM files.
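As a quick way to see both sequences, here is a small Python sketch that pulls SEQ and the color-space tag out of one SAM line. It assumes the aligner writes the color sequence under the `CS:Z:` tag defined in the SAM specification; check your own output to confirm. The function name and the SAM record are made up for illustration.

```python
def parse_sam_record(line: str):
    """Return (base-space SEQ, color-space CS tag or None) from one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    seq = fields[9]                      # 10th column: SEQ (base space)
    tags = {}
    for item in fields[11:]:             # optional fields look like TAG:TYPE:VALUE
        tag, _type, value = item.split(":", 2)
        tags[tag] = value
    return seq, tags.get("CS")


# Made-up record, only to show the layout
record = "read1\t0\tchr1\t100\t255\t7M\t*\t0\t0\tACGGTCA\tIIIIIII\tCS:Z:T3130121\tNM:i:0"
seq, colors = parse_sam_record(record)
print(seq, colors)
```

For a BAM file you would first convert to SAM text (for example with `samtools view`) or use a library such as pysam instead of splitting lines by hand.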

Hello again. I used SHRiMP2 to align and came across the error "my_realloc error: realloc failed". I could not find an effective solution elsewhere. Could you help if you have faced this error?

Hi, below is what I have done:

1. Converted the .csfasta and .qual files for each sample to FASTQ using the 'fasta2fastq' script available in the SHRiMP2 bin folder.
2. Aligned the reads with the following command: SHRiMP_2_2_3/bin/gmapper-cs Can19.fastq canFam3.fa -N 24 -Q --qv-offset 33 > Can19.sam

The reference sequence in the above command is in letter-space format and the reads are in color-space format. Should the reference also be given in color-space format, or does SHRiMP handle a letter-space reference when aligning color-space reads?

With the above command, I got the error "my_realloc error: realloc failed". Has anyone come across this error?

I am getting the same error. What is weird is that I used the same command to run 10 samples in parallel: 2 of them seem to have finished fine, while the other 8 gave me the "realloc error". It seems to occur during the genome-loading step. Maybe it is a memory problem?

Have you managed to solve this issue?