From colorspace to basespace format
4
0
6.4 years ago
user230613 ▴ 310

Hello biostars,
I know that this issue has been widely discussed across several threads, but I can't figure out how to do the conversion. I've downloaded some SRA datasets and converted them to .csfasta and .qual using abi-dump. So I have these two files:

.csfasta file:

#
# Title: solid0065_201006
#
>SRR407311.1 2_21_490_F3
T3.23121101332.0133.2221.23.2.2103.330320302..32320

.qual file:

#
# Title: solid0065_201006
#
>SRR407311.1 2_21_490_F3
31 0 12 24 17 20 29 21 16 18 30 22 24 0 24 10 26 22 0 19 26 23 14 0 27 26 0 13 0 20 6 10 11 0 15 30 19 15 22 4 18 31 4 0 0 33 14 9 8 5

I want to convert them to basespace fastq format. I've tried the solid2fastq.pl script from BWA, but I get an empty fastq file. I've also tried solid2fastq from the bfast aligner, but I get:

@SRR407311.1 2_21_490
T3.23121101332.0133.2221.23.2.2103.330320302..32320
+
@!-925>613?79!9+;7!4;8/!<;!.!5'+,!0?407%3@%!!B/*)&
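For reference, the quality line in that output looks like the .qual integers written as Phred+33 characters, so the result is still a colorspace read in a FASTQ container rather than a basespace one. A minimal sketch of that encoding, assuming Phred+33 and using hypothetical helper names (`phred33`, `to_fastq`) that are not part of bfast or BWA:

```python
def phred33(quals):
    """Encode integer quality scores as a Phred+33 ASCII string."""
    # Assumption: negative scores (sometimes used for missing calls)
    # are clamped to 0 so they stay printable.
    return "".join(chr(max(q, 0) + 33) for q in quals)


def to_fastq(header, colors, qual_line):
    """Build one FASTQ record; the sequence is left in colorspace."""
    quals = [int(x) for x in qual_line.split()]
    return f"@{header}\n{colors}\n+\n{phred33(quals)}\n"


header = "SRR407311.1 2_21_490_F3"
colors = "T3.23121101332.0133.2221.23.2.2103.330320302..32320"
qual_line = (
    "31 0 12 24 17 20 29 21 16 18 30 22 24 0 24 10 26 22 0 19 "
    "26 23 14 0 27 26 0 13 0 20 6 10 11 0 15 30 19 15 22 4 "
    "18 31 4 0 0 33 14 9 8 5"
)

print(to_fastq(header, colors, qual_line), end="")
```

The sequence line is copied through unchanged, which is why a basespace-only aligner still cannot use the file.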

I've already read the threads about "not doing this conversion", but I need to map the files with STAR, and STAR does not accept colorspace data.

solid colorspace • 2.8k views
0

Why do you need to align this with STAR? The results will largely be complete crap (the . bases will completely destroy the conversion).

0

So, should I use another RNA mapper to map these reads without doing the conversion? I "need" to map them with STAR because I'm running tests with different RNA-Seq datasets in order to improve the annotation of a genome. Since I've used STAR for the other (non-SOLiD) datasets, I wanted to use STAR for this one too, to avoid bias from the choice of mapper. I hope I've explained myself well enough.

2

That is not a good way to improve anything - using the exact same tool regardless of whether it is appropriate or not.

I have observed this tendency in general, and I am not criticizing you in particular but rather the field as a whole: many bioinformaticians conflate the repeatability and reliability of a result with using the exact same tool with the same parameters. To me that is completely backwards. How much stock should anyone put into an annotation that is produced only when one uses one particular tool? If only that tool can produce a result, I would much sooner assume that there is a flaw in the way the tool works than that it is super effective.

1

I'm in complete agreement with Istvan on this one. Use a color-space aware aligner with this dataset to get the best results. The bias of using likely poor-quality results will dwarf any difference due to an aligner-effect.

0

I agree with you two, thanks a lot. I'll discuss these issues with my supervisor.
Thank you all, again :)

1
6.4 years ago

You cannot do that conversion. You have to work in colorspace.

Notice that after a single change in one color of a colorspace sequence, the whole converted sequence downstream of it changes completely, whereas when sequences are compared in colorspace only that one position changes and the rest of the sequence is preserved.
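A small sketch may make this concrete. It assumes the standard SOLiD two-base encoding, in which each color is the XOR of the 2-bit codes of adjacent bases (A=0, C=1, G=2, T=3); the function name `decode` and the short reads are made up for illustration:

```python
BASES = "ACGT"


def decode(cs_read):
    """Decode a colorspace read (primer base followed by colors) into bases.

    The primer base itself is not included in the returned sequence.
    """
    prev = BASES.index(cs_read[0])
    out = []
    for color in cs_read[1:]:
        if color == ".":
            # A missing color breaks the chain: every later base is unknown.
            out.append("N" * (len(cs_read) - 1 - len(out)))
            break
        prev ^= int(color)  # next base = previous base XOR color
        out.append(BASES[prev])
    return "".join(out)


good = "T3201230"
bad = "T3211230"  # same read with a single color changed (third color, 0 -> 1)

print(decode(good))  # AGGTCGG
print(decode(bad))   # AGTGATT
print(decode("T3.2"))  # ANN
```

The two decoded reads agree only up to the changed color, and a single `.` leaves everything after it undetermined, which is why reads with missing color calls convert so badly.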

0

Converting color space to base space is not as dire as it used to be. Many aligners, such as bwa mem, are now able to clip off the ends of a sequence and still align it as long as the correct portion is sufficiently long (say 20 bases or so). That being said, it is probably best to use a color-space-aware aligner if possible.

As for the original post, there are different types of conversions; see this:

Transforming And Manipulating Color Space Reads

1
6.4 years ago

You can, if you want, translate into fake basespace and map there.  For example, using reformat from the BBMap package:

reformat.sh in=genome.csfasta out=fake_ref.fa remap=0A1C2G3T.N

Then you can do the mapping with any basespace aligner.  Afterward you can convert the results back to colorspace if you want.  I think the mapping coordinates will all end up off by 1, but when you just want gene coverage, that won't matter.

Please note that fake basespace is not the same as real basespace - it's just encoding the colorspace data in a format that normal aligners can handle.
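To illustrate what such a remapping amounts to, here is a minimal sketch of the character substitution described by `0A1C2G3T.N`. It is not BBMap's implementation, the name `fake_basespace` is hypothetical, and how the leading primer base of a read should be handled is left out:

```python
# Substitution table for the mapping 0->A, 1->C, 2->G, 3->T, .->N
REMAP = str.maketrans("0123.", "ACGTN")


def fake_basespace(colors):
    """Relabel colors as letters; no two-base decoding is performed."""
    return colors.translate(REMAP)


print(fake_basespace("3.23121101332"))  # TNGTCGCCACTTG
```

Each symbol is relabeled independently, so one wrong color stays one wrong letter instead of corrupting the rest of the read, but the letters are not the real nucleotides.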

0
6.4 years ago

You can use TopHat, provided that you use bowtie1 as the mapper, since bowtie2 does not support colorspace. Instructions are provided in the TopHat documentation.

0

Well, I'm trying to do this with TopHat, but I'm getting this error:

Error running bowtie:
Too few quality values for read: 9200A
are you sure this is a FASTQ-int file?
terminate called after throwing an instance of 'int'


I've searched the internet, and there are several "open" posts related to this issue. Any suggestions? Or should I change the mapper? (And yes, maybe I should ask this in a new post, but I'm commenting here because it is related to your answer. Thank you.)
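One thing that may be worth ruling out for a message like this is a mismatch between the number of colors and the number of quality values per read. What exactly triggers the bowtie error is not established in this thread, so the following is only a hedged sketch of such a check; the helper names are hypothetical and it assumes one sequence or quality line per record:

```python
def records(lines):
    """Yield (name, body) pairs from csfasta or qual lines, skipping '#' comments."""
    name = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            name = line[1:]
        elif name is not None:
            yield name, line
            name = None


def mismatched(csfasta_lines, qual_lines):
    """Return (name, n_colors, n_quals) for reads whose counts or names disagree."""
    bad = []
    pairs = zip(records(csfasta_lines), records(qual_lines))
    for (name, colors), (qname, quals) in pairs:
        n_colors = len(colors) - 1  # first character is the primer base
        n_quals = len(quals.split())
        if name != qname or n_colors != n_quals:
            bad.append((name, n_colors, n_quals))
    return bad


csfasta = ["# Title: example", ">r1", "T0123", ">r2", "T01"]
qual = ["# Title: example", ">r1", "10 20 30", ">r2", "5 6"]
print(mismatched(csfasta, qual))  # [('r1', 4, 3)]
```

With real files, open file handles can be passed in place of the lists, since the functions only iterate over lines.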

0
6.4 years ago

Have you run FastQC on your data?

The program is alerting you that there are poor-quality sequences that need to be filtered before using a mapper. There are many filtering programs; many people use fastx-toolkit, Trimmomatic, etc.

0

Do any of those programs even work in colorspace?

0

FastQC can work with colorspace data, but I think it does so by converting the reads to regular bases. This means that the quality assessment is fine when evaluating the Q values, but problems can arise in the analyses that depend on the sequences themselves.

To be honest, I've been processing a colleague's RNA-Seq experiment done with SOLiD data, and it has been frustrating most of the time, at least in my hands. In the end only 7% of the reads could be aligned to the transcriptome.