Question: From colorspace to basespace format
3.8 years ago by
user230613280
Europe
user230613280 wrote:

Hello biostars,
I know this issue has been discussed at length in several threads, but I can't figure out how to do the conversion. I downloaded some SRA datasets and converted them to .csfasta and .qual using abi-dump, so I have these two files:

.csfasta file:

#
# Title: solid0065_201006
#
>SRR407311.1 2_21_490_F3
T3.23121101332.0133.2221.23.2.2103.330320302..32320

.qual file:

#
# Title: solid0065_201006
#
>SRR407311.1 2_21_490_F3
31 0 12 24 17 20 29 21 16 18 30 22 24 0 24 10 26 22 0 19 26 23 14 0 27 26 0 13 0 20 6 10 11 0 15 30 19 15 22 4 18 31 4 0 0 33 14 9 8 5

I want to convert them to basespace FASTQ format. I tried the solid2fastq.pl script from BWA, but I get an empty FASTQ file. I also tried solid2fastq from the BFAST aligner, but the output is still in colorspace:

@SRR407311.1 2_21_490
T3.23121101332.0133.2221.23.2.2103.330320302..32320
+
@!-925>613?79!9+;7!4;8/!<;!.!5'+,!0?407%3@%!!B/*)&
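
Note that only the quality line differs from the input: the read itself is unchanged. As a minimal sketch (the function name `qual_to_phred33` is made up for illustration, and Phred+33 is assumed because it reproduces the quality line shown above), the integers from the .qual file appear to map onto the FASTQ quality string like this:

```python
def qual_to_phred33(qual_line):
    """Encode space-separated integer qualities as a Phred+33 string."""
    return "".join(chr(int(q) + 33) for q in qual_line.split())

# First six quality values of read SRR407311.1 from the .qual file above
print(qual_to_phred33("31 0 12 24 17 20"))  # @!-925
```

A quality of 0 becomes `!`, and in this read those positions line up with the `.` (missing) color calls.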

I've already read the threads advising against this conversion, but I need to map the files with STAR, and STAR does not accept colorspace data.

Thanks in advance,
 

 


Why do you need to align this with STAR? The results will largely be complete crap (the . bases will completely destroy the conversion).

written 3.8 years ago by Devon Ryan (89k)

So, should I use another RNA mapper to map these reads without doing the conversion? I "need" to map them with STAR because I'm running tests with different RNA-Seq datasets in order to improve the annotation of a genome. Since I used STAR for the other (non-SOLiD) datasets, I wanted to use STAR for this dataset as well, to avoid any bias related to the choice of mapper. I hope I've explained myself well enough.

written 3.8 years ago by user230613 (280)

That is not a good way to improve anything - using the exact same tool regardless of whether it is appropriate or not.

I have observed this tendency in general, and I am not criticizing you in particular but rather the field as a whole: many bioinformaticians conflate the repeatability and reliability of a result with using the exact same tool with the same parameters. To me that is completely backwards. How much stock should anyone put in an annotation that is produced only when one particular tool is used? If only one tool can produce a result, I would much sooner assume that there is a flaw in the way the tool works than that it is super effective.

modified 3.8 years ago • written 3.8 years ago by Istvan Albert ♦♦ (80k)

I'm in complete agreement with Istvan on this one. Use a color-space aware aligner with this dataset to get the best results. The bias of using likely poor-quality results will dwarf any difference due to an aligner-effect.

written 3.8 years ago by Devon Ryan (89k)

I agree with you two, thanks a lot. I'll discuss these issues with my supervisor.
Thank you all, again :)
 

written 3.8 years ago by user230613 (280)
3.8 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

You cannot do that conversion. You have to work in colorspace.

Note that after a single color change in a colorspace read (for example a sequencing error), the whole converted sequence downstream of that position changes completely. When reads are compared in colorspace, only that one position changes, and the information in the rest of the sequence is preserved.
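
To make this concrete, here is a minimal sketch of the decoding step. It assumes the standard SOLiD two-base encoding, in which each color is the XOR of the 2-bit codes of two adjacent bases (A=0, C=1, G=2, T=3); the function name `decode_colorspace` and the short reads are made up for illustration.

```python
BASES = "ACGT"

def decode_colorspace(read):
    """Translate a SOLiD read (primer base followed by colors) into bases.

    Each color is the XOR of the 2-bit codes (A=0, C=1, G=2, T=3) of two
    adjacent bases, so every decoded base depends on all colors before it.
    """
    code = BASES.index(read[0])      # start from the primer base
    decoded = []
    for color in read[1:]:
        if color not in "0123":      # '.' marks a missing color call
            # Nothing after a missing color can be decoded reliably.
            decoded.append("N" * (len(read) - 1 - len(decoded)))
            break
        code ^= int(color)
        decoded.append(BASES[code])
    return "".join(decoded)

print(decode_colorspace("T3201"))  # AGGT
print(decode_colorspace("T3101"))  # ACCA  (one color changed, three bases differ)
print(decode_colorspace("T3.01"))  # ANNN  (a '.' wipes out the rest of the read)
```

The second and third calls show why the conversion is fragile: a single wrong or missing color corrupts every base after it, whereas in colorspace the two reads differ at exactly one position.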


Converting color space to base space is not as dire as it used to be. Many aligners, such as bwa mem, are now able to clip off the ends of a read and still align it, as long as the correct part of the sequence is sufficiently long (say 20 bases or so). That being said, it is probably best to use a color-space-aware aligner if possible.

As for the original post, there are different types of conversions; see this thread:

Transforming And Manipulating Color Space Reads

modified 3.8 years ago • written 3.8 years ago by Istvan Albert ♦♦ (80k)
3.8 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

You can, if you want, translate into fake basespace and map there.  For example, using reformat from the BBMap package:

reformat.sh in=reads.csfasta out=fake_bs.fa remap=0A1C2G3T.N ftl=1

reformat.sh in=genome.csfasta out=fake_ref.fa remap=0A1C2G3T.N

Then you can do the mapping with any basespace aligner.  Afterward you can convert the results back to colorspace if you want.  I think the mapping coordinates will all end up off by 1, but when you just want gene coverage, that won't matter.

Please note that fake basespace is not the same as real basespace - it's just encoding the colorspace data in a format that normal aligners can handle.
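
As a rough illustration of what that relabeling amounts to, here is a small sketch. It is based only on my reading of the command above (`remap=0A1C2G3T.N` swaps each symbol for a letter, `ftl=1` drops the leading primer base) and is not BBMap's actual code; the name `to_fake_basespace` is hypothetical.

```python
# Assumed equivalent of remap=0A1C2G3T.N: relabel each symbol, nothing more.
FAKE_BASES = str.maketrans("0123.", "ACGTN")

def to_fake_basespace(cs_read, trim_left=1):
    """Drop the first `trim_left` characters (the primer base), then relabel colors."""
    return cs_read[trim_left:].translate(FAKE_BASES)

# Start of read SRR407311.1 from the question
print(to_fake_basespace("T3.2312"))  # TNGTCG
```

The letters in the output are just labels for colors, so they only make sense when aligned against a reference that has been relabeled the same way.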

 

3.8 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

You can use TopHat, provided that you use Bowtie 1 as the mapper, since Bowtie 2 does not support colorspace. Instructions are provided in the TopHat documentation.


Well, I'm trying to do this with TopHat, but I'm getting this error:

Error running bowtie:
Too few quality values for read: 9200A
    are you sure this is a FASTQ-int file?
terminate called after throwing an instance of 'int'

I've searched the internet, and there are several unresolved posts about this issue. Any suggestions? Or should I change the mapper? (And yes, maybe I should ask this in a new post, but I'm commenting here because it relates to your answer. Thank you.)
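
One thing that might be worth ruling out with an error like this is a read whose number of quality values does not match its number of colors. This may or may not be the cause here; the sketch below is only a sanity check, it assumes one sequence line and one quality line per read, and the helper names `records` and `mismatched_reads` are made up for illustration.

```python
def records(lines):
    """Yield (name, payload) pairs from .csfasta or .qual lines, skipping '#' comments."""
    name = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith(">"):
            name = line[1:]
        elif name is not None:
            yield name, line
            name = None

def mismatched_reads(csfasta_lines, qual_lines):
    """Return names of reads whose color count differs from their quality count."""
    bad = []
    for (name, seq), (qname, quals) in zip(records(csfasta_lines), records(qual_lines)):
        if name != qname:
            raise ValueError(f"files out of sync: {name!r} vs {qname!r}")
        n_colors = len(seq) - 1          # the first character is the primer base
        if n_colors != len(quals.split()):
            bad.append(name)
    return bad

# Toy input: read r2 has two colors but only one quality value
cs = ["# Title: demo", ">r1", "T0123", ">r2", "T01"]
qv = ["# Title: demo", ">r1", "1 2 3 4", ">r2", "5"]
print(mismatched_reads(cs, qv))  # ['r2']
```

With real data, the two arguments can be open file handles for the .csfasta and .qual files.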

written 3.8 years ago by user230613 (280)
3.8 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

Have you run FastQC on your data?

The program is alerting you that there are poor-quality sequences that need to be filtered out before using a mapper. There are many filtering programs; many people use FASTX-Toolkit, Trimmomatic, etc.


Do any of those programs even work in colorspace?

written 3.8 years ago by Brian Bushnell (16k)

FastQC can work with colorspace data, but I think it does so by converting the reads to regular bases. This means the quality assessment is fine for evaluating the Q values, but problems can arise when the sequences themselves need to be considered.

To be honest, I've been processing a colleague's RNA-Seq experiment done with SOLiD data, and it has been frustrating most of the time, at least in my hands. In the end, only 7% of the reads could be aligned to the transcriptome.

written 3.8 years ago by Antonio R. Franco (4.0k)