Hi everyone,
I need to work with SOLiD small RNA-seq libraries from SRA, and I have some questions about quality control and the dos and don'ts of this kind of data.
1- Which is the best approach for quality checking my reads? I'm used to FastQC for quality checking, but I'm not sure whether it can handle colourspace reads (or handle them properly).
2- How should I filter/trim my reads? I may need to trim adaptors, but I don't want to quality-trim my reads, as they represent entire RNAs (sRNAs) and trimming them could lead to artifacts. I would rather filter out reads with low-quality bases. Which is a good tool for adaptor trimming and quality filtering of these libraries, and what range of quality values is considered acceptable (we usually apply a Q30 filter for Illumina libraries)?
3- As far as I know, SOLiD sequences a small portion of the 5' adapter, in addition to possibly sequencing the 3' adapter. Should I take this into account during adaptor trimming, or is it usually already removed?
4- This dataset has both SOLiD and Illumina data (different libraries, of course), but for the sake of standardization we would like to use STAR for aligning all libraries. I don't know if it can handle colourspace reads, so I would like to know whether it is OK to convert my colourspace reads into fastq, and at which point (before quality checking, before filtering, or before alignment) I could do this. If it's not advisable to do so, which tools are the best for aligning SOLiD reads?
I know I've asked a lot of questions, but I have never worked with this kind of data. If someone could at least point me in the right direction, or to some papers that could help with any of these questions, it would be great.
Thanks!
Where did you get this information from? Can you provide a source?
Regarding your questions, I think the best quality metrics for SOLiD reads are those based on mapping. FastQC supports SOLiD, but (in my experience) be prepared to see some very bad metrics. As SOLiD reads are most useful once mapped, I wouldn't bother quality trimming. I have never given much thought to adapter trimming for SOLiD reads, but I think the best result would be obtained after mapping.
The effect of the sequencing technology will be greater than any effect introduced by different mapping software, so I think this standardization is pointless. Subread is a fast aligner that supports both basespace and colorspace reads, but you have to build one index for each technology anyway. As JC said, do not convert from colorspace to basespace. The main (advertised) advantage of SOLiD is that its di-nucleotide encoding makes it easier to tell sequencing errors apart from mutations when mapping to a reference genome. Conversely, if you translate a read from colorspace to basespace and it contains a sequencing error in colorspace, every base downstream of the error will also be converted wrongly.
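To make the error propagation concrete, here is a small Python sketch. It assumes the usual two-base colour encoding (A=0, C=1, G=2, T=3, with each colour being the XOR of two adjacent bases) and a leading primer base of T; the function names and the toy sequence are made up for illustration and do not come from any particular tool.

```python
BASES = "ACGT"  # index of each base is its 2-bit code: A=0, C=1, G=2, T=3


def encode_colorspace(seq, primer_base="T"):
    """Encode a base sequence as primer base + colour digits (illustrative helper)."""
    prev = BASES.index(primer_base)
    colors = []
    for base in seq:
        cur = BASES.index(base)
        colors.append(str(prev ^ cur))  # colour = XOR of adjacent base codes
        prev = cur
    return primer_base + "".join(colors)


def decode_colorspace(read):
    """Naively translate a colourspace read back to bases (illustrative helper)."""
    prev = BASES.index(read[0])  # the known primer base seeds the decoding
    out = []
    for color in read[1:]:
        prev ^= int(color)  # each base depends on ALL previous colours
        out.append(BASES[prev])
    return "".join(out)


truth = "ACGTTGCA"
cs_read = encode_colorspace(truth)

# Simulate one sequencing error: change the third colour to a different value.
colors = list(cs_read[1:])
colors[2] = str((int(colors[2]) + 1) % 4)
bad_read = cs_read[0] + "".join(colors)

decoded = decode_colorspace(bad_read)
mismatches = [i for i, (a, b) in enumerate(zip(truth, decoded)) if a != b]

print("colourspace read :", cs_read)
print("with one error   :", bad_read)
print("true bases       :", truth)
print("naive translation:", decoded)
print("wrong positions  :", mismatches)
```

A single wrong colour corrupts every base from that point to the end of the read, whereas a colourspace-aware aligner sees it as just one colour mismatch against the reference.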
There's an article on NGS technologies that says SOLiD sequencing has 5 ligation rounds, and that after the first round the primer added to start each new round anneals at the -1 position relative to the previous one, so at least 4 bases from the 5' adapter should be sequenced. I wasn't sure whether this sequence would be found in the reads.
So the whole idea is to deal with quality after mapping my libraries to my reference? This is very different from what I'm used to. The explanation was very helpful, thank you so much!