the Sequence Read Archive (SRA) is NCBI's repository for published NGS data, and hence a great place to look for test datasets when trying out algorithms of interest.
we are currently evaluating the mapping results of several tools that handle color space SOLiD data, and we would like to use reads available from the SRA, but all we find there are fastq files. if each run were just a single fastq file containing all reads, we should be able to use it straight away (shouldn't we?), but we are getting triplets of files that we are not sure how to process.
a good example is SRX004555 (AB SOLiD sequencing of Human HapMap individual NA18507 genomic paired-end library). when you download the available data for this experiment, you find 4 fastq file triplets, 1 triplet per run, and this is where we are not sure how to proceed: should we map each file independently and merge the results? or should we concatenate the fastq files into a single massive one and then map that?
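to make the second option concrete, here is a minimal sketch of what we have in mind. it rests on assumptions we have not verified: that each triplet is named `<run>.fastq`, `<run>_1.fastq` and `<run>_2.fastq`, and that these hold unpaired reads and the two mates of each pair. `group_by_role` and `concatenate` are our own hypothetical helpers, not part of any SRA tool.

```python
import re
import shutil
from collections import defaultdict

# assumed naming convention: <run>.fastq, <run>_1.fastq, <run>_2.fastq
FASTQ_NAME = re.compile(r"^(?P<run>[A-Za-z0-9]+)(?:_(?P<mate>[12]))?\.fastq$")


def group_by_role(filenames):
    """Group fastq file names by role ("1", "2" or "unpaired") across runs.

    Names are sorted first, so every role lists the runs in the same order.
    Files that do not match the assumed naming convention are ignored.
    """
    roles = defaultdict(list)
    for name in sorted(filenames):
        match = FASTQ_NAME.match(name)
        if match is None:
            continue
        roles[match.group("mate") or "unpaired"].append(name)
    return dict(roles)


def concatenate(paths, out_path):
    """Append the given files, in order, into a single output file."""
    with open(out_path, "wb") as out:
        for path in paths:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

the idea would be to call `concatenate` once per role, so that the two mate files stay in the same run order, rather than pouring all twelve files into one. whether the mappers we are testing expect the data this way is exactly what we are unsure about.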
PS: does anybody know whether csfasta and qual files are available in the SRA? where else could one obtain such data? the only site we have found is the SOLiD website itself, but it does not offer many datasets.
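if the fastq files turn out to be in color space already, a fallback would be to rebuild csfasta and qual records ourselves. the sketch below is only that, and it assumes several things we have not checked: 4-line fastq records, a sequence made of one primer base followed by color digits (e.g. `T0123`), and Phred+33 quality encoding. `csfastq_to_csfasta_qual` is a hypothetical helper name.

```python
def csfastq_to_csfasta_qual(fastq_lines, phred_offset=33):
    """Yield (csfasta_record, qual_record) for each 4-line fastq record.

    Assumes the read is a primer base followed by color calls, and that
    qualities are encoded as ASCII characters with the given offset.
    """
    lines = [line.rstrip("\n") for line in fastq_lines]
    lines = [line for line in lines if line]  # skip blank lines
    for i in range(0, len(lines) - 3, 4):
        header, seq, _plus, qual = lines[i:i + 4]
        if not header.startswith("@"):
            raise ValueError(f"unexpected fastq header: {header!r}")
        name = header[1:].split()[0]
        n_colors = len(seq) - 1  # first character is the primer base
        scores = [ord(char) - phred_offset for char in qual]
        # assumption: if there is one extra score, it belongs to the
        # primer base and can be dropped
        if len(scores) == n_colors + 1:
            scores = scores[1:]
        qual_text = " ".join(str(score) for score in scores)
        yield f">{name}\n{seq}", f">{name}\n{qual_text}"
```

for a record with sequence `T0123` and quality string `!+5?`, this would give the csfasta record `>name` / `T0123` and the qual record `>name` / `0 10 20 30`.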