A while ago I tried something similar. The goal was to determine the multilocus sequence type (MLST) directly from the sequencing reads for all Staphylococcus aureus datasets available in the Short Read Archive. By now, the big sequencing centers (namely Sanger) have deposited more than 10,000 such datasets in the Short Read Archive, and I expect that most of them will never be assembled and put into the WGS database.
The determination of the sequence type worked reasonably well for my test datasets. But when I applied it to a larger number of datasets, I ran into all kinds of odd behaviour:
1) I soon ran out of disk space
2) misassigned species (not S.aureus but another staphylococcal species)
3) ill-defined datasets
For example, I found FASTQ files in which all reads were 200 nucleotides long, where the first 100 nucleotides represent the first read of a paired-end run and the next 100 nucleotides represent the second read. I found FASTQ files from paired-end runs where some of the sequences of the second read were reverse-complemented but others were not. I also found FASTQ files where the sequences were reverse-complemented but the quality values were not. Thus I came to the conclusion that I would have to inspect each downloaded FASTQ file individually before applying the automated MLST search, and I never started an automated download of a large number of SRR datasets.
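A few of these problems can at least be flagged automatically before any downstream analysis. The following is a minimal sketch (pure standard library; the filename and thresholds are made up for illustration) that tabulates the read-length distribution of a FASTQ file and counts records whose sequence and quality strings differ in length. It would catch the seq/qual mismatch case, and an unexpected single peak at 200 nt would hint at the concatenated-pairs layout described above; it cannot detect reverse-complemented mates.

```python
# Sketch: quick sanity checks on a downloaded FASTQ file before automated
# processing. Assumes plain 4-line FASTQ records (no wrapped sequences).
import gzip
from collections import Counter

def fastq_records(path):
    """Yield (header, seq, qual) tuples from a possibly gzipped FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        while True:
            header = fh.readline().rstrip()
            if not header:
                break
            seq = fh.readline().rstrip()
            fh.readline()  # skip the '+' separator line
            qual = fh.readline().rstrip()
            yield header, seq, qual

def sanity_check(path):
    """Return (read-length histogram, number of seq/qual length mismatches)."""
    lengths = Counter()
    mismatches = 0
    for header, seq, qual in fastq_records(path):
        lengths[len(seq)] += 1
        if len(seq) != len(qual):
            mismatches += 1
    return lengths, mismatches
```

A real pipeline would also want to compare the headers of the two mate files, but even this much would have saved me a few surprises.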
You will need a substantial amount of disk space for your project; my guess is about 10 TB. There are currently about 2800 SRR datasets available for Vibrio cholerae. On average a dataset may be 700 MB in size. If you convert SRA to FASTQ, you will need another 700 MB. If you map the reads to a reference genome, the resulting BAM file will also be at least 700 MB. This is because typical workflows involving FASTQ and BAM files copy the data at every step. Thus you will need about 2.5 GB of disk space for every SRA dataset you process.
With my (maybe limited) resources, the download of a single dataset from the European short read archive took about 5 minutes. This means that downloading 3000 datasets will take about 10 days to finish. All this is feasible, but it is definitely not a project for a rainy Sunday afternoon.
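The two estimates above are easy to reproduce as back-of-envelope arithmetic (the inputs are the numbers from this post: 2800 datasets, roughly 2.5 GB of working space each, about 5 minutes per download). The disk figure comes out just under 7 TB, so the 10 TB guess leaves some headroom for indexes and temporary files:

```python
# Back-of-envelope estimate using the numbers quoted above.
n_datasets = 2800
per_dataset_gb = 2.5          # SRA file + FASTQ copy + BAM file
minutes_per_download = 5

disk_tb = n_datasets * per_dataset_gb / 1024
download_days = n_datasets * minutes_per_download / 60 / 24

print(f"disk: ~{disk_tb:.1f} TB, download: ~{download_days:.1f} days")
# disk: ~6.8 TB, download: ~9.7 days
```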
6.1 years ago by
piet • 1.8k