Question: Checking integrity of SRA downloaded fastq files
flyingbebegurl wrote, 5 weeks ago:

I have been downloading data using the SRA toolkit with prefetch + fastq-dump, and more recently fasterq-dump. A variety of messages came up while I was troubleshooting, but I still get this one regularly:

fasterq-dump.2.9.6 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned -76 ( NET - Reading information from the socket failed )

The process continues anyway, and eventually I get the readout stating the number of spots and reads read.

I'm now curious about checking the integrity of the downloaded data. I've read that you can use MD5 checksums for the .sra file but not for the fastq files. Now that I already have the fastq files, is there another way to check their integrity? Is having the correct number of "reads read/written" printed after fasterq-dump finishes enough to confirm correct downloading and processing, or is there something I'm not thinking of?
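For reference, the number of reads actually present in a fastq can be counted directly and compared against the summary fasterq-dump prints. A minimal sketch, assuming uncompressed output with four lines per read; the file name is a placeholder:

    reads=$(( $(wc -l < SRR000001_1.fastq) / 4 ))
    echo "SRR000001_1.fastq: ${reads} reads"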


Consider using ENA instead of SRA: sra-explorer (find SRA and FastQ download URLs in a couple of clicks).

- igor, 5 weeks ago
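As igor suggests, ENA serves fastq files directly and publishes MD5 checksums for them, so the downloaded fastq can be verified end to end. A minimal sketch, assuming the ENA Portal filereport API and its fastq_ftp/fastq_md5 fields; the accession and file name are placeholders:

    acc=SRR000001   # placeholder accession
    # list the fastq.gz URLs and their expected MD5s for this run
    curl -s "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=${acc}&result=read_run&fields=fastq_ftp,fastq_md5"
    # after downloading each listed file, compare its checksum
    md5sum "${acc}"_1.fastq.gz   # should match the corresponding fastq_md5 value
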
genomax answered, 5 weeks ago:

Use vdb-validate, included in the sratoolkit. See this help page; the Validation section in this blog post is useful too.
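A minimal sketch of that workflow, assuming the run was fetched with prefetch so a local copy exists to validate; the accession is a placeholder:

    prefetch SRR000001       # fetch the run into the local repository
    vdb-validate SRR000001   # verify the internal checksums of the downloaded copy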


When I fasterq-dump large files, it doesn't download a .sra file to ~/ncbi/public/sra, but rather makes a temporary folder (such as /fasterq.tmp.ubuntu-xenial.1234) containing files many GB in size, which then gets deleted automatically. From my limited understanding, the .sra file doesn't persist in a form that vdb-validate can check.

The fasterq-dump instructions (https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) don't mention needing to prefetch, which is how you would obtain the .sra file to validate, and it sounds like fastq-dump will likely be deprecated in the future according to https://github.com/ncbi/sra-tools, so I'm not sure what people plan to do when using fasterq-dump.

Since I've already downloaded TBs of data in this manner (for better or worse), I wonder whether the reported number of reads read/written is sufficient to believe that each file was downloaded and processed as expected.

- flyingbebegurl, 5 weeks ago
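One way around this is to prefetch first, run vdb-validate on the local copy, and only then convert it with fasterq-dump. A sketch, assuming a hypothetical accessions.txt with one accession per line and a fastq/ output directory:

    mkdir -p fastq
    while read -r acc; do
        prefetch "$acc"     || { echo "download failed: $acc" >&2; continue; }
        vdb-validate "$acc" || { echo "validation failed: $acc" >&2; continue; }
        fasterq-dump --split-files "$acc" -O fastq/
    done < accessions.txt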

"I wonder whether the reported number of reads read/written is sufficient to believe that each file was downloaded and processed as expected."

You will need to decide whether whatever you plan to do can accept a small chance that the data may be corrupt. It is possible that the messages you saw were timeouts (your network bandwidth or storage likely could not keep up) and the data eventually downloaded fine. You can compare the number of reads and verify they look OK. If you run into a corrupt data file (a program in your pipeline will complain), plan to re-download that particular dataset.

- genomax, 5 weeks ago
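For paired-end runs, one more cheap consistency check is that the _1 and _2 files hold the same number of reads; a mismatch points to a truncated file. A sketch, assuming uncompressed split files in a hypothetical fastq/ directory:

    for r1 in fastq/*_1.fastq; do
        r2=${r1%_1.fastq}_2.fastq
        n1=$(( $(wc -l < "$r1") / 4 ))
        n2=$(( $(wc -l < "$r2") / 4 ))
        [ "$n1" -eq "$n2" ] || echo "read-count mismatch: $r1 ($n1) vs $r2 ($n2)"
    done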