Question: Checking integrity of SRA downloaded fastq files
15 months ago, flyingbebegurl wrote:

I have been downloading data using the SRA toolkit with prefetch + fastq-dump and, more recently, fasterq-dump. Various messages came up while I was troubleshooting, but I still regularly get this one: fasterq-dump.2.9.6 sys: timeout exhausted while reading file within network system module - mbedtls_ssl_read returned-76 ( NET - Reading information from the socket failed ). The process continues, and eventually I get the readout stating the number of spots and reads read.

I'm now curious about checking the integrity of the downloaded data - I've read that you can use MD5 checksums for the SRA file but not the fastq files. Now that I already have the fastq files, is there another way I can check the integrity? Is having the correct number of "reads read/written" printed out after fasterq-dump is finished enough to confirm correct downloading and processing or is there something I'm not thinking of?


Consider using ENA instead of SRA: sra-explorer : find SRA and FastQ download URLs in a couple of clicks

Reply written 15 months ago by igor
15 months ago, GenoMax (United States) wrote:

Use vdb-validate, which is included in sratoolkit. See this help page. The validation section in this blog post is useful too.


When I fasterq-dump large files, it doesn’t download a .sra file to home/ncbi/public/sra but rather makes a temporary folder (such as /fasterq.tmp.ubuntu-xenial.1234) containing many GB-sized files, which then gets deleted automatically. From my limited understanding, the .sra file doesn’t … exist in the same way to be able to vdb-validate.

The fasterq-dump instructions (https://github.com/ncbi/sra-tools/wiki/HowTo:-fasterq-dump) don’t mention needing to prefetch, which is where you would obtain the .sra file to validate, and it sounds like fastq-dump is likely to be deprecated in the future according to https://github.com/ncbi/sra-tools, so I’m not sure what people plan to do when using fasterq-dump.

Since I’ve already downloaded TBs of data in this manner (whether that was good or bad), I wonder whether the file number of reads read/written is sufficient to believe that the file was downloaded and processed as expected.

Reply written 15 months ago by flyingbebegurl

I wonder whether the file number of reads read/written is sufficient to believe that the file was downloaded and processed as expected.

You will need to decide if whatever you plan to do can accept a small chance that the data may be corrupt. It is possible that the messages you saw were timeouts (where your network bandwidth/storage likely could not keep up) and the data was eventually downloaded ok. You can compare the number of reads and verify they look ok. If you run into a problem with a corrupt data file (a program in your pipeline will complain) then plan to re-download that particular dataset.
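To make the read-count comparison concrete, here is a minimal sketch that counts the reads in each FASTQ file and checks that both mates of a pair agree. It assumes four-line FASTQ records (no wrapped sequence lines); the function names `count_fastq_reads` and `check_pair` are made up for illustration.

```shell
#!/bin/bash
# Count reads in a FASTQ file (plain or gzip-compressed).
# Fails if the line count is not a multiple of 4, which suggests truncation.
count_fastq_reads() {
    local file="$1" lines
    case "$file" in
        *.gz) lines=$(gzip -dc "$file" | wc -l) ;;
        *)    lines=$(wc -l < "$file") ;;
    esac
    if [ $((lines % 4)) -ne 0 ]; then
        echo "ERROR: $file has $lines lines (not a multiple of 4)" >&2
        return 1
    fi
    echo $((lines / 4))
}

# Check that both mates of a paired-end run contain the same number of reads.
# Usage: check_pair RUN_1.fastq RUN_2.fastq
check_pair() {
    local n1 n2
    n1=$(count_fastq_reads "$1") || return 1
    n2=$(count_fastq_reads "$2") || return 1
    if [ "$n1" -ne "$n2" ]; then
        echo "MISMATCH: $1 has $n1 reads, $2 has $n2" >&2
        return 1
    fi
    echo "OK: $n1 read pairs"
}
```

The count printed here can then be compared by hand with the number of reads reported by fasterq-dump. A matching count catches truncation but not corrupted bases inside otherwise complete records.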

Reply written 15 months ago by GenoMax

I am checking the integrity of my downloaded SRRxxx.fastq files because they are very large (e.g., 70G each; paired end). After downloading, I typed:

vdb-validate -V                 
vdb-validate : 2.10.6

Then I tested it on my FASTQ files:

vdb-validate SRR5048028_1.fastq

Nothing showed up. People have shown online that it should print "xxx is OK". What happened in my situation?

Reply written 7 months ago by Kai_Qi

As far as I know, you can validate files with the .sra extension, which are temporary files created by fastq-dump while downloading data from SRA.
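One way to make sure a .sra file exists to validate is to prefetch it explicitly before converting. A minimal sketch, assuming `prefetch`, `vdb-validate`, and `fasterq-dump` from sra-tools are on the PATH and that `vdb-validate` returns a non-zero exit status when validation fails; the wrapper name `fetch_validate_dump` is made up for illustration.

```shell
#!/bin/bash
# Download the .sra archive, validate it, and only then convert it to FASTQ.
# Stops at the first failing step so a bad archive is never converted.
# Usage: fetch_validate_dump ACCESSION
fetch_validate_dump() {
    local acc="$1"
    prefetch "$acc"     || { echo "prefetch failed for $acc" >&2; return 1; }
    vdb-validate "$acc" || { echo "validation failed for $acc" >&2; return 1; }
    fasterq-dump "$acc"
}

# Example call, using the accession mentioned in this thread:
#   fetch_validate_dump SRR5048028
```

This validates the archive, not the FASTQ files written afterwards, so it complements rather than replaces a check of the FASTQ output.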

Reply written 7 months ago by Benedek Dankó

Thank you. Your reply works.

Reply written 7 months ago by Kai_Qi

5 months ago, simplitia wrote:

I ran into the same issue: vdb-validate does not work if you download the fastq files without prefetching the .sra file, because you are left with only a fastq file, which cannot be validated. It is strange that a simple checksum isn't provided; maybe we can get the admins to include one? Anyhow, if you download the SRA metadata file, one of its columns gives the total number of bases. What I did next was count the total bases for either the _1 or the _2 file and multiply by 2 if the run is paired. For single-end runs, just leave the count as is.

Here is a simple bash script that you can use. It outputs the total number of bases, which you can then match against the metadata.

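#!/bin/bash
# Usage: pass a gzip-compressed FASTQ file as the first argument.
F=$1

# Sum the length of every sequence line (line 2 of each 4-line record).
gzip -dc "$F" |
     awk 'NR%4==2{l+=length($0)}
          END{
                # %.0f keeps large totals out of scientific notation in some awks
                printf "%.0f\n", l;
              }'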
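The manual matching step can also be scripted. The sketch below sums the bases over all files of a run (so no multiplying by 2 is needed for paired data) and compares the total with the value from the metadata. It assumes four-line FASTQ records; the function names `total_bases` and `check_bases` are made up for illustration, and the expected value has to be copied from the metadata file by hand.

```shell
#!/bin/bash
# Sum the sequence lengths (line 2 of each 4-line record) over one or more
# FASTQ files, plain or gzip-compressed.
total_bases() {
    local f
    for f in "$@"; do
        case "$f" in
            *.gz) gzip -dc "$f" ;;
            *)    cat "$f" ;;
        esac
    done | awk 'NR % 4 == 2 { n += length($0) } END { printf "%.0f\n", n }'
}

# Compare the observed total with the value reported in the run metadata.
# Usage: check_bases EXPECTED_BASES RUN_1.fastq.gz [RUN_2.fastq.gz]
check_bases() {
    local expected="$1" observed
    shift
    observed=$(total_bases "$@")
    if [ "$observed" = "$expected" ]; then
        echo "OK: $observed bases"
    else
        echo "MISMATCH: expected $expected, observed $observed" >&2
        return 1
    fi
}
```

For a paired-end run, pass both the _1 and _2 files so the total covers the whole run.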

Actually, when you download the fastq files, it will generate a .sra file in the location you set in the SRA toolkit. You can use vdb-validate to test that file. I used to delete these files, but after the discussion above, I will validate them first and then delete them.

Hope it helps,

Kai

Reply written 4 months ago by Kai_Qi