Question

Downloading a huge amount of Fastq files

21 months ago

davidmaimoun ▴ 50

Hi,

I need to download a huge number of FASTQ files (~400K) to create an index.

I usually do my downloads with the SRA tools, but with this amount of data it would take several years.

Do you have any suggestions?

Thank you

fastq • 2.2k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 21 months ago by davidmaimoun ▴ 50

Would you mind adding some details? What do you want to download, what is "huge", why do you think it takes "years", what are your commands and what is the problem? Please try to be a bit professional rather than so overdramatic ("years", "huge").

Most likely the answer is to do batch downloads from ENA via Aspera, but let's first see what you actually want to download.

ADD REPLY • link 21 months ago by ATpoint 81k

Yes sorry,

By huge I meant about 400K genomes of Salmonella. As for years, I really don't know. For instance, I wanted to try COBS for indexing the genomes, so I went to 'pathogens detection' to get SRA accession codes and downloaded 10 sequences (each one has 2 reads, 1 and 2) via sratools. It took me about 30 minutes to get them all.

Here is the command: ./fasterq-dump SRR19754566 SRR19754569 SRR19753926 SRR19753927 SRR19737889 SRR19735935 SRR19735863 SRR19733910 SRR19733911 -p -e 20

Thank you

ADD REPLY • link 21 months ago by davidmaimoun ▴ 50

If you are getting data from SRA, then these are not really genomes; this is simply raw data that was submitted to SRA. If you actually need genomes, then you need to look at the genome section of the NCBI site.

If you do want to download the raw data then you will want to use sra-explorer.info to get direct fastq download links: sra-explorer : find SRA and FastQ download URLs in a couple of clicks

Then use Aspera to download: Setting up Aspera Connect (ascp) on Linux and macOS

Beyond the technical challenge of downloading, the bigger issue you will face is likely local storage and having enough resources to deal with index creation etc.

ADD REPLY • link 21 months ago by GenoMax 141k

Try the ASPERA client CLI.

ADD REPLY • link 21 months ago by Arup Ghosh 3.2k

Thank you for the help, it seems that Aspera is the better solution for now. I need to check the docs to see how it works. For storage we will use a cloud system, and for the indexing, COBS looks good to me, but maybe there are better solutions. I am really new to the field, and I am still learning these technologies.

Mensur Dlakic I agree with you that it will be quite challenging, but there must be a way.

I work in pathogen detection at my Ministry of Health. We deal with Salmonella all the time. Normally we use NCBI BLAST, but it works only on the assemblies and doesn't cover all the data submitted for the species. In NCBI Pathogen Detection I saw that there are more than 400K isolates of Salmonella, and it would be useful for us to compare our own isolates to them.

ADD REPLY • link 21 months ago by davidmaimoun ▴ 50

I agree with you that it will be quite challenging, but there must be a way.

Certainly there is a way, but it will take a while and require tremendous resources.

As I said, it isn't just the download that you need to think about. Assuming that a dataset is 5 GB on average, which I think is a low value, you'd need >1900 TB of disk space to store 400K datasets. I will let you calculate the bandwidth and the time required to do so because I don't know the speed of your network. Even if you manage the download, the fun only begins when it comes to organization and processing. Good luck.
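
For what it's worth, the estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the 5 GB per dataset figure comes from the post, while the 1 Gbit/s sustained link speed and the function names are assumptions chosen for illustration, so plug in your own numbers.

```python
def storage_tb(n_datasets: int, gb_per_dataset: float) -> float:
    """Total storage in TB, using 1 TB = 1024 GB."""
    return n_datasets * gb_per_dataset / 1024


def transfer_days(total_gb: float, gbit_per_s: float) -> float:
    """Days needed to move total_gb gigabytes over a link that
    sustains gbit_per_s gigabits per second (1 byte = 8 bits)."""
    seconds = total_gb * 8 / gbit_per_s
    return seconds / 86_400  # seconds per day


n_datasets = 400_000      # from the thread
gb_per_dataset = 5        # from the thread (described as a low estimate)
assumed_link_gbit = 1.0   # ASSUMPTION: sustained 1 Gbit/s, not from the thread

total_gb = n_datasets * gb_per_dataset
print(f"storage:  {storage_tb(n_datasets, gb_per_dataset):.0f} TB")
print(f"transfer: {transfer_days(total_gb, assumed_link_gbit):.0f} days")
```

Under those assumptions the storage comes out just above 1900 TB, and the transfer alone would take roughly half a year of uninterrupted downloading at full line rate.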

ADD REPLY • link 21 months ago by Mensur Dlakic ★ 27k

I agree with you, I'll think about how to do it

Thank you for your help

ADD REPLY • link 21 months ago by davidmaimoun ▴ 50

NCBI Pathogen Detection I saw that there are more than 400K isolates of salmonella, and it will be useful for us to compare our own isolates to them

Pathogen detection portal also lists clusters of samples so you don't need to download each and every isolate.

There are 13450 Salmonella enterica genomes available at NCBI: https://www.ncbi.nlm.nih.gov/genome/browse#!/overview/salmonella Many are listed in the pathogen portal. My guess would be that they probably represent most (if not all) of the isolates.

If you are going to do this exercise for a national surveillance center, then using assemblies may be better than using raw data.

ADD REPLY • link 21 months ago by GenoMax 141k

So do you think it will be enough to deal with these 13450?

That would be great, because it means dealing with FASTA files, which is very convenient for me.

Thank you

ADD REPLY • link 21 months ago by davidmaimoun ▴ 50

Starting with assemblies listed in the pathogen portal (they are associated with specific strains, AMR genotypes etc) should provide a good start.

I looked at ~20K pages at the pathogen portal (there are ~22K total) and they had assemblies listed. So only a small fraction of isolates will not have an assembly associated with them.

ADD REPLY • link 21 months ago by GenoMax 141k

Thanks a lot, it was very helpful.

ADD REPLY • link 21 months ago by davidmaimoun ▴ 50

score 1 · Answer 1 · 2022-06-29

I have a hard time believing that there are 400K genomes of Salmonella deposited. But even if that's the case, there has to be considerable redundancy among them. Let's say that you can download them at 1 minute per dataset. Do you really think that downloading will be your main problem? How about storing that much data? Or processing and analyzing 400K datasets? Even if you had the bandwidth and computational power to do everything in parallel, I think it would be wildly optimistic to assume it would take you on average 10 minutes per dataset to do all of this. And 10 minutes times 400K is more than 7 years.
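
The arithmetic behind that last figure is easy to check. A minimal sketch, where the function name and the 365-day year are illustrative assumptions and the per-dataset times are the ones quoted above:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def serial_years(n_datasets: int, minutes_each: float) -> float:
    """Wall-clock years if datasets are handled one after another."""
    return n_datasets * minutes_each / MINUTES_PER_YEAR


# 10 minutes per dataset for download plus processing, as in the post
print(f"{serial_years(400_000, 10):.1f} years")

# 1 minute per dataset for the download alone, as in the post
print(f"{serial_years(400_000, 1) * 365:.0f} days")
```

So even the download-only scenario at 1 minute per dataset is the better part of a year when done serially.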

Rather than trying to find a way to download 400K sets of files, it may be better to intelligently arrive at a smaller subset that will still serve your purpose and be feasible to work with.

score 1 · Answer 2 · 2022-06-30

Using the Aspera CLI will dramatically speed up your download. I have heard that you get full speed only if your machine is located in the USA (I tested this myself: connecting through a VPN node in the USA increased my download speed several-fold).

In some cases the database provides a fasta-aspera link so you can download directly. In other cases, you can modify the FTP URL to download the file with Aspera.

For instance, an FTP URL like ftp://ftp.sra.ebi.ac.uk/vol1/..... can be replaced by era-fasp@fasp.sra.ebi.ac.uk:/vol1/...

An FTP URL like ftp://ftp.ebi.ac.uk/databases/.... can be replaced by fasp-ebi@fasp.ebi.ac.uk:databases/...

At least in my experience, I always use these two "headers" to download FASTQ files via Aspera. Hope this helps.
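
If you have a long list of FTP links, the two prefix substitutions described above can be scripted. The following is only a sketch of that string rewrite: the function name and the example path are hypothetical, and it does not check that the resulting Aspera source actually exists on the server.

```python
# FTP prefix -> Aspera prefix, exactly the two "headers" mentioned above
PREFIX_MAP = {
    "ftp://ftp.sra.ebi.ac.uk/": "era-fasp@fasp.sra.ebi.ac.uk:/",
    "ftp://ftp.ebi.ac.uk/": "fasp-ebi@fasp.ebi.ac.uk:",
}


def ftp_to_aspera(url: str) -> str:
    """Rewrite an EBI FTP URL into the matching Aspera source string."""
    for ftp_prefix, aspera_prefix in PREFIX_MAP.items():
        if url.startswith(ftp_prefix):
            return aspera_prefix + url[len(ftp_prefix):]
    raise ValueError(f"no known Aspera prefix for: {url}")


# Hypothetical path, for illustration only
print(ftp_to_aspera("ftp://ftp.sra.ebi.ac.uk/vol1/example/sample_1.fastq.gz"))
```

The returned string is what you would pass as the source argument to the Aspera client. Any URL that does not start with one of the two known prefixes raises an error instead of being silently passed through.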