Help with massive SRA download of mixed file types (SE fastq, PE fastq, bam)
2.1 years ago
jmnz22 ▴ 10

Hi!

I have a list of 1000 SRA accessions with mixed file formats: single-end FASTQ, paired-end FASTQ, and BAM files. The problem is that I don't know which type each accession is; I only have the list (unless I patiently check them one by one on the SRA website, which I hope is not the solution).

Is there a way to tell which is which, so that I can use the correct *-dump command? Can I use fasterq-dump with --split-3 to download everything? Or, if I do that, will the single-end FASTQ and BAM files be corrupted in some way (or not downloaded)?

I want to believe there is a way to make sra-toolkit identify each file type and download it accordingly (fingers crossed!!)

I would appreciate any help with it!! Thanks!!! Best regards!

sra-toolkit faster-dump SRA • 1.6k views

Another solution is to use sra-explorer (https://sra-explorer.info/) and get direct links for the FASTQ files, either as a bash script or as Aspera download links. For that many accessions it would be best to stay away from sratoolkit.


Thank you GenoMax. I tried 'sra-explorer': it works perfectly fine with individual accessions (GSMxxx or SRRxxxx), but I cannot make it work when I give it multiple accessions. The query changes automatically to 'SRRxxx[All Fields] AND SRRxxx[All Fields]', and it tries to load some results but then stops and shows nothing. I am not sure how it can work with more than one query.

The good thing is that it lists some alternative tools at the bottom of the webpage.

2.1 years ago
jmnz22 ▴ 10

Hi!!

As I commented before, three tools are mentioned at the bottom of the sra-explorer webpage: nf-core/fetchngs, pysradb, and fetchfastq (ffq).

I discarded the first one because of the comments from GenoMax and jared.andrews07, and tried the other two.

I couldn't make pysradb work. Any SRRxxx that I gave it returned "not found in database", and its gsm-to-srr function raised the same error. (The first time you run it, it downloads a database that its commands use to fetch metadata or download files.)

I think the third one is more promising: fetchfastq was able to find every accession I fed it, and in return it gives you a JSON file with all the metadata for that accession. It handles SRR and GSM accessions (and others).

These JSON files contain all the metadata, including the file type and the different download URLs (the fetchfastq user guide describes how to retrieve only the URLs and pipe them to curl, although I haven't tried it yet).

From here I guess it should be feasible to loop over the accessions and get all the URLs to download the files as they are. There is also room to parse the JSON files into one big metadata table with all the information about the files for downstream analysis. But I am not good enough at scripting to manage that kind of task :=)
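The loop described above could look roughly like this. It is only a minimal sketch: it assumes the accessions sit one per line in a plain text file, and that `ffq -o FILE ACCESSION` writes the metadata JSON, as in the command shown later in this thread. The function name `fetch_all_metadata` and the file and directory names in the usage line are hypothetical.

```shell
# Sketch: fetch ffq metadata for every accession in a list.
# Assumes `ffq` is installed and on PATH.

fetch_all_metadata() {
    local list="$1"      # text file, one accession per line
    local outdir="$2"    # directory for the per-accession JSON files
    local acc

    mkdir -p "$outdir"
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines and comment lines
        case "$acc" in
            ''|'#'*) continue ;;
        esac
        # Do not re-fetch accessions that already have a JSON file
        if [ -s "$outdir/$acc.json" ]; then
            continue
        fi
        # One failed accession should not abort the whole run;
        # stdin is redirected so ffq cannot consume the accession list
        if ! ffq -o "$outdir/$acc.json" "$acc" < /dev/null; then
            echo "$acc" >> "$outdir/failed.txt"
        fi
    done < "$list"
}

# Usage (hypothetical names):
# fetch_all_metadata accessions.txt ffq_json
```

Because existing JSON files are skipped, the function can simply be re-run after an interruption, and `failed.txt` collects the accessions that need a second look.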

The direct link to this tool: https://github.com/pachterlab/ffq

Any further thoughts? XD

Thanks!


This is still not going to solve the issue of getting data submitted to SRA that is only available under the Data Access tab, e.g. BAM files. Sometimes these are needed to recreate FASTQ files in their original format for 10x or PacBio data.

2.1 years ago

I'd just yoink 'em all with the nf-core fetchngs pipeline. May not snag the bams though. To do that you can try sam-dump as shown here. You seemingly need to use a cloud bucket, e.g. AWS or GCP.


@Jared are you sure that sam-dump is correct? Perhaps OP is referring to the BAM files that are present in the Data Access tab. I don't think those can be downloaded using sratools.


Nope, I am not. I've never tried to download BAMs from SRA. Looks like you're correct: you need to use a cloud platform, e.g. AWS or GCP, to get those.


Also, the problem would be knowing what kind of file each accession is, right? This is one of the tools mentioned in sra-explorer...


The nf-core pipeline doesn't care, but it's not going to get the bams for you. It will grab the FASTQs properly at least. If there's one FASTQ, it's most likely single-end; if two, it's paired. In combination with GenoMax's answer below, which retrieves the metadata programmatically, it'd likely get you most of the way there for the FASTQs.
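The "one FASTQ is single-end, two is paired" rule of thumb can be checked after the download with a small helper. This is a rough sketch, assuming file names of the form `SRRxxx.fastq.gz`, `SRRxxx_1.fastq.gz` and `SRRxxx_2.fastq.gz`; the function name `classify_by_fastq_count` is made up for illustration, and anything other than one or two files per accession is only flagged for manual checking.

```shell
# Sketch: guess the layout of each accession from its number of FASTQ files.

classify_by_fastq_count() {
    # stdin:  FASTQ file names or paths, one per line
    # stdout: accession <TAB> number of files <TAB> guessed layout
    awk '
        NF == 0 { next }                           # ignore blank lines
        {
            n = split($0, parts, "/")              # drop any directory part
            name = parts[n]
            sub(/\.(fastq|fq)(\.gz)?$/, "", name)  # drop the extension
            sub(/_[12]$/, "", name)                # drop a trailing _1 or _2
            count[name]++
        }
        END {
            for (acc in count) {
                layout = "CHECK"
                if (count[acc] == 1) layout = "SINGLE"
                if (count[acc] == 2) layout = "PAIRED"
                printf "%s\t%d\t%s\n", acc, count[acc], layout
            }
        }
    ' | sort
}

# Usage (hypothetical directory name):
# ls fastq_dir | classify_by_fastq_count
```

An accession with three files (for example `_1`, `_2` plus a file of unpaired reads) ends up as `CHECK` rather than being guessed, which seems safer than forcing it into one of the two layouts.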

The BAMs are a different beast.

2.1 years ago
GenoMax 142k

Using EntrezDirect will get you the download locations for (I think) the .sra data. You will need to fastq-dump the data locally, but you will know which of the data is single-end and which is paired-end.

$ esearch -db sra -query SRP043510 | efetch -format runinfo | awk -F "," '{print $10,$16,$19,$20}'
download_path LibraryLayout Platform Model
https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR1448774/SRR1448774 SINGLE ILLUMINA Illumina HiSeq 2000
https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR1448775/SRR1448775 SINGLE ILLUMINA Illumina HiSeq 2000
https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR1448776/SRR1448776 SINGLE ILLUMINA Illumina HiSeq 2000
https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR1448777/SRR1448777 SINGLE ILLUMINA Illumina HiSeq 2000
https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR1448778/SRR1448778 SINGLE ILLUMINA Illumina HiSeq 2000
https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR1448779/SRR1448779 SINGLE ILLUMINA Illumina HiSeq 2000
https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR1448780/SRR1448780 SINGLE ILLUMINA Illumina HiSeq 2000
https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR1448781/SRR1448781 SINGLE ILLUMINA Illumina HiSeq 2000
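To turn that runinfo table into a per-run layout list, something like the following might help. It is a sketch under a few assumptions: the input is the CSV produced by `efetch -format runinfo`, its header row starts with `Run` and contains a `LibraryLayout` column (the same column printed above as `$16`), and no field before that column contains a quoted comma. The function name `runinfo_layouts` is hypothetical. Columns are looked up by header name rather than by fixed position.

```shell
# Sketch: reduce runinfo CSV to "run accession <TAB> layout".

runinfo_layouts() {
    # stdin:  runinfo CSV from `efetch -format runinfo`
    # stdout: run accession <TAB> LibraryLayout, one run per line
    awk -F ',' '
        # A header row defines (or redefines) the column positions;
        # this also copes with a header that is repeated mid-stream
        $1 == "Run" {
            run_col = 0; layout_col = 0
            for (i = 1; i <= NF; i++) {
                if ($i == "Run") run_col = i
                if ($i == "LibraryLayout") layout_col = i
            }
            next
        }
        # Data rows: print only when both columns were found
        run_col && layout_col && NF >= layout_col && $run_col != "" {
            printf "%s\t%s\n", $run_col, $layout_col
        }
    '
}

# Usage, with the same query as above:
# esearch -db sra -query SRP043510 | efetch -format runinfo | runinfo_layouts
```

The two-column output can then be filtered on the second column to get separate lists of SINGLE and PAIRED runs for the dump step.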

Uhmmm, this looks really helpful for retrieving the data I need to check all the accessions quickly. fetchfastq has this option too: it gives you links to the actual files, which is probably helpful as well, though probably slower than using fastq-dump or fasterq-dump. But it is also true that if it finds a BAM it will download the BAM without messing with the data, I guess.

For the same SRP you used:

$ ffq --ftp SRP043510
(some of the output....)
SRR1448787      fastq   1   ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/007/SRR1448787/SRR1448787.fastq.gz
SRR1448788      fastq   1   ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/008/SRR1448788/SRR1448788.fastq.gz
SRR1448789      fastq   1   ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/009/SRR1448789/SRR1448789.fastq.gz
SRR1448790      fastq   1   ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/000/SRR1448790/SRR1448790_1.fastq.gz
SRR1448790      fastq   2   ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/000/SRR1448790/SRR1448790_2.fastq.gz
SRR1448774      fastq   1   ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/004/SRR1448774/SRR1448774.fastq.gz
SRR1448775      fastq   1   ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/005/SRR1448775/SRR1448775.fastq.gz
SRR1448776      fastq   1   ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR144/006/SRR1448776/SRR1448776.fastq.gz
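A table like this can be reduced to a plain URL list per file type and handed to curl. The sketch below assumes the four whitespace-separated columns shown above (accession, file type, file number, URL); if your ffq version prints something else, for instance JSON, this filter will not apply as is. The function name `urls_by_filetype` and the file names in the usage lines are hypothetical.

```shell
# Sketch: pull the URLs for one file type out of a four-column table.

urls_by_filetype() {
    # $1:     file type to keep, e.g. fastq or bam (second column)
    # stdin:  table with columns accession, type, file number, URL
    # stdout: matching URLs, one per line
    awk -v want="$1" '
        NF >= 4 && $2 == want && $4 ~ /^(ftp|https?):\/\// { print $4 }
    '
}

# Usage (hypothetical file names):
# urls_by_filetype fastq < ffq_table.txt > fastq_urls.txt
# xargs -n 1 curl -O < fastq_urls.txt
```

Keeping the URL list in a file makes it easy to re-run only the downloads that failed, and the same function called with `bam` would pick out any BAM links that appear in the table.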

And if you go for the JSON, it gives you this full output (you can adjust how much detail it includes):

ffq -o SRP043510.json SRP043510

output: https://ln5.sync.com/dl/b462d0e20/xtn3dadn-ijvvmibu-ngnvrhb7-7gjmh8uq

I think it is useful to talk about these options. They are probably documented somewhere on the web, but I couldn't find them easily, so I hope this helps more people too!!
