Download a billion FASTA sequences
18 months ago

Please help me download a billion FASTA sequences quickly using a Python script, a shell script, or an awk script.

If possible, please provide the script.

fasta • 909 views

Why do you need a billion fasta sequences?

Download what? From where? How are you going to store all this data? Is it one file or a billion?

If you expect to be able to work on a billion files, you are going to run into LOTS of additional problems.

This question is unanswerable as it stands.

18 months ago

Here is a way to get half a billion sequences quite quickly actually:

Get the BLAST nr database:

time update_blastdb.pl --decompress --source aws --num_threads 10 nr

It downloads 384 GB of data in about an hour (I sure have fast internet here!) and prints:

Connected to AWS

real    65m29.036s
user    9m16.289s
sys     65m22.974s
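
To put those numbers in perspective, here is a small back-of-the-envelope sketch of the average transfer rate they imply. It assumes that "384 GB" means decimal gigabytes (10**9 bytes) and that the whole wall-clock time was spent downloading; neither is stated in the post, so treat the result as a rough figure (on the order of 98 MB/s, a bit under 0.8 Gbit/s).

```python
# Figures quoted in the post; the unit interpretation is an assumption.
DOWNLOADED_GB = 384                # assumed decimal gigabytes (10**9 bytes)
WALL_CLOCK_S = 65 * 60 + 29.036    # "real 65m29.036s" converted to seconds


def mean_throughput(gigabytes: float, seconds: float) -> tuple[float, float]:
    """Return the average rate as (megabytes per second, megabits per second)."""
    mb_per_s = gigabytes * 1000 / seconds
    return mb_per_s, mb_per_s * 8


mb_s, mbit_s = mean_throughput(DOWNLOADED_GB, WALL_CLOCK_S)
print(f"{mb_s:.0f} MB/s, about {mbit_s:.0f} Mbit/s")
```

On a slower connection the same function gives a quick estimate in the other direction: divide 384 GB by your own rate to see how long the download would take.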

If you want to turn that into FASTA (don't do it though!), you could then do:

blastdbcmd -db nr -entry all > halfbillion.fa
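
If the goal is to look at the sequences rather than keep a huge FASTA file around, one option is to stream the records and process them one at a time. The following is a minimal sketch, not part of the original answer: `iter_fasta` and `count_records` are illustrative names, and the parser assumes plain FASTA text where every record starts with a `>` header line.

```python
import io
from typing import Iterable, Iterator, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs one record at a time."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def count_records(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of records, total residues) without storing the records."""
    n_records = 0
    n_residues = 0
    for _header, sequence in iter_fasta(lines):
        n_records += 1
        n_residues += len(sequence)
    return n_records, n_residues


# Tiny in-memory demo; any iterable of lines works the same way.
demo = io.StringIO(">seq1 example\nMKV\nLA\n>seq2\nMT\n")
print(count_records(demo))
```

Because only one record is held in memory at a time, the same functions could be fed `sys.stdin` in a script that sits at the end of a pipe from `blastdbcmd`, so the half-billion-record FASTA file never has to be written to disk.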

I was baffled by this answer, and now even more so that it got 3 votes.

One could argue that the OP asked for any billion FASTA sequences, since the question is not worded with enough detail, but that is unlikely to be the case. I don't think it is best practice to give an answer that could tie up the network for hours and require 384 GB of disk space without being clear that this is what the OP wants. Add that most people outside of educational and government institutions can't download 384 GB in one hour, and that update_blastdb.pl doesn't come standard on most systems, and it doesn't seem at all like an answer to this question, even allowing for the fact that the question lacked detail.


I should have posted with a "tongue in cheek" symbol, making it clear I was somewhat joking.

As I read the original post, I got curious about how one would realistically get a billion FASTA sequences from the web, and I also happened to need to download nr for work. Thus the "answer".

In a nutshell, I really don't see how a regular person could make a billion requests over the web to download FASTA files, or how they would even organize those files on their system without breaking basic commands in various ways.

But as it turns out, a BLAST database is a perfectly usable way to both download and maintain that information, and I posted the sizes and download speeds primarily because I hoped they would be educational. If anyone needs to store or distribute a very large number of FASTA sequences, storing them as a BLAST database might be a good way to go about it.
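
As a rough illustration of using a BLAST database as the storage layer, the sketch below pulls only selected records back out as FASTA by calling `blastdbcmd` with a comma-separated `-entry` list. The helper names (`blastdbcmd_args`, `fetch_fasta`) and the placeholder accessions are made up for the example, and it assumes BLAST+ is installed and the database is where `blastdbcmd` can find it.

```python
import subprocess
from typing import List, Sequence


def blastdbcmd_args(db: str, accessions: Sequence[str]) -> List[str]:
    """Build the argument list for a blastdbcmd lookup of specific entries."""
    if not accessions:
        raise ValueError("at least one accession is required")
    return ["blastdbcmd", "-db", db, "-entry", ",".join(accessions)]


def fetch_fasta(db: str, accessions: Sequence[str]) -> str:
    """Run blastdbcmd and return its FASTA output as text.

    Assumes blastdbcmd is on PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        blastdbcmd_args(db, accessions),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# Placeholder identifiers for illustration only, not real accessions.
print(blastdbcmd_args("nr", ["ACCESSION_1", "ACCESSION_2"]))
```

Only the argument list is printed here, so the snippet runs without BLAST+ installed; calling `fetch_fasta("nr", [...])` with real accessions would perform the actual lookup.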


You could save a lot of bandwidth by downloading reads from SRA and then just converting them to FASTA.
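
The conversion step itself is simple once the reads are available as FASTQ. Here is a minimal sketch, assuming plain four-line FASTQ records (header, sequence, `+` line, qualities) with no wrapped sequence lines; `fastq_to_fasta` is an illustrative name, and the SRA Toolkit's own dump tools are probably the better choice in practice.

```python
import io
from typing import Iterable, Iterator


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines (header, sequence) for each 4-line FASTQ record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        # Pull the remaining three lines of the record from the same iterator.
        seq = next(it, None)
        plus = next(it, None)
        qual = next(it, None)
        if seq is None or plus is None or qual is None:
            raise ValueError("truncated FASTQ record")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("input does not look like 4-line FASTQ")
        yield ">" + header[1:]
        yield seq.rstrip("\n")


# Tiny in-memory demo with two made-up reads.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nFFFF\n")
for line in fastq_to_fasta(demo):
    print(line)
```

The generator drops the quality lines and keeps one record in memory at a time, so it can be applied to arbitrarily large inputs.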
