Question: Is it possible to automatically update NCBI fasta sequences from the command line?
Asked 3.5 years ago by hcwang (Vancouver, Canada):

Hi all,

I downloaded fasta sequences from the NCBI FTP site with the method described in http://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete . Recently, I used my customised database for blast and got many of the desired results. However, one thing I noticed is that some of the sequences have been updated or removed since the last time I downloaded the whole genome dataset, and some of these removed sequences actually interfered with my results because they prevented the detection of one of my spiked-in organisms. I'm therefore wondering if there is a simple way to update my genome database using command-line tools such as the EDirect utilities. I wish to avoid re-downloading my entire database because that's just a waste of resources.

To give an example, sequence NC_000521.3 was updated to AL844502.1 and then to AL844502.2, and sequence NW_001850357.1 has been removed from the NCBI database entirely.

Tags: command-line • update • fasta • ncbi

The example you include is a difference in the minor version of that accession number, so it is not likely to change your blast results significantly. As for sequences that have been removed, you may want to drop them from your database based on @piet's/@Matt Shirley's strategies. This will need some careful tweaking, but once you have created the necessary scripts the process should be reasonably painless.

You have not said what you are doing with this custom database, but it sounds like you need to repeat the analysis with some frequency. I am not sure why you are worried about resources (unless you are paying for the bandwidth/storage), since getting the right answer has higher priority. This can simply be considered a cost of doing bioinformatics.

Comment written 3.5 years ago by genomax

Thank you all for your responses! To answer @genomax2's question, I'm building local copies of all virus, bacteria, and fungi databases for detecting organisms in Illumina sequencing runs with local blast and RAPSearch2. Minor version changes are alright, but some of the removed sequences really altered my results: one of my 100 bp reads mapped at 100% to the removed NW_001850357.1 and at 99% to the species I spiked in (Candida albicans). Since I was only taking the top hit as the organism detected, I missed a lot of the Candida albicans in my report. Bandwidth/storage is not too much of a concern, but I wouldn't want to re-download everything for each update, since I'm using a shared computing server and that may slightly affect other people's projects.

I took a similar approach to @piet's. I tried the wget -N command, but it seems the files are still transferred and just not stored. So I wrote a script around the shell test if [ -f $report ]; then echo $report' exists!'; fi to check for, and skip, any assembly file that is already in storage. However, I'm not able to check every sequence within each assembly for updates. Do you know if it's possible to update only a particular sequence?
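Roughly, the check is wrapped around the download like this (a minimal sketch; the base URL and the accession list file are placeholders for illustration, not my actual layout):

#!/bin/bash
# Skip re-downloading any assembly file that is already present locally.
BASE_URL="http://ftp.ncbi.nlm.nih.gov/genomes/all"   # placeholder base URL

while read -r asm; do
    report="${asm}_genomic.fna.gz"
    if [ -f "$report" ]; then
        echo "$report exists, skipping"
    else
        wget "${BASE_URL}/${asm}/${report}"
    fi
done < assembly_list.txt    # one assembly name (e.g. GCF_..._ASM...) per line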

I haven't tried @Matt Shirley's approach yet. I'll give it a shot.

One more thing I noticed is that old-version/removed genome sequences tend to no longer have a taxon id assigned to them. I'm not sure if this is the case for every sequence, but if it is, I can run a taxonomy search over all the sequences and list those without a taxon id. Then I may have to manually re-download, replace, and rebuild the database with the new sequences. I've written another post on how to Retrieve a subset of FASTA from large Illumina multi-FASTA file, and that approach can also be adapted to work on the databases.
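For that taxid check, I'm thinking of something along these lines (an untested sketch; it assumes EDirect is installed and that the nuccore document summaries include a TaxId field):

#!/bin/bash
# For each accession in the database, fetch its document summary from Entrez
# and report the ones that come back without a taxid (or not at all).
while read -r acc; do
    taxid=$(esummary -db nuccore -id "$acc" 2>/dev/null |
            xtract -pattern DocumentSummary -element TaxId)
    if [ -z "$taxid" ]; then
        echo "$acc: no taxid found, flag for manual re-download"
    fi
done < accession_list.txt    # one accession (e.g. NW_001850357.1) per line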

Reply written 3.5 years ago by hcwang
Answer written 3.5 years ago by piet (planet earth):

"I wish to avoid re-downloading my entire database because that's just a waste of resources."

A single assembly of P. falciparum in gzipped FASTA is about 6 MB, and there are about 20 such assemblies. If you download all of them every few weeks, it will take a few minutes. Setting up a clever strategy for mirroring only the differences will keep you busy for several working days and distract you from doing research into the AT deserts of malaria.

As outlined on the 'ftpfaq' page, you should first download the appropriate index file, which lists all files available for download. Next, remove from your local copy all files that are no longer listed in the index. Then download all files listed in the index with a tool like 'wget', using the option '-N'. With this option, wget will start the download only if the file on the server is newer than your local copy. You should use the 'http' protocol instead of 'ftp' when you use option '-N'.

wget -N http://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000002765.3_ASM276v1/GCF_000002765.3_ASM276v1_genomic.fna.gz

If you repeat this command a second time, no download will take place and you will see '304 Not Modified' in the log output. HTTP status 304 indicates that the resource on the server has not been modified in the meantime.
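Putting the steps together, the whole update might look roughly like this (a sketch only; it assumes the index is an assembly_summary.txt file with the FTP path in column 20, that the bacteria group is wanted, and that only the *_genomic.fna.gz file of each assembly is kept locally):

#!/bin/bash
# 1) fetch the current index (only if it has changed on the server)
wget -N http://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

# 2) list the file names that should exist locally according to the index
awk -F '\t' '!/^#/ {n = split($20, a, "/"); print a[n] "_genomic.fna.gz"}' \
    assembly_summary.txt | sort > wanted.txt

# 3) delete local files that are no longer listed in the index
find . -maxdepth 1 -name '*_genomic.fna.gz' -printf '%f\n' | sort \
    | comm -23 - wanted.txt | xargs -r rm -v

# 4) download files that are new or newer on the server than the local copy
awk -F '\t' '!/^#/ {sub("^ftp:", "http:", $20);
                    n = split($20, a, "/");
                    print $20 "/" a[n] "_genomic.fna.gz"}' assembly_summary.txt \
    | wget -N -i -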

Answer written 3.5 years ago by Matt Shirley (Cambridge, MA):

Use rsync. NCBI's FTP server supports it, and you'll be able to keep a local mirror of a directory without transferring all of the contents during every update. See here for an example of rsync'ing a directory.
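For example, a command along these lines should keep one assembly directory in sync (a sketch that reuses the directory from the wget example above; --delete also removes local files that have disappeared from the server):

rsync -av --delete rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000002765.3_ASM276v1/ GCF_000002765.3_ASM276v1/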


Interesting, I did not know that NCBI supports the anonymous rsync protocol.

rsync excels at mirroring whole directory trees and at moving data from one hard drive to another. But in the code snippet from Nick Loman, as well as in the use case of this thread, only carefully selected files are copied from a very large source directory. By no means do we want to mirror all of the files on the NCBI server, which are also highly redundant since they are offered in several formats. A handy feature of rsync is that you can use wildcards to select files on the remote server.
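For instance, something like this should pull only the genomic FASTA file from a given assembly directory, with the wildcard expanded on the server side (the path is just an illustration, reusing the assembly from the answer above):

rsync -av "rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000002765.3_ASM276v1/*_genomic.fna.gz" ./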

Comment written 3.5 years ago by piet