Is it possible to automatically update NCBI fasta sequences in command-line?
8.2 years ago
hcwang ▴ 50

Hi all,

I downloaded FASTA sequences from the NCBI FTP site with the method described in http://www.ncbi.nlm.nih.gov/genome/doc/ftpfaq/#allcomplete . Recently, I used my customised database for BLAST and got many desired results. However, I noticed that some of the sequences have been updated or removed since I last downloaded the whole genome dataset. Some of these removed sequences actually interfered with my results because they prevented the detection of one of my spiked-in organisms. Thus, I'm wondering if there is a simple way to update my genome database using command-line tools like the EDirect utilities. I wish to avoid re-downloading my database at all because that's just a waste of resources.

To give an example, sequence NC_000521.3 is updated to AL844502.1 then to AL844502.2. And sequence NW_001850357.1 has been completely removed from the NCBI database.

ncbi fasta update command-line • 2.2k views

The example you included differs only in the minor version of that accession number, so it is not likely to change your BLAST result significantly. As for sequences that have been removed, you may want to remove them using @Piet's or @Matt Shirley's strategies. This would need some careful tweaking, but once you have created the necessary scripts the process should be reasonably painless.
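As a rough sketch of the kind of script meant here: assuming the accessions of the withdrawn records have been collected into a plain-text list (one accession.version per line; the file name removed.txt is hypothetical), awk can drop those records from a multi-FASTA file. The sketch assumes headers of the form '>NC_000521.3 description'; older 'gi|...' style headers would need a different parsing step.

```shell
# Drop FASTA records whose accession.version appears in a removal list.
# Assumption: the accession.version is the first whitespace-delimited
# token of each header line, directly after the ">".
# Usage: drop_accessions removed.txt < in.fna > out.fna
drop_accessions() {
    awk -v list="$1" '
        BEGIN {
            # Load the removal list into a lookup table, skipping blank lines.
            while ((getline line < list) > 0) {
                split(line, f, " ")
                if (f[1] != "") drop[f[1]] = 1
            }
        }
        # On each header, decide whether the following record is kept.
        /^>/ { id = substr($1, 2); keep = !(id in drop) }
        # Print header and sequence lines of kept records only.
        keep
    '
}
```

After filtering, the BLAST database still has to be rebuilt from the new file (for example with makeblastdb), since the removed records remain in the old index otherwise.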

You have not said what you are doing with this custom database, but it sounds like you need to repeat the analysis with some frequency. I am not sure why you are worried about resources (unless you are paying for the bandwidth/storage), since getting the right answer has higher priority. This can simply be considered a cost of doing bioinformatics.


Thank you for all your responses! To answer @genomax2's question: I'm building local copies of all virus, bacteria, and fungi databases to detect organisms in Illumina sequencing runs using local BLAST and RAPSearch2. Minor version changes are all right, but some of the removed sequences really altered my results because my 100 bp read mapped 100% to the removed NW_001850357.1 and 99% to the species I spiked in (Candida albicans). Since I was only taking the top hit as the organism detected, I missed a bunch of the Candida albicans in my report. Bandwidth/storage is not too much of a concern, but I wouldn't want to re-download everything for each update since I'm using a shared computing server and that may slightly affect other people's projects.

I took a similar approach to @piet's. I tried the wget -N command, but it seems that the server still transfers the file and wget just does not store it. Thus, I wrote a script with the Unix command if [ -f "$report" ]; then echo "$report exists!"; fi to check for and skip any assembly file that is already in storage. However, I'm not able to check every sequence in each assembly for updates. Do you know if it's possible to update only a certain sequence?

I haven't tried @Matt Shirley's approach yet. I'll give it a shot.

One more thing I noticed is that old versions of genome sequences and removed ones tend not to have a taxon ID assigned to them anymore. I'm not sure if this is the case for every sequence. If it is, I can run a taxonomy search for all the sequences and list those that don't have a taxon ID assigned. Then I may have to manually re-download and replace them and rebuild the database with the new sequences. I've described how to extract a subset of records in another post, "Retrieve a subset of FASTA from large Illumina multi-FASTA file". It can also be modified to work for the databases.
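A purely local way to spot changed records, once a fresh copy of an assembly file is available, is to compare the accession.version lists of the old and new FASTA files. The sketch below is only an illustration under assumptions: both files are uncompressed, and headers start with '>ACCESSION.VERSION ' (the function names are made up for this example).

```shell
# Print the accession.version of every record in a FASTA file, sorted.
# Assumption: it is the first token of each header, right after ">".
list_accessions() {
    grep '^>' "$1" | awk '{ print substr($1, 2) }' | LC_ALL=C sort -u
}

# Print accessions found in the old file but not in the new one.
# Usage: missing_accessions old.fna new.fna
missing_accessions() {
    _old_list=$(mktemp) || return 1
    _new_list=$(mktemp) || return 1
    list_accessions "$1" > "$_old_list"
    list_accessions "$2" > "$_new_list"
    # comm -23 keeps lines unique to the first (old) list.
    LC_ALL=C comm -23 "$_old_list" "$_new_list"
    rm -f "$_old_list" "$_new_list"
}
```

A version bump and a withdrawn record both appear in this output, because the old accession.version no longer occurs in the new file. Telling the two apart would need an extra lookup, and a record replaced under a different accession cannot be matched by name at all.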

8.2 years ago
piet ★ 1.8k

"I wish to avoid re-downloading my database at all because that's just a waste of resources."

A single assembly of P. falciparum in gzipped FASTA is about 6 MB, and there are about 20 such assemblies. If you download all of them every few weeks, it will take a few minutes. Setting up a clever strategy for mirroring only the differences will keep you busy for several working days and distract you from doing research into the AT deserts of malaria.

As outlined on the 'ftpfaq' page, you should first download the appropriate index file which lists all files available for download. Next you should remove all files from your local copy which are not listed in the index anymore. Then you should download all files in the index with a tool like 'wget' and utilizing option '-N'. With this option, wget will start the download only if the file on the server is newer than your local copy. You should use 'http' protocol instead of 'ftp' protocol, if you use option '-N'.

wget -N http://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000002765.3_ASM276v1/GCF_000002765.3_ASM276v1_genomic.fna.gz

If you run this command a second time, no download will take place and you will see '304 Not Modified' in the log output. HTTP status 304 indicates that the resource on the server has not been modified since your local copy was downloaded.
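The index-driven steps described above could be sketched roughly as follows. This is a minimal illustration, not a tested mirroring tool. It assumes the index has already been reduced to a plain list of download URLs, one per line (urls.txt is a hypothetical name), and that local copies live in a single directory; both function names are made up for this example.

```shell
# List local *.fna.gz files whose file name no longer occurs in the URL list.
# Usage: stale_files urls.txt local_dir
stale_files() {
    for f in "$2"/*.fna.gz; do
        [ -e "$f" ] || continue            # the glob matched nothing
        name=${f##*/}
        # Compare against the last path component of every URL.
        awk -F/ -v n="$name" '$NF == n { found = 1 } END { exit !found }' "$1" \
            || printf '%s\n' "$f"
    done
}

# Fetch every URL in the list into local_dir. With -N, wget skips files
# that are not newer on the server than the local copy.
# Usage: fetch_updates urls.txt local_dir
fetch_updates() {
    wget -N -P "$2" -i "$1"
}

# Typical order of operations (not executed here):
#   stale_files urls.txt genomes | while IFS= read -r f; do rm -- "$f"; done
#   fetch_updates urls.txt genomes
```

It is probably worth reviewing the output of stale_files before deleting anything, since an incomplete or truncated index would otherwise wipe out valid local copies.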

8.2 years ago

Use rsync. NCBI's FTP server supports it, and you'll be able to keep a local mirror of a directory without transferring all of the contents during every update. See here for an example of rsync'ing a directory.


Interesting, I did not know that NCBI supports anonymous rsync protocol.

rsync excels at mirroring whole directory trees and at moving data from one hard drive to another. But in the code snippet from Nick Loman, as well as in the use case of this thread, only carefully selected files are copied from a very large source directory. We certainly do not want to mirror all the files on the NCBI server, which are also highly redundant since they are offered in several formats. A handy feature of rsync is that you can use wildcards to select files on the remote server.
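One way to express such a selection is with rsync's include/exclude filter rules, a slightly different mechanism from a wildcard in the source path. The sketch below is an assumption-laden illustration: sync_matching is a made-up helper name, and the remote path in the usage comment is simply copied from the wget example earlier in the thread, so it may not match the current server layout.

```shell
# Copy only files matching PATTERN from SRC into DEST, keeping timestamps.
# Usage: sync_matching SRC DEST PATTERN
sync_matching() {
    # --include='*/'  : descend into every subdirectory
    # --include="$3"  : keep files matching the pattern
    # --exclude='*'   : skip everything else
    # --prune-empty-dirs avoids creating directories that end up empty.
    rsync -rt --prune-empty-dirs \
        --include='*/' --include="$3" --exclude='*' \
        "$1" "$2"
}

# Example call (not executed here; the remote path is an assumption):
#   sync_matching rsync://ftp.ncbi.nlm.nih.gov/genomes/all/GCF_000002765.3_ASM276v1/ \
#       ./genomes/ '*_genomic.fna.gz'
```

Because -t preserves modification times, repeated runs transfer only files whose size or timestamp differs from the local copy, which is the behaviour wanted for periodic updates.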
