How to build a BLAST database (makeblastdb) from UniProt's taxonomic divisions?
4
1
6.8 years ago
Eliad ▴ 80

Hi,

I'd like to create a BLAST database (makeblastdb) from UniProt's Bacteria division.

I turned to the database release on FTP because downloading the query results from the website was extremely slow.

 

So I went to UniProt's downloads page:

http://www.uniprot.org/downloads

Then clicked 'Taxonomic divisions' (or in my case the FTP mirror that is closer to my country):

ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/

And downloaded these two files:

uniprot_sprot_bacteria.dat.gz

uniprot_trembl_bacteria.dat.gz

 

After unzipping them, I realized these are not FASTA files.

I can't figure out what format they are in or how to create a BLAST database from them.

 

Any help would be greatly appreciated.

blast makeblastdb uniprot taxonomy • 4.7k views
2
5.8 years ago
mjp ▴ 30

Not sure if you are still interested, but this may be useful for someone else who stumbles upon a similar problem.

There is a CPAN module for converting .dat to .fasta, part of InSilicoSpectro-Databanks.

It has a few options to explore, but you would simply do:

uniprotdata2fasta.pl --in=uniprot_sprot.dat --out=uniprot_sprot.fasta

With this conversion in place you should be good to go for your database generation.

0
6.8 years ago
5heikki 10k

Are they GenBank files? If so, this could work:

esl-reformat fasta inputFile > output.fasta

The Easel tools ship with HMMER.

0

Thanks, but these are not GenBank files.

These are UniProt Knowledgebase database files as described here:

http://web.expasy.org/docs/userman.html#convent

I think I'll just parse these myself.

The challenge is to use Unix pipes (|) to process the data from the compressed files all the way to makeblastdb, via sed and/or awk.

I'll post my one-liner once it is done.

1

I had a look. They are basically GenBank-style flat files. Here's what I came up with: if a line begins with "ID", print the second column as the FASTA header; if a line begins with a space (these are the sequence lines), print the entire line; ignore all other lines. Finally, delete the spaces.

awk '{if(/^ID/) print ">"$2; else if (/^[[:blank:]]/) print $0}' uniprot_sprot_archaea.dat | tr -d " " > uniprot_sprot_archaea.fasta

It would be more elegant to delete the spaces within the awk command, but whatever.

Maybe you would be more interested in using the "AC" (accession) lines rather than the "ID" lines as templates for the FASTA headers. I don't know.
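
For anyone who prefers accessions as headers, here is a minimal sketch of that variant. It assumes the primary accession is the first identifier on the first AC line of each entry and that only sequence lines begin with a space. The function name dat_to_fasta_ac is made up for illustration; it reads an uncompressed .dat stream on stdin.

```shell
# Hypothetical helper: convert a UniProtKB flat-file stream (stdin) to FASTA,
# using the primary accession from the first AC line as the header.
dat_to_fasta_ac() {
    awk '
        # A new entry starts: the next AC line carries the primary accession.
        /^ID/ { need_header = 1; next }
        # First AC line of the entry: take the first accession, drop the ";".
        /^AC/ && need_header {
            acc = $2
            sub(/;$/, "", acc)
            print ">" acc
            need_header = 0
            next
        }
        # Sequence lines start with spaces: strip the spaces and print.
        /^ / { gsub(/ /, ""); print }
    '
}
```

Usage would look like `zcat uniprot_sprot_archaea.dat.gz | dat_to_fasta_ac > uniprot_sprot_archaea.fasta`. Secondary accessions on the same or later AC lines are ignored.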

0

Thanks!

I polished it a little:

awk '{if (/^ /) {gsub(/ /, ""); print} else if (/^ID/) print ">" $2}'

So the whole thing looks like this as a shell script:

zcat /path/uniprot_{sprot,trembl}_bacteria.dat.gz \
    | awk '{if (/^ /) {gsub(/ /, ""); print} else if (/^ID/) print ">" $2}' \
    | makeblastdb \
        -out /path/uniprot_bacteria  \
        -dbtype prot -hash_index -title uniprot_bacteria -max_file_sz '50GB'
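
Before building the database, it may be worth checking that the awk step produced one FASTA record per entry. The following is only a sketch: it assumes every entry in the .dat file ends with a "//" line, and the function name check_dat_conversion is hypothetical. It uses `gzip -dc` in place of zcat, which should behave the same on gzip files.

```shell
# Hypothetical helper: compare the number of entries in one gzip-compressed
# UniProtKB flat file with the number of FASTA records the awk step yields.
check_dat_conversion() {
    dat_gz=$1
    # Each entry is assumed to end with a line starting with "//".
    n_entries=$(gzip -dc "$dat_gz" | grep -c '^//')
    # Same awk conversion as in the pipeline above; count the ">" headers.
    n_records=$(gzip -dc "$dat_gz" \
        | awk '{if (/^ /) {gsub(/ /, ""); print} else if (/^ID/) print ">" $2}' \
        | grep -c '^>')
    echo "entries=$n_entries fasta_records=$n_records"
    # Exit status is 0 only when both counts agree.
    [ "$n_entries" -eq "$n_records" ]
}
```

Calling `check_dat_conversion /path/uniprot_sprot_bacteria.dat.gz` prints both counts and returns a non-zero status on a mismatch, so it can gate the makeblastdb step.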

 

Newbie question: How do I accept your second comment as the answer?

0
6.8 years ago

Another option is to convert the UniProt text format into FASTA format with the "official" UniProt FASTA headers:

Install the Swissknife Perl module (http://swissknife.sourceforge.net/docs/) and then run one of the following two programs (both require Swissknife):

1) For a simple FASTA conversion that only includes canonical sequences, run the attached script as follows:

perl sp_to_fasta uniprot_bacteria.dat > uniprot_bacteria.fasta

2) For a FASTA file that includes alternative isoforms, download ftp://ftp.ebi.ac.uk/pub/software/uniprot/varsplic/varsplic.pl and run it locally, e.g. with a command line such as:

perl varsplic.pl -input uniprot_bacteria.dat -check_vsps -crosscheck -error varsplic.err -fasta varsplic_bacteria.fasta -which full

 

sp_to_fasta:

# Purpose: 
# Read a file in SP format, write it in FASTA format.
#
# Usage:
# sp_to_fasta SP_file > FASTA_file

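use strict;
use warnings;

use IO::File;
use SWISS::Entry;

# First command-line argument: path to the UniProt flat file.
@ARGV or die "Usage: sp_to_fasta SP_file > FASTA_file\n";
my $inputfile = $ARGV[0];
my $fh = IO::File->new($inputfile, 'r')
    or die "Cannot open input file $inputfile: $!";

# Read one entry at a time: entries are terminated by a "//" line.
$/ = "\n//";
while (<$fh>) {
    s/\r//g;                             # strip carriage returns
    (my $entry_txt = $_) =~ s/^\s+//;    # drop the newline left after the previous "//"
    next unless $entry_txt;              # skip the empty tail after the last entry
    $entry_txt .= "\n";
    my $entry = SWISS::Entry->fromText($entry_txt);
    print $entry->toFasta();
}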
 
0
6.8 years ago
Arnaud Ceol ▴ 850

You can also try splitting your download, for instance fetching the entries in batches of 500:

numentries=XXXX # Run a first query on the website to find the number of entries

for i in $(seq 0 500 "$numentries"); do
    wget -O "uniprot_$i.fasta" "http://www.uniprot.org/uniprot/?sort=score&desc=&compress=no&query=taxonomy:bacteria&fil=&limit=500&force=no&preview=true&format=fasta&offset=$i"
done

 

To create a query, just run a search on the UniProt website, click download->preview, and copy the URL.
