Question: How to makeblastdb Uniprot's Taxonomic Divisions?
Eliad (NIBN, BGU, Israel) wrote, 5.2 years ago:

Hi,

I'm interested in creating a Blast database (makeblastdb) from Uniprot's Bacteria division.

I had to turn to the database release on FTP due to the extremely slow download speed of the website's query results.

 

So I went to Uniprot's downloads page:

http://www.uniprot.org/downloads

Then clicked 'Taxonomic divisions' (or in my case the FTP mirror that is closer to my country):

ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/taxonomic_divisions/

And downloaded these two files:

uniprot_sprot_bacteria.dat.gz

uniprot_trembl_bacteria.dat.gz

 

After unzipping them, I realized these were not FASTA files.

I can't figure out what format they are in or how to create a BLAST database from them.


Any help would be greatly appreciated.

mjp (USA) wrote, 4.2 years ago:

Not sure if you are still interested, but this may be useful for someone else who stumbles upon a similar problem.

There is a CPAN module for converting .dat to .fasta, part of InSilicoSpectro-Databanks.

It has a few options to explore, but you would simply do:

uniprotdata2fasta.pl --in=uniprot_sprot.dat --out=uniprot_sprot.fasta

With this conversion in place you should be good to go for your database generation.

5heikki (Finland) wrote, 5.2 years ago:

Are they genbank files? If yes, this could work:

esl-reformat fasta inputFile > output.fasta

Easel tools ship with hmmer.


Thanks, but these are not genbank files.

These are UniProt Knowledgebase database files as described here:

http://web.expasy.org/docs/userman.html#convent

I think I'll just parse these myself.

The challenge is to use Unix pipes (|) to process the data from the compressed files all the way to makeblastdb, via sed and/or awk.

I'll post my one-liner once it is done.

Reply written 5.2 years ago by Eliad

I had a look. They are basically GenBank files. Here's what I came up with: if a line begins with "ID", print its second column as the header; if a line begins with a space, print the entire line; ignore all other lines. Finally, delete the spaces.

awk '{if(/^ID/) print ">"$2; else if (/^[[:blank:]]/) print $0}' uniprot_sprot_archaea.dat | tr -d " " > uniprot_sprot_archaea.fasta

Would be more elegant if spaces were deleted in the awk command but whatever.

But maybe you would prefer the "AC" (accession) lines over the "ID" lines as the source of the FASTA headers. I don't know.
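For anyone who wants accession-based headers, here is a minimal sketch of that variant. It assumes the first accession on the first "AC" line of each entry is the one to keep, and the function name dat_to_fasta_ac is made up for illustration; it is not part of any toolkit.

```shell
# Sketch: UniProtKB flat file -> FASTA, using the first accession as the header.
# dat_to_fasta_ac is a hypothetical helper name.
dat_to_fasta_ac() {
    awk '
        /^ID/ { need_acc = 1; next }      # new entry: wait for its first AC line
        /^AC/ && need_acc {               # first AC line only; later AC lines are ignored
            acc = $2
            sub(/;$/, "", acc)            # "P12345;" -> "P12345"
            print ">" acc
            need_acc = 0
            next
        }
        /^ / { gsub(/ /, ""); print }     # sequence lines start with spaces
    ' "$@"
}
```

It reads the files given as arguments, or stdin when there are none, so something like zcat uniprot_sprot_bacteria.dat.gz | dat_to_fasta_ac should work the same way as the ID-based command.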

Reply written 5.2 years ago by 5heikki

Thanks!

I polished it a little:

awk '{if (/^ /) {gsub(/ /, ""); print} else if (/^ID/) print ">" $2}'

So the whole thing looks like this in a shell script:

zcat /path/uniprot_{sprot,trembl}_bacteria.dat.gz \
    | awk '{if (/^ /) {gsub(/ /, ""); print} else if (/^ID/) print ">" $2}' \
    | makeblastdb \
        -out /path/uniprot_bacteria  \
        -dbtype prot -hash_index -title uniprot_bacteria -max_file_sz '50GB'
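Since nothing is written to disk between the compressed files and makeblastdb, it can be worth checking that no entries were lost on the way. The following is only a sketch: count_fasta and count_dat_entries are hypothetical helper names, and it assumes an awk that can write to /dev/stderr (gawk and mawk do).

```shell
# count_fasta (hypothetical): pass FASTA from stdin to stdout unchanged
# and report the number of records on stderr.
count_fasta() {
    awk '/^>/ { n++ } { print } END { printf "%d FASTA records\n", n > "/dev/stderr" }'
}

# count_dat_entries (hypothetical): count entries in gzipped flat files
# by their ID lines, for comparison with the number reported above.
count_dat_entries() {
    gzip -dc "$@" | grep -c '^ID '
}
```

Placing count_fasta between the awk step and makeblastdb leaves the stream untouched, and the number it prints should match what count_dat_entries reports for the same .dat.gz files.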

 

Newbie question: How do I accept your second comment as the answer?

Reply written 5.2 years ago by Eliad
Elisabeth Gasteiger (Geneva) wrote, 5.2 years ago:

Another option to convert UniProt text format into FASTA format (with the "official" UniProt FASTA headers):

Install the Swissknife Perl module (http://swissknife.sourceforge.net/docs/) and then run one of the following two programs (both require Swissknife):

1) For a simple FASTA conversion that only includes canonical sequences, run the attached script as follows:

perl sp_to_fasta uniprot_bacteria.dat > uniprot_bacteria.fasta

2) For a FASTA file that includes alternative isoforms, download ftp://ftp.ebi.ac.uk/pub/software/uniprot/varsplic/varsplic.pl and run it locally, e.g. with a command line such as:

perl varsplic.pl -input uniprot_bacteria.dat -check_vsps -crosscheck -error varsplic.err -fasta varsplic_bacteria.fasta -which full

 

sp_to_fasta:

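# Purpose:
# Read a file in SP format, write it in FASTA format.
#
# Usage:
# sp_to_fasta SP_file > FASTA_file

use strict;
use warnings;

use IO::File;

use SWISS::Entry;

my $inputfile = $ARGV[0];
defined $inputfile or die "Usage: sp_to_fasta SP_file > FASTA_file\n";
my $fh = IO::File->new($inputfile, 'r') or
    die "Cannot open input file $inputfile: $!";

# Read one entry at a time: each entry ends with a line containing "//".
$/ = "\n//";
while (<$fh>) {
    s/\r//g;                              # drop DOS carriage returns
    (my $entry_txt = $_) =~ s/^\s+//;     # strip the newline left over from the previous entry
    next unless $entry_txt;               # skip the empty tail after the last entry
    $entry_txt .= "\n";
    my $entry = SWISS::Entry->fromText($entry_txt);
    print $entry->toFasta();
}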
 
Arnaud Ceol (Milan, Italy) wrote, 5.2 years ago:

You can also try splitting your download, for instance fetching the entries in groups of 500:

numentries=XXXX # Run a first query on the website to find the number of entries

for i in $(seq 0 500 "$numentries"); do
    wget -O "uniprot_$i.fasta" "http://www.uniprot.org/uniprot/?sort=score&desc=&compress=no&query=taxonomy:bacteria&fil=&limit=500&force=no&preview=true&format=fasta&offset=$i"
done

 

To create a query, just do a search on the UniProt website, click on download->preview and copy the URL.
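Once the chunks are downloaded they still have to be combined into one FASTA file before running makeblastdb. A rough sketch, assuming the uniprot_<offset>.fasta naming from the loop above; merge_chunks is a hypothetical helper name, and the empty-chunk warning is only a hint that a download may have failed (the last chunk can legitimately be empty).

```shell
# merge_chunks (hypothetical): concatenate uniprot_<offset>.fasta chunks
# in numeric offset order, warning about empty chunks.
merge_chunks() {
    # $1 = directory holding the chunks, $2 = output FASTA file
    dir=$1; out=$2
    : > "$out"
    for f in $(cd "$dir" && ls uniprot_*.fasta | sort -t_ -k2,2n); do
        [ -s "$dir/$f" ] || echo "warning: empty chunk $f" >&2
        cat "$dir/$f" >> "$out"
    done
}
```

For example, merge_chunks . ../uniprot_bacteria.fasta writes the merged file outside the chunk directory, which avoids the output being picked up by the uniprot_*.fasta pattern.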
