problem with setting up uniprot database for Diamond BLAST
3.3 years ago
slin023 • 0

Hello, I have run into a problem setting up the UniProt database for Diamond BLAST. I am following the commands from the tutorial [https://blobtoolkit.genomehubs.org/install/, https://github.com/blobtoolkit/blobtools2/issues/6], but they don't seem to work for me:

# extract and concatenate protein FASTA files
touch reference_proteomes.fasta.gz
find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | xargs cat >> reference_proteomes.fasta.gz

I did not receive any error message, but the command only creates a 0-byte "reference_proteomes.fasta.gz", so the output file is empty (see the screenshot).

Here is what the "*.fasta.gz" files look like in the "uniprot" folder (screenshot):

Any suggestions for how to revise this command to match the file names?

find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | xargs cat >> reference_proteomes.fasta.gz

Please let me know, thank you!

genome Assembly blobtool • 1.8k views

Did you run these commands successfully beforehand?

mkdir -p uniprot

wget -q -O uniprot/reference_proteomes.tar.gz \
  ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/$(curl \
  -vs ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/ 2>&1 | \
  awk '/tar.gz/ {print $9}')

cd uniprot

tar xf reference_proteomes.tar.gz

Are you inside the uniprot folder, and are you using a Linux command line?


Yes, those ran successfully. The file names you see in the screenshot are from after I ran tar xf reference_proteomes.tar.gz. The other commands also work for me:

echo "accession\taccession.version\ttaxid\tgi" > reference_proteomes.taxid_map

zcat */*.idmapping.gz | grep "NCBI_TaxID" | awk '{print $1 "\t" $1 "\t" $3 "\t" 0}' >> reference_proteomes.taxid_map


Run the following command:

find . -mindepth 2 | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' | wc -l

This will tell you whether the pipeline is finding any files to pass to cat, and how many.
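To see how -mindepth affects that count, here is a minimal sketch on a throwaway directory. The folder and file names are made up for illustration and are not taken from the actual UniProt download; the point is only that a file sitting directly in the starting directory is at depth 1, so -mindepth 2 skips it.

```shell
# Hypothetical scratch layout, used only to illustrate -mindepth
demo=$(mktemp -d)
mkdir -p "$demo/Eukaryota/UP000000001"

# One file directly in the top-level folder (depth 1) ...
touch "$demo/UP000326979_1803180.fasta.gz"
# ... and one file left inside nested subfolders (depth 3)
touch "$demo/Eukaryota/UP000000001/UP000000001_9606.fasta.gz"

# Count matches at any depth below the starting folder
top_and_below=$(find "$demo" -mindepth 1 -name '*.fasta.gz' | wc -l | tr -d ' ')
# Count only matches at depth 2 or deeper, as in the tutorial command
nested_only=$(find "$demo" -mindepth 2 -name '*.fasta.gz' | wc -l | tr -d ' ')

echo "any depth: $top_and_below, depth >= 2: $nested_only"
```

If the second number is smaller than the first on your own folder, some fasta.gz files are sitting at the top level, where -mindepth 2 will not see them.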


I ran it, and it prints "0".


So that is your problem: the pipeline is not finding any fasta.gz files without DNA or additional in their names, so cat has nothing to concatenate.

If you look in the subfolders of the uniprot folder (-mindepth 2 only reports entries at least two levels below the current directory), do you see fasta.gz files without DNA or additional in their names?

From your screenshot it looks like you do, so I don't know why the command is failing...

Try:

find . -mindepth 2 | grep "fasta.gz" | head


Those subfolders are all empty: I moved all of the files (at least 50-60k) out of the subfolders. And yes, there are some fasta.gz files without DNA or additional in their names. Take the files for UP000326979_1803180 as an example:

UP000326979_1803180_DNA.fasta.gz
UP000326979_1803180.fasta.gz
UP000326979_1803180.gene2acc.gz
UP000326979_1803180.idmapping.gz

However, I tried your command and it seems to be working now; the output is no longer empty. Thank you very much for your help!
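For anyone whose files have ended up at a different depth than the tutorial expects, a depth-independent variant of the concatenation step might look like the sketch below. This is not the tutorial's command: it swaps the grep filters for find's own -name tests (which match the file name rather than the whole path), and the scratch files and sequence IDs are invented so the example can be run anywhere.

```shell
# Hypothetical scratch data; note that this changes into a temporary directory
work=$(mktemp -d)
cd "$work"
printf '>P1\nMKV\n'  | gzip > UP000000001_1.fasta.gz
printf '>D1\nACGT\n' | gzip > UP000000001_1_DNA.fasta.gz
mkdir -p sub
printf '>P2\nMLA\n'  | gzip > sub/UP000000002_2.fasta.gz
printf '>A1\nMAA\n'  | gzip > sub/UP000000002_2_additional.fasta.gz

out=reference_proteomes.fasta.gz

# Select protein FASTA files at any depth, skipping DNA and "additional"
# files as well as the output file itself, then concatenate them.
# Concatenated gzip members still form a valid gzip stream.
find . -type f -name '*.fasta.gz' \
    ! -name '*DNA*' ! -name '*additional*' ! -name "$out" \
    -exec cat {} + > "$out"

# Count the FASTA headers that ended up in the combined file
n_seqs=$(gzip -dc "$out" | grep -c '^>')
echo "sequences in $out: $n_seqs"
```

Using > instead of >> means rerunning the command overwrites the output rather than appending a second copy of every proteome.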
