BBTools/BBMap/BBSuite's sketch.sh and sendsketch.sh are inconsistent
1
0
Entering edit mode
2.2 years ago
O.rka ▴ 390

I want to use sketch.sh and sendsketch.sh on a marine metagenome that have both bacteria and eukarya. I was testing it on a new Cylindrotheca (diatom eukaryote) assembly and it actually recognized it! Nothing else I've tried has been able to do this so I'm pretty stoked. However, I think there may be an error with sketch.sh b/c it looks like the single option is actually only grabbing the first sequence. Check out the Query: contig_1 and Query: scaffolds.drop_bacteria.fasta.

I've checked all of the default settings and they seem to line up.

(metagenomics_env) -bash-4.1$sketch.sh in=scaffolds.drop_bacteria.fasta out=scaffolds.drop_bacteria.fasta.sketch1 java -ea -Xmx5878m -Xms5878m -cp /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/opt/bbmap-38.58-0/current/ sketch.SketchMaker in=scaffolds.drop_bacteria.fasta out=scaffolds.drop_bacteria.fasta.sketch1 Executing sketch.SketchMaker [in=scaffolds.drop_bacteria.fasta, out=scaffolds.drop_bacteria.fasta.sketch1] Finished sketching: 3.140 seconds. Memory: max=6163m, total=6163m, free=6043m, used=120m Wrote 1 of 1 sketches. Time: 3.456 seconds. Reads Processed: 738 0.21k reads/sec Bases Processed: 82339k 23.82m bases/sec  #### Use the sketch from sketch.sh with sendsketch.sh (Query: contig_1) This was not the expected result. (metagenomics_env) -bash-4.1$ sendsketch.sh in=scaffolds.drop_bacteria.fasta.sketch1
Warning!  Cannot find blacklist_refseq_species_250.sketch /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/opt/bbmap-38.58-0/current/blacklist_refseq_species_250.sketch
java.lang.Exception
at dna.Data.findPath(Data.java:1244)
at sketch.Blacklist.refseqBlacklist(Blacklist.java:155)
at sketch.SendSketch.<init>(SendSketch.java:200)
at sketch.SendSketch.main(SendSketch.java:50)
Loaded 1 sketch in 0.036 seconds.

Query: contig_1 DB: RefSeq  SketchLen: 26366    Seqs: 738   Bases: 82339741 gSize: 55126726 File: scaffolds.drop_bacteria.fasta
WKID    KID ANI Complt  Contam  Matches Unique  TaxID   gSize   gSeqs   taxName
0.03%   0.01%   69.28%  100.00% 0.40%   3   2   556484  25154K  88  Phaeodactylum tricornutum CCAP 1055/1
0.04%   0.00%   70.87%  4.72%   0.62%   3   2   75366   1135M   31277   Sinocyclocheilus grahami
0.01%   0.00%   67.16%  31.41%  0.41%   3   1   6412    174854K 1991    Helobdella robusta
0.01%   0.00%   67.15%  31.71%  0.41%   3   0   743375  173229K 1483    Polistes dominula

Total Time:     0.908 seconds.


#### Create the sketch with sendsketch.sh ( Query: scaffolds.drop_bacteria.fasta)

This was the expected result.

(metagenomics_env) -bash-4.1$sendsketch.sh in=scaffolds.drop_bacteria.fasta outsketch=scaffolds.drop_bacteria.fasta.outsketch Warning! Cannot find blacklist_refseq_species_250.sketch /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/opt/bbmap-38.58-0/current/blacklist_refseq_species_250.sketch java.lang.Exception at dna.Data.findPath(Data.java:1244) at sketch.Blacklist.refseqBlacklist(Blacklist.java:155) at sketch.SendSketch.setFromAddress(SendSketch.java:284) at sketch.SendSketch.<init>(SendSketch.java:200) at sketch.SendSketch.main(SendSketch.java:50) Loaded 1 sketch in 7.015 seconds. Query: scaffolds.drop_bacteria.fasta DB: RefSeq SketchLen: 52788 Seqs: 738 Bases: 82339741 gSize: 55423446 File: scaffolds.drop_bacteria.fasta WKID KID ANI Complt Contam Matches Unique TaxID gSize gSeqs taxName 2.81% 0.01% 87.88% 100.00% 0.26% 5 5 2856 188927 2 Cylindrotheca closterium 0.04% 0.00% 70.87% 4.72% 0.62% 3 2 75366 1135M 31277 Sinocyclocheilus grahami 0.01% 0.01% 67.31% 100.00% 0.26% 3 2 556484 25154K 88 Phaeodactylum tricornutum CCAP 1055/1 0.01% 0.01% 66.69% 64.57% 0.31% 4 0 121224 85172K 1882 Pediculus humanus corporis 0.01% 0.01% 66.79% 100.00% 0.26% 3 1 296543 30694K 64 Thalassiosira pseudonana CCMP1335 0.01% 0.00% 67.16% 31.41% 0.41% 3 1 6412 174854K 1991 Helobdella robusta 0.01% 0.00% 67.15% 31.71% 0.41% 3 0 743375 173229K 1483 Polistes dominula Total Time: 7.746 seconds.  #### Use the sketch from sendsketch.sh with sendsketch.sh again (Query: contig_1) This also was the expected result (metagenomics_env) -bash-4.1$ sendsketch.sh in=scaffolds.drop_bacteria.fasta.outsketch
Warning!  Cannot find blacklist_refseq_species_250.sketch /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/opt/bbmap-38.58-0/current/blacklist_refseq_species_250.sketch
java.lang.Exception
at dna.Data.findPath(Data.java:1244)
at sketch.Blacklist.refseqBlacklist(Blacklist.java:155)
at sketch.SendSketch.<init>(SendSketch.java:200)
at sketch.SendSketch.main(SendSketch.java:50)
Loaded 1 sketch in 0.046 seconds.

Query: scaffolds.drop_bacteria.fasta    DB: RefSeq  SketchLen: 52788    Seqs: 738   Bases: 82339741 gSize: 55423446 File: scaffolds.drop_bacteria.fasta
WKID    KID ANI Complt  Contam  Matches Unique  TaxID   gSize   gSeqs   taxName
2.81%   0.01%   87.88%  100.00% 0.26%   5   5   2856    188927  2   Cylindrotheca closterium
0.04%   0.00%   70.87%  4.72%   0.62%   3   2   75366   1135M   31277   Sinocyclocheilus grahami
0.01%   0.01%   67.31%  100.00% 0.26%   3   2   556484  25154K  88  Phaeodactylum tricornutum CCAP 1055/1
0.01%   0.01%   66.69%  64.57%  0.31%   4   0   121224  85172K  1882    Pediculus humanus corporis
0.01%   0.01%   66.79%  100.00% 0.26%   3   1   296543  30694K  64  Thalassiosira pseudonana CCMP1335
0.01%   0.00%   67.16%  31.41%  0.41%   3   1   6412    174854K 1991    Helobdella robusta
0.01%   0.00%   67.15%  31.71%  0.41%   3   0   743375  173229K 1483    Polistes dominula

Total Time:     0.780 seconds.

bbmap kmer sketch bbtools bbsuite • 2.4k views
1
Entering edit mode

sketch.sh and sendsketch.sh are two different programs with different uses. Have you looked at the user guide: bbmap/docs/guides/BBSketchGuide.txt

sketch.sh

Description:  Creates one or more sketches from a fasta file,
optionally annotated with taxonomic information.


sendsketch.sh

Description:  Compares query sketches to reference sketches hosted on a
remote server via the Internet.  The input can be sketches made by sketch.sh,
or fasta/fastq files from which SendSketch will generate sketches.
Only sketches will sent, not sequences.

0
Entering edit mode

I haven’t been able to find where that installs with Conda. What I was trying to show was that the sketch made by sketch.sh was different than the sketch created by sendsketch.sh with outsketch argument.

0
Entering edit mode

The document should be in miniconda3/envs/bbmap/opt/bbmap-38.58-0/docs/guides/BBSketchGuide.txt.

0
Entering edit mode

Thanks, I found it for my conda environment. Does my question make sense tho?

1
Entering edit mode
2.2 years ago

The important thing here is:

Query: scaffolds.drop_bacteria.fasta DB: RefSeq SketchLen: 52788 Seqs: 738 Bases: 82339741

In every case it used all 738 sequences and 82Mbp, and made one sketch, though the sketch size was different in the two cases since the RefSeq server uses larger-than-normal sketches for greater sensitivity, so SendSketch will generate larger than normal sketches when the address is (the default) RefSeq server. However, in different modes sometimes the metadata gets generated differently if it's not fully specified. Specifically, sketch.sh kind of assumes everything has full taxonomic information since it's intended to be used for making references, while SendSketch assumes you don't have any taxonomic information, since why else would you run it? So "query" falls back to something else; in one case it's the filename and in the other it's the name of the first record. I'll try to make them more consistent to reduce confusion.

In both cases the number of hits is really low so the query doesn't seem to be all that closely related to any reference - it's basically at the very limit of BBSketch's sensitivity with these settings. The problem is not really BBSketch, though, but the fact that RefSeq's representation of Cylindrotheca closterium only has 188927 bp... you might get a better hit with a large sketch size:

sendsketch.sh in=scaffolds.drop_bacteria.fasta sizemult=30


Unfortunately, most JGI web services, including the Sketch servers, are down until ~Tuesday due to electrical work.

0
Entering edit mode

Oh wow, I guess that solves my issue from https://stackoverflow.com/questions/57224929/how-can-i-fix-a-unable-to-find-valid-certification-path-to-requested-target-er and my comment on C: BBSketch - A Tool for Rapid Sequence Comparison. So is RefSeq the default database that is searched or does it scour all of the databases except for those that are blacklisted?

# Download the taxonomy db
/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/opt/bbmap-38.58-0/pipelines/fetchTaxonomy.sh
mkdir taxonomy_db; mv * taxonomy_db
# Create the NT database to run sendsketch.sh locally (to replicate the exact settings from the server)
/usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/opt/bbmap-38.58-0/pipelines/fetchNt.sh taxpath=taxonomy_db


but I should probably wait until the servers are active again.

0
Entering edit mode

OK, all servers are back up again!

RefSeq is the default server. You can specify: refseq, protein, nt, or ribo (aka silva). It will only connect to one server each time you run it.

0
Entering edit mode

Everything is up and running again. Is there a way to download the database used for this? I tried following the fetch* script but I wasn't sure if it was the correct way.

0
Entering edit mode

I just revised all of the fetch scripts in 38.61b which I released today. You should be able to do this:

Run fetchTaxonomy.sh Then modify fetchNt.sh to change the line TAXPATH="auto" to point to wherever you downloaded the taxonomy data. Then run fetchNt.sh.

That will give you nt which is not as good as refseq for this purpose since the references are less complete, but it has the advantage of being a lot smaller. You can then run: comparesketch.sh in=query.fa taxa*.sketch

Or, you can download RefSeq, which will take longer, with fetchRefSeq.sh and then sketchRefSeq.sh.

0
Entering edit mode

Thanks, I'm trying out the refseq download right now. I noticed that when I run sendsketch.sh using qsub on our server, it hangs. Do you know if there is another setting I need to adjust when using a remote server?

0
Entering edit mode

If your cluster nodes don't have direct internet access (or no access at all) then that would be expected.

0
Entering edit mode

I just checked right now and the cluster has internet access. I was able to download a remote URL. Do you know why my process could be hanging? I'm trying to either: (1) Run this on a a remote server; or (2) Download the database so I can run locally (not preferred). Currently, I'm stuck on both approaches.

0
Entering edit mode

I've successfully downloaded the taxonomy db:  /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/opt/bbmap-38.58-0/pipelines/fetchTaxonomy.sh  I've tried the following:

nano /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/opt/bbmap-38.58-0/pipelines/fetchRefSeq.sh
taxpath=/usr/local/scratch/METAGENOMICS/jespinoz/db/bbtools_db/taxonomy_db


Then I ran the following:  /usr/local/devel/ANNOTATION/jespinoz/anaconda3/envs/metagenomics_env/opt/bbmap-38.58-0/pipelines/fetchRefSeq.sh  and received this error:

https://pastebin.com/PZZjqwDC