Creating kraken2 custom database
25 days ago

I am trying to create a custom kraken2 DB from Alveolata sequences that I fetched from NCBI. To simplify things, I reduced the input to just the first sequence, for example:

>NW_027179382.1
GTAACCCGGTTGACTCTGCCGGTAGTATATGCTTGTCTCAAAGATTAAGCCATGCATGCGAAAGTATAAG
ACTTTATACGTCGAAACCGCAGACGGCTCATTAAAACAGTCATGATCTACACGCATATTGATCACACGGC
TAACCGTGGTAATTCTGGGGATAATACGTGCAGCTTCGGCTACTCTTTTTCAGAGTTGTTGTAGAAATCA
GCATTCACACTATCACCATTTGAATAAGTCTACAATTCAATTGCTTGTCAATGATGCGTTTGAATATCTG
ATCTATCAGTTCTGACGGTAGTGTAGTGGACTACCGTGACTGTAACGGATAACGGAGAATTAGGGTTCGA
TTCCGGAGAAGGAGCCTTAAAAACAGCTACTACATCTAAGGAAGGCAGCAGGCGCGCAAATTGCTCAATG
AAGGTCATTCGAAGCAGTGACAAGAAATATCAAAGCCAGCTTTCAGCTCGCTATTGATCTGAGGGTAATT
TAAAAACTTACTCGATTATTATTGGATCGCTAGTGGGGTGCCAGCCGGAGCGGTAATACCTCCTCCAATA
GTGTATGCTAAAATTGTTGCAGTTAAAACGCTCGTAGTCGTAGTTTCTTGACACTTTCAGCATGCCTAAC

Then I do:

kraken2-build --add-to-library alveolata.fasta --db Alveolata
kraken2-build --download-taxonomy --db Alveolata --threads 32
kraken2-build --build --db Alveolata --threads 32

But then the database seems empty:

bash-5.1$ kraken2-inspect --db Alveolata
Database options: nucleotide db, k = 35, l = 31
Spaced mask = 11111111111111111111111111111111110011001100110011001100110011
Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
Total taxonomy nodes: 116
Table size: 0
Table capacity: 12068
Min clear hash value = 0

What am I missing?


"What am I missing?"

Try assigning the taxonomy ID explicitly (from the manual):

Sequences not downloaded from NCBI may need their taxonomy information assigned explicitly. This can be done using the string kraken:taxid|XXX in the sequence ID, with XXX replaced by the desired taxon ID. For example, to put a known adapter sequence in taxon 32630 ("synthetic construct"), you could use the following:

>sequence16|kraken:taxid|32630  Adapter sequence
CAAGCAGAAGACGGCATACGAGATCTTCGAGTGACTGGAGTTCCTTGGCACCCGAGAATTCCA

The kraken:taxid string must begin the sequence ID or be immediately preceded by a pipe character (|). Explicit assignment of taxonomy IDs in this manner will override the accession number mapping provided by NCBI.
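
The tagging rule quoted above can be applied to a whole FASTA file with a small filter. This is only a sketch: the function name `add_kraken_taxid` is made up for illustration, and it assumes every record in the file belongs to the same taxon.

```bash
# Hypothetical helper (not part of Kraken 2): append "|kraken:taxid|<taxid>"
# to the sequence ID of every FASTA header read from stdin.
# The sequence ID is the text between ">" and the first whitespace;
# any description after it is kept unchanged.
# Usage: add_kraken_taxid TAXID < in.fasta > out.fasta
add_kraken_taxid() {
    awk -v taxid="$1" '
        /^>/ {
            suffix = "|kraken:taxid|" taxid
            if (match($0, /[ \t]/)) {
                # Header has a description: insert the tag before it.
                print substr($0, 1, RSTART - 1) suffix substr($0, RSTART)
            } else {
                # Header is just the ID.
                print $0 suffix
            }
            next
        }
        { print }   # sequence lines pass through untouched
    '
}
```

For example, `add_kraken_taxid 32630 < adapters.fasta > adapters.tagged.fasta` (file names are placeholders) would turn `>sequence16 Adapter sequence` into `>sequence16|kraken:taxid|32630 Adapter sequence`.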

Edit: This is not required as long as the headers contain NCBI accession numbers.


Thank you. However, Kraken2 is supposed to fetch the taxonomy information from NCBI. From the manual:

Each sequence's ID (the string between the > and the first whitespace character on the header line) must contain either an NCBI accession number to allow Kraken 2 to lookup the correct taxa, or an explicit assignment of the taxonomy ID using kraken:taxid (see below).
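
To see exactly which strings count as sequence IDs under that definition, the headers can be listed with a one-line filter. A minimal sketch; `fasta_ids` is a made-up name, not a Kraken 2 tool.

```bash
# Hypothetical helper: print the sequence ID of each FASTA record on stdin,
# i.e. the text between ">" and the first whitespace on the header line.
# Usage: fasta_ids < in.fasta
fasta_ids() {
    awk '/^>/ { sub(/^>/, ""); print $1 }'
}
```

Running `fasta_ids < alveolata.fasta` should print one ID per record; each should be either an NCBI accession or carry a kraken:taxid tag.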

How can I get this to work, or does the accession-based lookup really not work?

25 days ago
GenoMax 152k

Looks like the process does (should) work. Using the example sequence you provided above (NW_027179382.1), I was able to build the database. Check whether you ended up with corrupt taxonomy data.

$ kraken2-build --build --db test
Creating sequence ID to taxonomy ID map (step 1)...
Found 1/1 targets, searched through 242806973 accession IDs, search complete.
Sequence ID to taxonomy ID map complete. [25.843s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 48272 bytes
Capacity estimation complete. [0.018s]
Building database files (step 3)...
Taxonomy parsed and converted.
 .......

$ kraken2-inspect --db test
# Database options: nucleotide db, k = 35, l = 31
# Spaced mask = 11111111111111111111111111111111110011001100110011001100110011
# Toggle mask = 1110001101111110001010001100010000100111000110110101101000101101
# Total taxonomy nodes: 13
# Table size: 186
# Table capacity: 12068
# Min clear hash value = 0
100.00  186     0       R       1       root
100.00  186     0       R1      131567    cellular organisms
100.00  186     0       R2      2759        Eukaryota
100.00  186     0       R3      2698737       Sar
100.00  186     0       R4      33630           Alveolata
100.00  186     0       P       5794              Apicomplexa
100.00  186     0       C       1280412             Conoidasida
100.00  186     0       C1      35086                 Gregarinasina
100.00  186     0       O       35087                   Eugregarinorida
100.00  186     0       F       947094                    Porosporidae
100.00  186     0       G       947096                      Porospora
100.00  186     186     S       2853592                       Porospora cf. gigantea B
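
When comparing builds, the header field that matters here is "Table size": 186 in this working build, 0 in the empty one. It can be pulled out of the inspect output with a small filter. A sketch only; `inspect_table_size` is a made-up helper name.

```bash
# Hypothetical helper: read kraken2-inspect output on stdin and print the
# value of the "Table size" header line.
# Works whether or not the header lines start with "# ".
inspect_table_size() {
    awk -F': *' '/Table size/ { print $2; exit }'
}
```

Usage would be along the lines of `kraken2-inspect --db test | inspect_table_size`; a result of 0 means no minimizers were stored in the hash table.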

Thank you, but it still does not work. I am using Kraken version 2.1.3 and I am just running:

kraken2-build --add-to-library deleteme.fasta --db Alveolata
kraken2-build --download-taxonomy --db Alveolata
kraken2-build --build --db Alveolata --threads 32

Is that what you did?


Yes. Same version of Kraken. I did not use threads since this is a single sequence.

Did the following complete without errors for you? You may want to delete the database folder and try again.

$ kraken2-build --download-taxonomy --db test 
Downloading nucleotide gb accession to taxon map... done.
Downloading nucleotide wgs accession to taxon map... done.
Downloaded accession to taxon map(s)
Downloading taxonomy tree data... done.
Uncompressing taxonomy data... done.
Untarring taxonomy tree data... done.
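
Beyond watching these messages, a truncated or failed download can be ruled out by checking the files on disk. The sketch below assumes the usual layout for a nucleotide database, with names.dmp, nodes.dmp and the two accession2taxid maps under the database's taxonomy/ folder; those file names are an assumption about your installation, and `check_taxonomy_dir` is a made-up helper.

```bash
# Hypothetical helper: report any expected taxonomy file that is missing
# or empty. Returns 0 if all files are present and non-empty, 1 otherwise.
# ASSUMPTION: the file names below match what --download-taxonomy left in
# <db>/taxonomy/ for a nucleotide database; adjust if your layout differs.
# Usage: check_taxonomy_dir DB_DIR
check_taxonomy_dir() {
    ctd_dir="$1/taxonomy"
    ctd_status=0
    for ctd_file in names.dmp nodes.dmp \
                    nucl_gb.accession2taxid nucl_wgs.accession2taxid; do
        if [ ! -s "$ctd_dir/$ctd_file" ]; then
            echo "missing or empty: $ctd_dir/$ctd_file"
            ctd_status=1
        fi
    done
    return "$ctd_status"
}
```

For example, `check_taxonomy_dir test && echo "taxonomy files look present"`. This only checks presence and size, not content, so it cannot prove the files are intact.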

Thank you, yes, each of the steps finishes without problems, and I have run this several times from scratch:

Masking low-complexity regions of new file... done.
Added "deleteme.fasta" to library (test)
Downloading nucleotide gb accession to taxon map... done.
Downloading nucleotide wgs accession to taxon map... done.
Downloaded accession to taxon map(s)
Downloading taxonomy tree data... done.
Uncompressing taxonomy data... done.
Untarring taxonomy tree data... done.
Creating sequence ID to taxonomy ID map (step 1)...
Found 1/1 targets, searched through 242806973 accession IDs, search complete.
Sequence ID to taxonomy ID map complete. [12.977s]
Estimating required capacity (step 2)...
Estimated hash table requirement: 48272 bytes
Capacity estimation complete. [0.011s]
Building database files (step 3)...
Taxonomy parsed and converted.
CHT created with 4 bits reserved for taxid.
Completed processing of 0 sequences, 0 bp
Writing data to disk...  complete.
Database files completed. [4.332s]
Database construction complete. [Total: 17.341s]

I will try installing kraken again...


Found 1/1 targets, searched through 242806973 accession IDs, search complete.

So it is able to find that accession but is not able to complete the processing of the sequence.

Completed processing of 0 sequences, 0 bp
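
Since the build sees 0 sequences and 0 bp, one thing worth checking is how many records and bases are actually in the FASTA files involved. A minimal sketch; `fasta_stats` is a made-up helper, and the library path in the usage note is an assumption about where --add-to-library stores its (masked) copy.

```bash
# Hypothetical helper: print "<records> <bases>" for FASTA data on stdin.
# Only lines that do not start with ">" count as sequence, so a record whose
# sequence sits on the header line is reported with 0 bases.
# Usage: fasta_stats < in.fasta
fasta_stats() {
    awk '
        /^>/ { records++; next }
        { gsub(/[ \t\r]/, ""); bases += length($0) }   # ignore whitespace
        END { print records + 0, bases + 0 }
    '
}
```

Compare `fasta_stats < deleteme.fasta` with `cat Alveolata/library/added/*.fna | fasta_stats` (assumed location of the library copy). If the input has bases but the library copy reports `0 0` or `1 0`, the sequence was lost before the build step, for example during masking.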
