COBS Indexing genomes - understand the output
1
0
Entering edit mode
22 months ago
davidmaimoun ▴ 50

Hi, I am in charge to find presence of specific genes in 600.000 Salmonella's genomes. Some people advised me to use COBS for the indexing So I used it on few genomes just for training

But I don't really understand the output...

I copied a subsequence (55 bp) from one of my genomes, and run COBS to see if it get it. In the output I got 24 (see bellow).

**output:**
SRR18349609 24
SRR18349610 24
SRR18349611 24

Is that mean it got 24 hits on my 55bp query?

And Is it possible to get more information in the output, like e/p value, location etc.. (like blastn)

Thank you for all!!

indexing cobs genomes COBS • 899 views
ADD COMMENT
2
Entering edit mode
22 months ago

An examlple

$ seqkit  stats t.fasta 
file     format  type  num_seqs  sum_len  min_len  avg_len  max_len
t.fasta  FASTA   DNA          1       60       60       60       60

$ more t.fasta.cobs 
*test   31409
SAMEA18844918   30
SAMN08865341    30

test is the query ID, and31409 is the number of matched records in the COBS index. 30 is the matched k-mers (k=31 by default in COBS), 30 = 60 - 31 + 1 means 100% matched.

For a query of 50 bp, the maximal number of k-mers is 55-31+1 = 25, and SRR18349609 got 24 k-mers matched.

ADD COMMENT
1
Entering edit mode

You may also try KMCP which uses an index structure similar to COBS while with a faster searching speed. While the speedup is not obvious in the 661K dataset cause there are a huge number of similar genomes which results in too many hits for a query, therefore writing results becomes the performance bottleneck.

$ zcat t.fasta.kmcp.tsv.gz  | head -n 4 | column -t 
#query  qLen  qKmers  FPR         hits   target                 chunkIdx  chunks  tLen     kSize  mKmers  qCov    tCov    jacc    queryIdx
test    60    30      2.7536e-05  31406  SAMEA18844918.contigs  0         1       518641   31     30      1.0000  0.0001  0.0001  0
test    60    30      2.7536e-05  31406  SAMN08865341.contigs   0         1       526349   31     30      1.0000  0.0001  0.0001  0
test    60    30      2.7536e-05  31406  SAMEA1462952.contigs   0         1       1199925  31     30      1.0000  0.0000  0.0000  0
ADD REPLY
0
Entering edit mode

It is looks like great! I'll try it!

Thank you very much

ADD REPLY
0
Entering edit mode

Thanks it is very helpful!

And if there is multiple matchs in the same genome (repeated sequences), it will change the output?

ADD REPLY
1
Entering edit mode

No, these kinds of methods can't tell the number of sequence copies or sequence locations. They just tell whether the query is contained in the subject, and there are some false positives (see BIGSI/COBS paper).

ADD REPLY
0
Entering edit mode

God bless you

Thank you for all!

ADD REPLY

Login before adding your answer.

Traffic: 2807 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6