Question

COBS Indexing genomes - understand the output

0

Entering edit mode

22 months ago

davidmaimoun ▴ 50

Hi, I am in charge to find presence of specific genes in 600.000 Salmonella's genomes. Some people advised me to use COBS for the indexing So I used it on few genomes just for training

But I don't really understand the output...

I copied a subsequence (55 bp) from one of my genomes, and run COBS to see if it get it. In the output I got 24 (see bellow).

**output:**
SRR18349609 24
SRR18349610 24
SRR18349611 24

Is that mean it got 24 hits on my 55bp query?

And Is it possible to get more information in the output, like e/p value, location etc.. (like blastn)

Thank you for all!!

indexing cobs genomes COBS • 899 views

ADD COMMENT • link 22 months ago by davidmaimoun ▴ 50

score 2 · Accepted Answer · 2022-06-21

2

Entering edit mode

22 months ago

shenwei356 8.4k

An examlple

$ seqkit  stats t.fasta 
file     format  type  num_seqs  sum_len  min_len  avg_len  max_len
t.fasta  FASTA   DNA          1       60       60       60       60

$ more t.fasta.cobs 
*test   31409
SAMEA18844918   30
SAMN08865341    30

test is the query ID, and31409 is the number of matched records in the COBS index. 30 is the matched k-mers (k=31 by default in COBS), 30 = 60 - 31 + 1 means 100% matched.

For a query of 50 bp, the maximal number of k-mers is 55-31+1 = 25, and SRR18349609 got 24 k-mers matched.

ADD COMMENT • link 22 months ago by shenwei356 8.4k

1

Entering edit mode

You may also try KMCP which uses an index structure similar to COBS while with a faster searching speed. While the speedup is not obvious in the 661K dataset cause there are a huge number of similar genomes which results in too many hits for a query, therefore writing results becomes the performance bottleneck.

$ zcat t.fasta.kmcp.tsv.gz  | head -n 4 | column -t 
#query  qLen  qKmers  FPR         hits   target                 chunkIdx  chunks  tLen     kSize  mKmers  qCov    tCov    jacc    queryIdx
test    60    30      2.7536e-05  31406  SAMEA18844918.contigs  0         1       518641   31     30      1.0000  0.0001  0.0001  0
test    60    30      2.7536e-05  31406  SAMN08865341.contigs   0         1       526349   31     30      1.0000  0.0001  0.0001  0
test    60    30      2.7536e-05  31406  SAMEA1462952.contigs   0         1       1199925  31     30      1.0000  0.0000  0.0000  0

ADD REPLY • link 22 months ago by shenwei356 8.4k

0

Entering edit mode

It is looks like great! I'll try it!

Thank you very much

ADD REPLY • link 22 months ago by davidmaimoun ▴ 50

0

Entering edit mode

Thanks it is very helpful!

And if there is multiple matchs in the same genome (repeated sequences), it will change the output?

ADD REPLY • link 22 months ago by davidmaimoun ▴ 50

1

Entering edit mode

No, these kinds of methods can't tell the number of sequence copies or sequence locations. They just tell whether the query is contained in the subject, and there are some false positives (see BIGSI/COBS paper).

ADD REPLY • link 22 months ago by shenwei356 8.4k

0

Entering edit mode

God bless you

Thank you for all!

ADD REPLY • link 22 months ago by davidmaimoun ▴ 50