Can anybody explain what this statement says?
2
0
4.7 years ago
saranpons3 ▴ 70

Hello members, the following statement is taken from this paper: http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-14-160

we should not distinguish between a k-mer and its reversed complement, and by the “canonical k-mer” we will mean the lexicographically smaller of the two.

Can anybody explain the quoted statement with an example? I know what a k-mer is and what counting k-mers in a read data set means, but I'm not able to understand the statement above. Thanks in advance.

kmer kmer counter • 1.2k views
3
4.7 years ago

For example, the 3-mer TAC is the reverse complement of the 3-mer GTA. As a sequence is scanned for 3-mers, a word and its reverse complement (which usually differ from each other, except in the case of reverse palindromes) are treated as the same word, and their occurrences are added into a single count. To save space and for efficiency, each such pair is stored only once. So the words reported are a mixture of forward and reverse-complement forms relative to the input sequence. Which of the two forms is stored and printed is determined by alphabetical (lexicographic) order: the smaller one is kept. Since G sorts before T, GTA is the "canonical 3-mer", and every occurrence of TAC is stored and counted as GTA.
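
To make the TAC/GTA example concrete, here is a minimal Python sketch of the rule as the paper's sentence describes it. The function names `reverse_complement` and `canonical` are my own (hypothetical), and the sketch assumes uppercase A/C/G/T only; it is not taken from any particular k-mer counter.

```python
# Complement table for the four DNA bases (assumes uppercase A/C/G/T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(kmer: str) -> str:
    """Complement every base, then reverse the string."""
    return kmer.translate(COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


print(reverse_complement("TAC"))  # GTA
print(canonical("TAC"))           # GTA
print(canonical("GTA"))           # GTA
print(canonical("ACGT"))          # ACGT (a reverse palindrome is its own reverse complement)
```

Both TAC and GTA map to the same key, GTA, so a counter that uses `canonical(kmer)` as its key keeps one entry for the pair.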

0

I tried to run BFCounter, a k-mer counting program, on the following FASTQ data set, which contains only 2 reads, each 49 bp long. (This is a toy data set I'm using to understand BFCounter.)

@SRR292770.1 FCB067LABXX:4:1101:1155:2103/1
GGAGTCATCATACGGCGCTGATCGAGACCGCAACGACTTTAAGGTCGCA
+
FFFFCFGDCGGGFCGBGFFFAEGFG;B7A@GEFBFGGFFGFGEFCFFFB
@SRR292770.3 FCB067LABXX:4:1101:1166:2158/1
GGAGTCATCATACGGCGCTGATCGAGACCGCAACGACTTTAAGGTCGCA
+
GFGGDGGFGGGG@GFGGFG@EFDFFEGFDE?>BC9>.:*>8<4</;?A>


When I ran BFCounter on the data set, I chose k = 25. As there are 49 bp in a read, each read yields 49 - 25 + 1 = 25 k-mers. Both reads have the same sequence, so there should be 25 distinct k-mers in total, and I'm getting 25 k-mers from BFCounter. The output of BFCounter is as follows.

AAAAAAAAAAAAGTTGTTCTCGTCC           2
GCGACCTTAAAGTCGTGACGGACGA   2
CGTCCGTCACGACTTTAAGGTCGCA           2
GACCTTAAAGTCGTGACGGACGAGA   2
AAAAAAAAAAAAAAAAGTTGTTCTC           2
AAAAAGTTGTTCTCGTCCGTCACGA           2
AAAAAAAAAAAAAAGTTGTTCTCGT           2
AAAAAAAAAAAGTTGTTCTCGTCCG           2
AAAAAAAAAAGTTGTTCTCGTCCGT           2
AAAAAAAAGTTGTTCTCGTCCGTCA           2
AAAAAAGTTGTTCTCGTCCGTCACG           2
GTTGTTCTCGTCCGTCACGACTTTA           2
AAAAGTTGTTCTCGTCCGTCACGAC           2
AAAAAAAAAAAAAAAGTTGTTCTCG           2
ACCTTAAAGTCGTGACGGACGAGAA   2
AAAGTTGTTCTCGTCCGTCACGACT           2
AAAGTCGTGACGGACGAGAACAACT   2
CCTTAAAGTCGTGACGGACGAGAAC   2
AAAAAAAAAGTTGTTCTCGTCCGTC           2
AAAAAAAGTTGTTCTCGTCCGTCAC           2
AAGTCGTGACGGACGAGAACAACTT   2
AAAAAAAAAAAAAGTTGTTCTCGTC           2
CGACCTTAAAGTCGTGACGGACGAG   2
CTTAAAGTCGTGACGGACGAGAACA   2
TTAAAGTCGTGACGGACGAGAACAA   2


The number of k-mers BFCounter is producing is 25, which is correct. But when I looked at the k-mers themselves, they don't look correct to me. Can you tell me why there is this difference?
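
One way to check a counter's output on a toy data set like this is to compute the canonical k-mers independently and compare the two lists. The following is a minimal sketch, not BFCounter's implementation: `canonical` and `count_canonical_kmers` are hypothetical helper names, the read sequence is copied from the FASTQ records above, and the code assumes uppercase A/C/G/T only.

```python
from collections import Counter

# Complement table for the four DNA bases (assumes uppercase A/C/G/T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


def count_canonical_kmers(reads, k):
    """Slide a window of width k over every read and count canonical forms."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[canonical(read[i:i + k])] += 1
    return counts


# Both reads in the toy FASTQ file carry the same 49 bp sequence.
read = "GGAGTCATCATACGGCGCTGATCGAGACCGCAACGACTTTAAGGTCGCA"
counts = count_canonical_kmers([read, read], k=25)

print(len(read) - 25 + 1)      # windows per read: 25
print(sum(counts.values()))    # windows over both reads: 50
for kmer, n in sorted(counts.items()):
    print(kmer, n)
```

Every k-mer printed by this sketch is either a substring of the read or the reverse complement of one, which is the property to look for in the real output.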

1

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post, that button is in your toolbar.

0

They look suspicious. I'm not familiar with the software. It will be a lot easier to understand the output if you use small k-mers (for example k = 1 or 2).

0

Could you please explain what columns 2 and 3 mean? I assume they're k-mer counts...

EDIT: @wouter I've seen what you wrote and deleted :D For clarification: I mean columns 2 and 3 because I am referring to column 1 as the actual k-mers.

1

Actually, the output produced by BFCounter has only two columns. The 1st column is the k-mer and the 2nd column is the k-mer count. On a few lines the 2nd column (the k-mer count) shows up as a 3rd column because of an alignment issue.

0
4.7 years ago
Macspider ★ 3.4k

we should not distinguish between a k-mer and its reversed complement

When you count k-mers, you shouldn't count a k-mer and its reverse complement as two different k-mers (they might come from the same genomic region, read from opposite strands).

we will mean the lexicographically smaller of the two

This is a bit awkward to read, since they have the same word size. Waiting for others to chime in.

EDIT: On second thought, I think they're stating the criterion they built into their code to tell the computer which of the two reverse-complemented k-mers to keep in this situation: lexicographical ordering.

https://en.wikipedia.org/wiki/Lexicographical_order
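
A small Python sketch of why this criterion is useful, assuming uppercase A/C/G/T only; the helper names and the toy sequence `GTACCA` are hypothetical. For two words of the same length, lexicographic order is just the order of the first differing letter, and keeping the smaller word gives the same counts whichever strand was sequenced.

```python
from collections import Counter

# Complement table for the four DNA bases (assumes uppercase A/C/G/T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_counts(seq: str, k: int) -> Counter:
    """Count k-mers of seq, keyed by the lexicographically smaller strand form."""
    return Counter(
        min(seq[i:i + k], reverse_complement(seq[i:i + k]))
        for i in range(len(seq) - k + 1)
    )


# Same word size, so the first differing letter decides: G sorts before T.
print("GTA" < "TAC")  # True

forward = "GTACCA"                      # hypothetical toy sequence
reverse = reverse_complement(forward)   # the same region read from the other strand

# Canonical counting gives identical tables for both strands.
print(canonical_counts(forward, 3) == canonical_counts(reverse, 3))  # True
```

Without the canonical step, the two strands of the same region would produce different k-mer tables.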

A note on your post: the title is 3 lines long. Could you edit it, keeping only "Question: Can anybody explain what this statement says?". It's informative enough.

0

I tried to run BFCounter, a k-mer counting program, on the following FASTQ data set, which contains only 2 reads, each 49 bp long. (This is a toy data set I'm using to understand BFCounter.)

@SRR292770.1 FCB067LABXX:4:1101:1155:2103/1
GGAGTCATCATACGGCGCTGATCGAGACCGCAACGACTTTAAGGTCGCA
+
FFFFCFGDCGGGFCGBGFFFAEGFG;B7A@GEFBFGGFFGFGEFCFFFB
@SRR292770.3 FCB067LABXX:4:1101:1166:2158/1
GGAGTCATCATACGGCGCTGATCGAGACCGCAACGACTTTAAGGTCGCA
+
GFGGDGGFGGGG@GFGGFG@EFDFFEGFDE?>BC9>.:*>8<4

When I ran BFCounter on the data set, I chose k = 25. As there are 49 bp in a read, each read yields 49 - 25 + 1 = 25 k-mers. Both reads have the same sequence, so there should be 25 distinct k-mers in total, and I'm getting 25 k-mers from BFCounter. The output of BFCounter is as follows.

AAAAAAAAAAAAGTTGTTCTCGTCC 2
GCGACCTTAAAGTCGTGACGGACGA 2
CGTCCGTCACGACTTTAAGGTCGCA 2
GACCTTAAAGTCGTGACGGACGAGA 2
AAAAAAAAAAAAAAAAGTTGTTCTC 2
AAAAAGTTGTTCTCGTCCGTCACGA 2
AAAAAAAAAAAAAAGTTGTTCTCGT 2
AAAAAAAAAAAGTTGTTCTCGTCCG 2
AAAAAAAAAAGTTGTTCTCGTCCGT 2
AAAAAAAAGTTGTTCTCGTCCGTCA 2
AAAAAAGTTGTTCTCGTCCGTCACG 2
GTTGTTCTCGTCCGTCACGACTTTA 2
AAAAGTTGTTCTCGTCCGTCACGAC 2
AAAAAAAAAAAAAAAGTTGTTCTCG 2
ACCTTAAAGTCGTGACGGACGAGAA 2
AAAGTTGTTCTCGTCCGTCACGACT 2
AAAGTCGTGACGGACGAGAACAACT 2
CCTTAAAGTCGTGACGGACGAGAAC 2
AAAAAAAAAGTTGTTCTCGTCCGTC 2
AAAAAAAGTTGTTCTCGTCCGTCAC 2
AAGTCGTGACGGACGAGAACAACTT 2
AAAAAAAAAAAAAGTTGTTCTCGTC 2
CGACCTTAAAGTCGTGACGGACGAG 2
CTTAAAGTCGTGACGGACGAGAACA 2
TTAAAGTCGTGACGGACGAGAACAA 2

The number of k-mers BFCounter is producing is 25, which is correct. But when I looked at the k-mers themselves, they don't look correct to me. Can you tell me why there is this difference?

0

You posted the same reply to @a.zielinski; let's continue there so we avoid splitting the thread into subthreads (#readability). ;)