Is there a tool that can count k-mers containing invalid characters? For example, given the sequence "ANTTA" with K=3, there are three k-mers: "ANT", "NTT" and "TTA".
Is this homework? Which do you prefer, a script or existing tools?
Why do you want to do this?
If you'd like to write a script, just slide over the sequences with a window of size K and a step of 1, and check whether each window (k-mer) contains non-ACGT characters.
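The sliding-window idea above can be sketched in a few lines of Python (an illustrative sketch, not the tool used later in this thread; the function name is made up here):

```python
def count_invalid_kmers(seq: str, k: int) -> int:
    """Count k-mers in `seq` that contain at least one non-ACGT character,
    by sliding a window of size k with a step of 1."""
    valid = set("ACGTacgt")
    count = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if any(c not in valid for c in window):
            count += 1
    return count

# For "ANTTA" with K=3, the k-mers are ANT, NTT, TTA;
# two of them ("ANT" and "NTT") contain the invalid character N.
print(count_invalid_kmers("ANTTA", 3))  # -> 2
```

Note this scans every window, so it is O(len(seq) * k); fine for small inputs, but slow for whole genomes.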
Thanks for your kind reply. It's not homework: I want to count k-mers containing invalid characters in a large reference library that may contain hundreds of species, with a total size of 200 gigabytes or even larger; K might be 33 or larger. I have used Jellyfish, but it drops k-mers with invalid characters. Is SeqKit efficient enough to do this?
5.5 minutes on my laptop for human genome chr1 (~250 Mb, 248,956,422 bp). It would take about 67 hours for 200 GB of data.
The bottleneck is that there are too many k-mers :-P
$ time seqkit head -n 1 dataset_B.fa \
| seqkit sliding -s 1 -W 33 \
| seqkit seq -s -w 0 \
| grep -c '[^ACGTacgt]'
18479833
seqkit head -n 1 dataset_B.fa 0.74s user 0.42s system 96% cpu 1.204 total
seqkit sliding -s 1 -W 33 329.17s user 6.58s system 101% cpu 5:31.36 total
seqkit seq -s -w 0 137.60s user 11.08s system 44% cpu 5:31.30 total
grep -c '[^ACGTacgt]' 83.15s user 4.60s system 26% cpu 5:31.29 total
A much, much faster way: pre-retrieve the regions containing non-ACGT characters (SeqKit v0.5.2-dev or later needed).
The result is slightly different (18,472,858 < 18,479,833), presumably because only the non-ACGT runs themselves are extracted, so k-mers that merely overlap a run boundary are missed.
But it takes only 47 s for chr1 now!!!! That would be only about 10 hours for your 200 GB of data.
$ time seqkit head -n 1 dataset_B.fa \
| seqkit locate -P -p '[^ACGTacgt]+' -G \
| sed 1d | cut -f 1,7 | seqkit tab2fx \
| seqkit sliding -s 1 -W 33 \
| seqkit seq -s -w 0 \
| grep -c '[^ACGTacgt]'
18472858
seqkit head -n 1 dataset_B.fa 0.70s user 0.44s system 94% cpu 1.202 total
seqkit locate -P -p '[^ACGTacgt]+' -G 17.58s user 0.35s system 96% cpu 18.671 total
sed 1d 0.01s user 0.04s system 0% cpu 18.629 total
cut -f 1,7 0.06s user 0.02s system 0% cpu 18.626 total
seqkit tab2fx 0.09s user 0.07s system 0% cpu 46.613 total
seqkit sliding -s 1 -W 33 27.77s user 0.58s system 60% cpu 46.788 total
seqkit seq -s -w 0 10.72s user 0.93s system 24% cpu 46.778 total
grep -c '[^ACGTacgt]' 11.90s user 0.35s system 26% cpu 46.776 total
[Update 2] But it's still slow when the input sequences are very large (100 Mb+). A much faster way is to pre-retrieve the sequence regions containing non-ACGT characters and estimate using the method above (SeqKit v0.5.2-dev or later needed).
shenwei, Thanks for your help, I will try seqkit.
Would you please upvote/accept my answer if it helps and works? :)