counting motifs in dna sequence
2
2
Entering edit mode
7.4 years ago

Hi,

I have these sequences:

GCAGGCATAGTCGGAACTGCTCTAAGCCTATTAATTCGAGCTGAGCTAAGCCAGCCTGGGGCTCTGCTCGGAGATGA AGTGGGCTTGTTGGGACTGGTCTTTCTTTATTAATTCGTTTTGAGTTAGGCACTGTTGGAGTTTTATTAG---ATAA GCAGGAATAGTTGGAACCGCCCTTAGCTTATTAATTCGAGCAGAACTCAGCCAACCTGGTGCCTTATTAGGGGATGA GCTGGCATAGTAGGAACTGCCCTTAGCCTTTTAATTCGAGCAGAGCTCAGTCAACCCGGAGCCCTGCTCGGAGATGA GCAGGAATAGTTGGAACTGCACTAAGCCTTTTAATTCGAGCTGAACTAAGCCAACCCGGAGCATTACTTGGAGACGA

They would be actually longer, but right not it is not important.

I would like to estimate a given value for the sequences given the number of motif/s in. I would like to count a (number of) motif/s like "ATCGCGCGCGCTTTAAA" in a given sequence, and then use that number to estimate a value for that sequence. I know that you can use a logical question to ask whether a given sequence has a motif but I would like to count them.

Thanks

sequence motif counting • 4.0k views
ADD COMMENT
0
Entering edit mode

Do you have actual motif or just a string, ATCGCGCGCGCTTTAAA ?

ADD REPLY
0
Entering edit mode

Biostrings package in R provides a beautiful function for you purpose:gregexpr2

ADD REPLY
4
Entering edit mode
7.4 years ago

The sequences you pasted seemed aligned sequences:

$ cat seqs.txt 
GCAGGCATAGTCGGAACTGCTCTAAGCCTATTAATTCGAGCTGAGCTAAGCCAGCCTGGGGCTCTGCTCGGAGATGA
AGTGGGCTTGTTGGGACTGGTCTTTCTTTATTAATTCGTTTTGAGTTAGGCACTGTTGGAGTTTTATTAG---ATAA
GCAGGAATAGTTGGAACCGCCCTTAGCTTATTAATTCGAGCAGAACTCAGCCAACCTGGTGCCTTATTAGGGGATGA
GCTGGCATAGTAGGAACTGCCCTTAGCCTTTTAATTCGAGCAGAGCTCAGTCAACCCGGAGCCCTGCTCGGAGATGA
GCAGGAATAGTTGGAACTGCACTAAGCCTTTTAATTCGAGCTGAACTAAGCCAACCCGGAGCATTACTTGGAGACGA

Let's do this using seqkit locate (usage) and csvtk freq (usage) in one-line command:

$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -i -d -p ATC | csvtk -t freq -k -f 1
seqID   frequency
1       1
2       1
3       1
4       1

Both SeqKit and csvtk provide executable binary files for Linux/Windows/Mac OS, so you can just download and run without any compilation and configuration.

Let me explain this step by step:

Step 1: converting your input sequences to FASTA format and remove gaps (-):

$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g 
>1
GCAGGCATAGTCGGAACTGCTCTAAGCCTATTAATTCGAGCTGAGCTAAGCCAGCCTGGG
GCTCTGCTCGGAGATGA
>2
AGTGGGCTTGTTGGGACTGGTCTTTCTTTATTAATTCGTTTTGAGTTAGGCACTGTTGGA
GTTTTATTAGATAA

Step 2: searching and locating your sequence motif (e.g., ATC):

$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -i -d -p ATC | csvtk -t pretty
seqID   patternName   pattern   strand   start   end   matched
1       ATC           ATC       -        73      75    ATC
2       ATC           ATC       -        70      72    ATC
3       ATC           ATC       -        73      75    ATC
4       ATC           ATC       -        73      75    ATC

Notes:

  • Flag -d/--degenerate is on, so you can search motif containing degenerate base, e.g., ACTGNNNNCCC.
  • Command csvtk -t pretty is used to align table data.

If you have lots of sequence motifs, you can save them to file in FASTA format, and use this:

$ cat motifs.fa
>motif1
ATC
>motif2
actgn


$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -i -d -f motifs.fa | csvtk -t pretty
seqID   patternName   pattern   strand   start   end   matched
1       motif2        actgn     +        16      20    ACTGC
1       motif1        ATC       -        73      75    ATC
2       motif2        actgn     +        16      20    ACTGG
2       motif2        actgn     +        52      56    ACTGT
2       motif1        ATC       -        70      72    ATC
3       motif1        ATC       -        73      75    ATC
4       motif1        ATC       -        73      75    ATC
4       motif2        actgn     +        16      20    ACTGC
4       motif2        actgn     -        47      51    ACTGA
5       motif2        actgn     +        16      20    ACTGC

Step 3: counting the number of motifs in every sequences with csvtk freq

$  cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -p ATC | csvtk -t freq -k -f 1
seqID   frequency
1       1
2       1
3       1
4       1

For case of multiple motifs:

$ cat -n seqs.txt | sed 's/ //g' | seqkit tab2fx | seqkit seq -g | seqkit locate -i -d -f motifs.fa | csvtk -t freq -k -f 1,2 | csvtk -t pretty
seqID   patternName   frequency
1       motif1        1
1       motif2        1
2       motif1        1
2       motif2        2
3       motif1        1
4       motif1        1
4       motif2        2
5       motif2        1
ADD COMMENT
0
Entering edit mode

Please download the latest version (v0.4.0 +). I've just fixed a bug of seqkit locate. e.g., only find two locations (1-4, 7-10, missing 4-7) of ACGA in ACGACGACGA.

$ echo -ne ">seq\nACGACGACGA\n" | seqkit locate -i -p acga | column -t
seqID  patternName  pattern  strand  start  end  matched
seq    acga         acga     +       1      4    ACGA
seq    acga         acga     +       4      7    ACGA   ## missing in previous version
seq    acga         acga     +       7      10   ACGA
ADD REPLY
1
Entering edit mode
7.4 years ago

You may use the dreg emboss tool or seqkit grep and probably many others. Even grep will work once you remove the new lines

will put the whole genome on one line, so don't do it for large genomes:

 cat fasta.fa | tr -d '\n' | grep -o "ATGC"
ADD COMMENT
0
Entering edit mode

Hi, Istvan, it should be seqkit locate not seqkit grep

ADD REPLY
0
Entering edit mode

Ah ok, thanks for the correction.

ADD REPLY

Login before adding your answer.

Traffic: 2934 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6