Question

How to count the frequency of letter in a sequence file?

0

Entering edit mode

4.3 years ago

nimishabalan131 ▴ 20

I have a fasta file (seq.fasta) containing multiple sequences;

    >seq1 
    ATGCGTCTCCCCTTTAGAGAGTTCTCTCTAGCTACGTA
    ATTTTTATCGCGCGGGGTGCGACGTTTTTAGGGGGGGG
    >seq2
    ATCTCTNNNNNNNNNNATATCCCCTTTNNNNNCTCTCT
    ATTTTTTTTTCCCCCCGCGCGCGATCGACGCCCCCCCC
    >seq3
    ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCCCT
    NNNNNTTTTTCTCTCTCGCGCTCGTCGAAAAATGCCCC

How to count the frequency of 'N' and the number of positions this pattern has been occurring? (ATCTCT "NNNNNNNNNN" ATATCCCCTTT "NNNNN" CTCTCT).

The result should be No. of occurrences of 'N' and number of positions this pattern has been seen per sequence

           Output

          seq1,0,0
          seq2,15,2
          seq3,15,2

         ($id=seq1, No_of_N's=0, frequency_pattern=0
          $id=seq2, No_of_N's=15, frequency_pattern=2 
          $id=seq3, No_of_N's=15, frequency_pattern=2)

sequence genome • 1.2k views

ADD COMMENT • link 4.3 years ago by nimishabalan131 ▴ 20

1

Entering edit mode

I have changed your post to a Question, as it is asking for help and not providing a Tutorial.

Please can you tell us what you've done so far? Also why do you need this information?

ADD REPLY • link 4.3 years ago by Emily 23k

1

Entering edit mode

What have you tried? Which programming language do you want to use? Have you searched online for suitable tools?

We are volunteers and want to put you on the right track, but we don't want to invest a lot of our time to provide you with a ready to use solution.

ADD REPLY • link 4.3 years ago by WouterDeCoster 47k

1

Entering edit mode

with seqkit, awk and datamash & not printing sequences with zero pattern:

$ seqkit locate -Pp 'N+' test.fa | awk -v OFS="\t" 'NR==1 {print $0,"length"}; NR!=1 {print $0,$6-$5}'| datamash -sH -g 1 sum 8 count 8 --full
seqID   patternName pattern strand  start   end matched length  sum(length) count(length)
seq2    N+  N+  +   7   16  NNNNNNNNNN  9   13  2
seq3    N+  N+  +   7   16  NNNNNNNNNN  9   13  2

ADD REPLY • link 4.3 years ago by cpad0112 21k

0

Entering edit mode

Sounds like a homework assignement.

Use python count() method. Or just go through all your letters one by one.

ADD REPLY • link 4.3 years ago by German.M.Demidov ★ 2.9k

score 2 · Accepted Answer · 2020-01-03

2

Entering edit mode

4.3 years ago

nimishabalan131 ▴ 20

I got the answer

            $ gawk '
            BEGIN {
                RS=">seq[^\n]+"
             }
            NR>1 {
                # gsub(/\n/,"")  # UNCOMMENT THIS IF NEWLINE SEPARATED PATTERN IS ONE PATTERN 
                printf "%s=%d,%d\n",rt,gsub(/N/,"N"),gsub(/N+/,"")
             }
             {
                rt=RT
             }' file_name