How to count the frequency of a letter in a sequence file?
4.3 years ago

I have a FASTA file (seq.fasta) containing multiple sequences:

    >seq1 
    ATGCGTCTCCCCTTTAGAGAGTTCTCTCTAGCTACGTA
    ATTTTTATCGCGCGGGGTGCGACGTTTTTAGGGGGGGG
    >seq2
    ATCTCTNNNNNNNNNNATATCCCCTTTNNNNNCTCTCT
    ATTTTTTTTTCCCCCCGCGCGCGATCGACGCCCCCCCC
    >seq3
    ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCCCT
    NNNNNTTTTTCTCTCTCGCGCTCGTCGAAAAATGCCCC

How can I count the number of 'N' characters in each sequence, and the number of separate runs of consecutive N's? For example, seq2 contains two runs: ATCTCT "NNNNNNNNNN" ATATCCCCTTT "NNNNN" CTCTCT.

The result should give, per sequence, the total number of 'N' characters and the number of runs of N:

           Output

          seq1,0,0
          seq2,15,2
          seq3,15,2

         ($id=seq1, No_of_N's=0, frequency_pattern=0
          $id=seq2, No_of_N's=15, frequency_pattern=2 
          $id=seq3, No_of_N's=15, frequency_pattern=2)

I have changed your post to a Question, as it is asking for help and not providing a Tutorial.

Please can you tell us what you've done so far? Also why do you need this information?


What have you tried? Which programming language do you want to use? Have you searched online for suitable tools?

We are volunteers and want to put you on the right track, but we don't want to invest a lot of our time in providing a ready-to-use solution.


With seqkit, awk and datamash (sequences with no match are not printed). seqkit reports 1-based inclusive start and end coordinates, so the length of each run is end - start + 1:

    $ seqkit locate -Pp 'N+' test.fa | awk -v OFS="\t" 'NR==1 {print $0,"length"}; NR!=1 {print $0,$6-$5+1}' | datamash -sH -g 1 sum 8 count 8 --full
    seqID   patternName pattern strand  start   end matched length  sum(length) count(length)
    seq2    N+  N+  +   7   16  NNNNNNNNNN  10  15  2
    seq3    N+  N+  +   7   16  NNNNNNNNNN  10  15  2

Sounds like a homework assignment.

Use Python's str.count() method, or just go through all the letters one by one.
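A minimal sketch of that Python approach, assuming a plain FASTA file in which wrapped sequence lines belong to one sequence. The function names `count_n` and `summarize` are made up for illustration. `str.count()` gives the total number of N's, and a regular expression `N+` counts the runs:

```python
import re
from io import StringIO


def summarize(seq_id, chunks):
    # Join wrapped lines first, so a run of N split across two lines counts once.
    seq = "".join(chunks)
    return seq_id, seq.count("N"), len(re.findall(r"N+", seq))


def count_n(handle):
    """Yield (seq_id, n_count, n_runs) for each record in a FASTA handle."""
    seq_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield summarize(seq_id, chunks)
            # Use the first word of the header as the ID (empty if header is blank).
            seq_id, chunks = (line[1:].split() or [""])[0], []
        elif line:
            chunks.append(line)
    if seq_id is not None:
        yield summarize(seq_id, chunks)


# The example sequences from the question.
example = """>seq1
ATGCGTCTCCCCTTTAGAGAGTTCTCTCTAGCTACGTA
ATTTTTATCGCGCGGGGTGCGACGTTTTTAGGGGGGGG
>seq2
ATCTCTNNNNNNNNNNATATCCCCTTTNNNNNCTCTCT
ATTTTTTTTTCCCCCCGCGCGCGATCGACGCCCCCCCC
>seq3
ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCCCT
NNNNNTTTTTCTCTCTCGCGCTCGTCGAAAAATGCCCC
"""

results = list(count_n(StringIO(example)))
for seq_id, n_count, n_runs in results:
    print(f"{seq_id},{n_count},{n_runs}")
```

For the example data this prints `seq1,0,0`, `seq2,15,2` and `seq3,15,2`. To run it on a real file, pass `open("seq.fasta")` instead of the `StringIO` object.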

4.3 years ago

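I found a solution using gawk:

    $ gawk '
    BEGIN {
        # Treat each FASTA header line as the record separator,
        # so every record is the sequence text that follows a header.
        RS=">seq[^\n]+"
    }
    NR>1 {
        # gsub(/\n/,"")  # Uncomment to count a run of N split across lines as one run
        # The first gsub counts single N characters (replacing N with N changes nothing);
        # the second gsub counts runs of consecutive N.
        printf "%s=%d,%d\n",rt,gsub(/N/,"N"),gsub(/N+/,"")
    }
    {
        # RT is the header that ended this record, i.e. the header of the next sequence.
        rt=RT
    }' file_name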

Thanks for sharing the solution!
