Question: How to count the frequency of letter in a sequence file?
0
gravatar for nimishabalan131
10 months ago by
nimishabalan13110 wrote:

I have a fasta file (seq.fasta) containing multiple sequences;

    >seq1 
    ATGCGTCTCCCCTTTAGAGAGTTCTCTCTAGCTACGTA
    ATTTTTATCGCGCGGGGTGCGACGTTTTTAGGGGGGGG
    >seq2
    ATCTCTNNNNNNNNNNATATCCCCTTTNNNNNCTCTCT
    ATTTTTTTTTCCCCCCGCGCGCGATCGACGCCCCCCCC
    >seq3
    ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCCCT
    NNNNNTTTTTCTCTCTCGCGCTCGTCGAAAAATGCCCC

How to count the frequency of 'N' and the number of positions this pattern has been occurring? (ATCTCT "NNNNNNNNNN" ATATCCCCTTT "NNNNN" CTCTCT).

The result should be No. of occurrences of 'N' and number of positions this pattern has been seen per sequence

           Output

          seq1,0,0
          seq2,15,2
          seq3,15,2

         ($id=seq1, No_of_N's=0, frequency_pattern=0
          $id=seq2, No_of_N's=15, frequency_pattern=2 
          $id=seq3, No_of_N's=15, frequency_pattern=2)
sequence genome • 357 views
ADD COMMENTlink modified 10 months ago • written 10 months ago by nimishabalan13110
1

I have changed your post to a Question, as it is asking for help and not providing a Tutorial.

Please can you tell us what you've done so far? Also why do you need this information?

ADD REPLYlink written 10 months ago by Emily_Ensembl21k
1

What have you tried? Which programming language do you want to use? Have you searched online for suitable tools?

We are volunteers and want to put you on the right track, but we don't want to invest a lot of our time to provide you with a ready to use solution.

ADD REPLYlink written 10 months ago by WouterDeCoster44k
1

with seqkit, awk and datamash & not printing sequences with zero pattern:

$ seqkit locate -Pp 'N+' test.fa | awk -v OFS="\t" 'NR==1 {print $0,"length"}; NR!=1 {print $0,$6-$5}'| datamash -sH -g 1 sum 8 count 8 --full
seqID   patternName pattern strand  start   end matched length  sum(length) count(length)
seq2    N+  N+  +   7   16  NNNNNNNNNN  9   13  2
seq3    N+  N+  +   7   16  NNNNNNNNNN  9   13  2
ADD REPLYlink modified 10 months ago • written 10 months ago by cpad011214k

Sounds like a homework assignement.

Use python count() method. Or just go through all your letters one by one.

ADD REPLYlink written 10 months ago by German.M.Demidov1.9k
2
gravatar for nimishabalan131
10 months ago by
nimishabalan13110 wrote:

I got the answer

            $ gawk '
            BEGIN {
                RS=">seq[^\n]+"
             }
            NR>1 {
                # gsub(/\n/,"")  # UNCOMMENT THIS IF NEWLINE SEPARATED PATTERN IS ONE PATTERN 
                printf "%s=%d,%d\n",rt,gsub(/N/,"N"),gsub(/N+/,"")
             }
             {
                rt=RT
             }' file_name
ADD COMMENTlink written 10 months ago by nimishabalan13110
1

Thanks for sharing the solution!

ADD REPLYlink written 10 months ago by WouterDeCoster44k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1411 users visited in the last hour