Question: How to count the frequency of letter in a sequence file?
0
10 months ago by
nimishabalan13110 wrote:

I have a fasta file (seq.fasta) containing multiple sequences;

``````    >seq1
ATGCGTCTCCCCTTTAGAGAGTTCTCTCTAGCTACGTA
ATTTTTATCGCGCGGGGTGCGACGTTTTTAGGGGGGGG
>seq2
ATCTCTNNNNNNNNNNATATCCCCTTTNNNNNCTCTCT
ATTTTTTTTTCCCCCCGCGCGCGATCGACGCCCCCCCC
>seq3
ATCTCTNNNNNNNNNNATATCCCCTTCTCGGGGCCCCT
NNNNNTTTTTCTCTCTCGCGCTCGTCGAAAAATGCCCC
``````

How to count the frequency of 'N' and the number of positions this pattern has been occurring? (ATCTCT "NNNNNNNNNN" ATATCCCCTTT "NNNNN" CTCTCT).

The result should be No. of occurrences of 'N' and number of positions this pattern has been seen per sequence

``````           Output

seq1,0,0
seq2,15,2
seq3,15,2

(\$id=seq1, No_of_N's=0, frequency_pattern=0
\$id=seq2, No_of_N's=15, frequency_pattern=2
\$id=seq3, No_of_N's=15, frequency_pattern=2)
``````
sequence genome • 357 views
modified 10 months ago • written 10 months ago by nimishabalan13110
1

I have changed your post to a Question, as it is asking for help and not providing a Tutorial.

Please can you tell us what you've done so far? Also why do you need this information?

1

What have you tried? Which programming language do you want to use? Have you searched online for suitable tools?

We are volunteers and want to put you on the right track, but we don't want to invest a lot of our time to provide you with a ready to use solution.

1

with seqkit, awk and datamash & not printing sequences with zero pattern:

``````\$ seqkit locate -Pp 'N+' test.fa | awk -v OFS="\t" 'NR==1 {print \$0,"length"}; NR!=1 {print \$0,\$6-\$5}'| datamash -sH -g 1 sum 8 count 8 --full
seqID   patternName pattern strand  start   end matched length  sum(length) count(length)
seq2    N+  N+  +   7   16  NNNNNNNNNN  9   13  2
seq3    N+  N+  +   7   16  NNNNNNNNNN  9   13  2
``````

Sounds like a homework assignement.

Use python `count()` method. Or just go through all your letters one by one.

2
10 months ago by
nimishabalan13110 wrote:

``````            \$ gawk '
BEGIN {
RS=">seq[^\n]+"
}
NR>1 {
# gsub(/\n/,"")  # UNCOMMENT THIS IF NEWLINE SEPARATED PATTERN IS ONE PATTERN
printf "%s=%d,%d\n",rt,gsub(/N/,"N"),gsub(/N+/,"")
}
{
rt=RT
}' file_name
``````
1

Thanks for sharing the solution!