Question: Counting N'S Within Fasta
gravatar for PoGibas
7.2 years ago by
PoGibas4.8k wrote:

My multiple fasta looks like that:


I would like to:
1.count the length of every sequence;
2.count the ratio of nnnnn & normalACTG (this one is too tricky for me. I don't know how to count n's). Perl preferably (sorry guys forgot to mention this before).

Wanted output would look like (name; length; length of n's; ratio n/ACTG). Hope someone can help me with this.

fasta perl counts • 10k views
ADD COMMENTlink modified 13 months ago by Biostar ♦♦ 20 • written 7.2 years ago by PoGibas4.8k

The FASTA format includes > before the sequence name.

ADD REPLYlink written 7.2 years ago by Martin A Hansen3.0k
gravatar for Damian Kao
7.2 years ago by
Damian Kao15k
Damian Kao15k wrote:

You can parse the fasta file with BioPython or whatever custom parser you have. Then use the count method in python to count the occurrence of nucleotides. For example:

#would give you counts of 'N'. I used lower() to convert the sequence to lowercase just in case it's not.
nCount = fastaSequence.lower().count('n') 
#would give you length of the sequence
length = len(fastaSequence)
ADD COMMENTlink written 7.2 years ago by Damian Kao15k
gravatar for Neilfws
7.2 years ago by
Sydney, Australia
Neilfws48k wrote:

EDIT: added line to convert n to N before count.

First, if you want a Perl solution: learn Bioperl and in particular study the Bio::SeqIO module. When you have mastered that, you will be able to read in any FASTA file and do whatever you like with it. In other words, your problem is simply a specific case of the more general problem "How to read a sequence file and do (operation X) to it."

Now, some Bioperl code, assuming that your sequences are in the file seqs.fa:

#!/usr/bin/perl -w 
use strict;
use Bio::SeqIO;

my $seqio = Bio::SeqIO->new(-file => "seqs.fa", -format => 'fasta');
while(my $seq = $seqio->next_seq) {
  my $s = $seq->seq;
  $s =~ s/n/N/g;
  my $id  = $seq->display_id; # name (ID)
  my $len = $seq->length; # length
  my %count;
  $count{$_}++ foreach split //, $s; # count bases
  $count{"N"} = 0 if !exists($count{"N"}); # create N = 0 if no N in sequence
  my $ratio = $count{"N"} / ($len - $count{"N"}); # ratio of N to not N
  my @output = ($id, $len, $count{"N"}, $ratio);
  print join(",", @output), "\n";

Result using your sequences:

ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Neilfws48k

I don't think this will work the way it is written, you probably need $count{"n"}. Also, I don't mean to be critical because the approach is clever, but it does not seem efficient to read the sequence into a hash just to get a count. If you try this on a large number of sequences (or really long sequences), transliterate will be much faster.

ADD REPLYlink written 7.2 years ago by SES8.2k

You should be critical as you are correct on both points. I changed n to N while testing the code and forgot to change it back - now fixed. And yes, transliteration is better for counting large sequences, provided that altering the string is not a problem.

ADD REPLYlink written 7.2 years ago by Neilfws48k

I assume you mean this by transliteration?

$countN= $seq->seq =~ tr/Nn/Nn/;

Also then no need for "$s =~ s/n/N/g;"

ADD REPLYlink written 7.2 years ago by Alastair Kerr5.2k

Yes, that's the one.

ADD REPLYlink written 7.2 years ago by Neilfws48k
gravatar for Pierre Lindenbaum
7.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

in C:

#include <stdio.h>
int main(int argc,char** argv)
    int N=0;
    int T=0;
        int c=fgetc(stdin);
            case EOF: case '>':
                if(T>0) printf("\t%d\t%d\t%f\n",N,T,N/(double)T);
                if(c==EOF) return 0;
                while((c=fgetc(stdin))!=EOF && c!='\n')
            case ' ':case '\n': case '\r': break;
            case 'N': case 'n': ++N;/* continue */
            default: ++T;
    return 0;


gcc -O3 -Wall input.c


./a.out < hg19.fa
chr1        23970000    249250621    0.096168
chr2        4994855    243199373    0.020538
chr3        3225295    198022430    0.016288
chr4        3492600    191154276    0.018271
chr5        3220000    180915260    0.017798
chr6        3720001    171115067    0.021740
chr7        3785000    159138663    0.023784
chr8        3475100    146364022    0.023743
chr9        21070000    141213431    0.149207
chr10        4220009    135534747    0.031136
chr11        3877000    135006516    0.028717
chr12        3370502    133851895    0.025181
chr13        19580000    115169878    0.170010
chr14        19060000    107349540    0.177551
chr15        20836626    102531392    0.203222
chr16        11470000    90354753    0.126944
chr17        3400000    81195210    0.041874
chr18        3420019    78077248    0.043803
chr19        3320000    59128983    0.056148
chr20        3520000    63025520    0.055850
chr21        13023253    48129895    0.270586
chr22        16410021    51304566    0.319855
chrX        4170000    155270560    0.026856
chrY        33720000    59373566    0.567930
chrM        0    16571    0.000000
ADD COMMENTlink written 7.2 years ago by Pierre Lindenbaum121k
gravatar for Martin A Hansen
7.2 years ago by
Martin A Hansen3.0k
Martin A Hansen3.0k wrote:

Use Biopieces:

Per sequence:

read_fasta -i input.fna | analyze_seq


read_fasta -i input.fna | analyze_seq | analyze_vals -k SEQ_LEN,HARD_MASK% -x
ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Martin A Hansen3.0k
gravatar for Anima Mundi
7.2 years ago by
Anima Mundi2.4k
Anima Mundi2.4k wrote:

This Python function could help you. As it is, it works as a that calls the function and takes the input from the terminal. If you remove the raw_input statement and the function call (last line) you obtain the function per se. Hope this helps.

def Counter():

    #Counter evaluates N nucleotides ratio.

    initial_string = raw_input("Enter the string:\n")


    string = str.upper(initial_string)

    n_A = float(string.count("A"))
    n_T = float(string.count("T"))
    n_C = float(string.count("C"))
    n_G = float(string.count("G"))
    n_N = float(string.count("N"))

    print "A occurred", int(n_A), "times"
    print "T occurred", int(n_T), "times"
    print "C occurred", int(n_C), "times"
    print "G occurred", int(n_G), "times"
    print "N occurred", int(n_N), "times"

    tot_len_string_nucl = n_A + n_T + n_C + n_G + n_N #Counts just the nucleotides.
    tot_len_string_all = float(len(string)) #Includes any "garbage" typing.

    tot_differ = tot_len_string_all - tot_len_string_nucl

    if tot_differ != 0.0:
        print "Non-nucleotide typing are present in the sequence. They were ignored"
        print "Total (nucleotide) length is", int(tot_len_string_nucl)
        print "Total length is", int(tot_len_string_nucl)


    perc_A = 100*(n_A / tot_len_string_nucl)
    perc_T = 100*(n_T / tot_len_string_nucl)
    perc_C = 100*(n_C / tot_len_string_nucl)
    perc_G = 100*(n_G / tot_len_string_nucl)
    perc_N = 100*(n_N / tot_len_string_nucl)

    print "%A = " + str(perc_A) + "%"
    print "%T = " + str(perc_T) + "%"
    print "%C = " + str(perc_C) + "%"
    print "%G = " + str(perc_G) + "%"
    print "%N = " + str(perc_N) + "%"

ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by Anima Mundi2.4k
gravatar for SES
7.2 years ago by
Vancouver, BC
SES8.2k wrote:

Here is another BioPerl example that may be easier to understand. This will print the percentage of Ns also, which is probably more informative than the ratio.

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;

my $usage = "$0 infile\n";
my $in = shift or die $usage;

my $seq_in = Bio::SeqIO->new(-file => $in, -format => 'fasta');

my ($n_count, $total, $seq_ct, $n_ratio, $n_ratio_t, $n_perc, $n_perc_t) = (0, 0, 0, 0, 0, 0, 0);

print "Seq_ID\tLength\tN_count\tN_ratio\tN_perc\n";

while(my $seq = $seq_in->next_seq) {
    if ($seq->length > 0) {
        $total += $seq->length;
        $n_count = ($seq->seq =~ tr/Nn//);
        $n_perc  = sprintf("%.2f",$n_count/$seq->length);
        $n_ratio = sprintf("%.2f",$n_count/($seq->length - $n_count));
        $n_ratio_t += $n_ratio;
        $n_perc_t  += $n_perc; 
        print join("\t",($seq->id,$seq->length,$n_count,$n_ratio,$n_perc)),"\n";

my $len_ave     = sprintf("%.0f",$total/$seq_ct);
my $n_ratio_ave = sprintf("%.2f",$n_ratio_t/$seq_ct);
my $n_perc_ave  = sprintf("%.2f",$n_perc_t/$seq_ct);

print "\nAverage length, N-ratio and N-percent for $in :   $len_ave\t$n_ratio_ave\t$n_perc_ave\n";

Assuming your sequences are in a file called "N.fas" save the script as "" (or whatever you want) then execute it:

perl N.fas

and see the results (which you could redirect to a file).

Seq_ID    Length    N_count    N_ratio    N_perc
ptnt1    65    21    0.48    0.32
ptnt2    59    33    1.27    0.56
ptnt3    47    14    0.42    0.30

Average length, N-ratio and N-percent for N.fas : 57    0.72    0.39
ADD COMMENTlink modified 2.8 years ago • written 7.2 years ago by SES8.2k
gravatar for lh3
7.2 years ago by
United States
lh331k wrote:

With seqtk:

seqtk comp input.fa | awk '{x=$3+$4+$5+$6;y=$2;print $1,y-x,y,(y-x)/y}'

Identical output to Pierre's for GRCh37. "Seqtk comp" reads gzipped files and both fasta and fastq. It also reports some other statistics.

For long sequences, C can be several times faster than scripts and even faster than Bio*.

ADD COMMENTlink modified 7.2 years ago • written 7.2 years ago by lh331k
gravatar for David Langenberger
6.3 years ago by
David Langenberger8.9k wrote:

There is also a tool from UCSC: faCount


faCount reads.fa

#seq    len    A    C    G    T    N    cpg
ptnt1    65    9    10    7    18    21    2
ptnt2    59    4    4    11    7    33    1
ptnt3    47    9    7    7    10    14    0
total    171    22    21    25    35    68    3

Your question (Wanted output would look like (name; length; length of n's; ratio n/ACTG):

faCount reads.fa | perl -ane 'next if(/\#/); print "$F[0]\t$F[1]\t$F[6]\t".($F[6]/($F[1]-$F[6]))."\n";'

ptnt1    65    21    0.477272727272727
ptnt2    59    33    1.26923076923077
ptnt3    47    14    0.424242424242424
total    171    68    0.660194174757282
ADD COMMENTlink written 6.3 years ago by David Langenberger8.9k
gravatar for ngsgene
7.2 years ago by
United States
ngsgene350 wrote:

If you were to use perl without bio perl, using an if ($l !~ />/) loop in the script can get you to your sequence after the fasta description and length() can give you the length. For counting n's you can split the sequence and increment a hash when an n comes while another hash for counting a/c/g/t

Store all this information in variables for you to print, for name you can use ($l =~ />($1)/) and then grab $1 and print it for the name, length from length(), length of n's can be determined from the hash, and you can calculate the ratio using both the hashes.

The key is to print the variables without over writing them.

ADD COMMENTlink written 7.2 years ago by ngsgene350
gravatar for fo3c
6.4 years ago by
fo3c430 wrote:

Something simpler, in perl as requested, without bio- dependencies.


my ($h, $n, $l);

open(I,$ARGV[0]) or die($!);
    print $h,"\t",$l,"\t",$n,"\t",$n/($l-$n),"\n";

To use it: $ perl sequences.fasta

ADD COMMENTlink modified 6.4 years ago • written 6.4 years ago by fo3c430
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2053 users visited in the last hour