Question: Counting N'S Within Fasta
4
7.2 years ago by
PoGibas4.8k
Vilnius
PoGibas4.8k wrote:

My multiple fasta looks like that:

>ptnt1
ATCTAGTCGATTTTTCGTCTAGTnnnnnnnnnnnnnnnnnnnnnCTAGTCCCTATGCCTGATATA
>ptnt2
GGCGGATGGTnnnnnnnnnnnnnnnnnTGTCTGGAnnnnnnnnnnnnnnnnCATGCTAG
>ptnt3
CACTGATTGGGTAACACTTAGTATGCTTAGCCAnnnnnnnnnnnnnn

I would like to:
1.count the length of every sequence;
2.count the ratio of nnnnn & normalACTG (this one is too tricky for me. I don't know how to count n's). Perl preferably (sorry guys forgot to mention this before).

Wanted output would look like (name; length; length of n's; ratio n/ACTG). Hope someone can help me with this.

fasta perl counts • 10k views
modified 13 months ago by Biostar ♦♦ 20 • written 7.2 years ago by PoGibas4.8k

The FASTA format includes > before the sequence name.

15
7.2 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

You can parse the fasta file with BioPython or whatever custom parser you have. Then use the count method in python to count the occurrence of nucleotides. For example:

#would give you counts of 'N'. I used lower() to convert the sequence to lowercase just in case it's not.
nCount = fastaSequence.lower().count('n')
#would give you length of the sequence
length = len(fastaSequence)
11
7.2 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

EDIT: added line to convert n to N before count.

First, if you want a Perl solution: learn Bioperl and in particular study the Bio::SeqIO module. When you have mastered that, you will be able to read in any FASTA file and do whatever you like with it. In other words, your problem is simply a specific case of the more general problem "How to read a sequence file and do (operation X) to it."

Now, some Bioperl code, assuming that your sequences are in the file seqs.fa:

#!/usr/bin/perl -w
use strict;
use Bio::SeqIO;

my \$seqio = Bio::SeqIO->new(-file => "seqs.fa", -format => 'fasta');
while(my \$seq = \$seqio->next_seq) {
my \$s = \$seq->seq;
\$s =~ s/n/N/g;
my \$id  = \$seq->display_id; # name (ID)
my \$len = \$seq->length; # length
my %count;
\$count{\$_}++ foreach split //, \$s; # count bases
\$count{"N"} = 0 if !exists(\$count{"N"}); # create N = 0 if no N in sequence
my \$ratio = \$count{"N"} / (\$len - \$count{"N"}); # ratio of N to not N
my @output = (\$id, \$len, \$count{"N"}, \$ratio);
print join(",", @output), "\n";
}

ptNt1,65,21,0.477272727272727
ptNt2,59,33,1.26923076923077
ptNt3,47,14,0.424242424242424

I don't think this will work the way it is written, you probably need \$count{"n"}. Also, I don't mean to be critical because the approach is clever, but it does not seem efficient to read the sequence into a hash just to get a count. If you try this on a large number of sequences (or really long sequences), transliterate will be much faster.

You should be critical as you are correct on both points. I changed n to N while testing the code and forgot to change it back - now fixed. And yes, transliteration is better for counting large sequences, provided that altering the string is not a problem.

I assume you mean this by transliteration?

\$countN= \$seq->seq =~ tr/Nn/Nn/;

Also then no need for "\$s =~ s/n/N/g;"

Yes, that's the one.

7
7.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum121k wrote:

in C:

#include <stdio.h>
int main(int argc,char** argv)
{
int N=0;
int T=0;
for(;;)
{
int c=fgetc(stdin);
switch(c)
{
case EOF: case '>':
{
if(T>0) printf("\t%d\t%d\t%f\n",N,T,N/(double)T);
if(c==EOF) return 0;
N=0;
T=0;
while((c=fgetc(stdin))!=EOF && c!='\n')
{
fputc(c,stdout);
}
fputc('\t',stdout);
break;
}
case ' ':case '\n': case '\r': break;
case 'N': case 'n': ++N;/* continue */
default: ++T;
}
}
return 0;
}

.

gcc -O3 -Wall input.c

.

./a.out < hg19.fa
chr1        23970000    249250621    0.096168
chr2        4994855    243199373    0.020538
chr3        3225295    198022430    0.016288
chr4        3492600    191154276    0.018271
chr5        3220000    180915260    0.017798
chr6        3720001    171115067    0.021740
chr7        3785000    159138663    0.023784
chr8        3475100    146364022    0.023743
chr9        21070000    141213431    0.149207
chr10        4220009    135534747    0.031136
chr11        3877000    135006516    0.028717
chr12        3370502    133851895    0.025181
chr13        19580000    115169878    0.170010
chr14        19060000    107349540    0.177551
chr15        20836626    102531392    0.203222
chr16        11470000    90354753    0.126944
chr17        3400000    81195210    0.041874
chr18        3420019    78077248    0.043803
chr19        3320000    59128983    0.056148
chr20        3520000    63025520    0.055850
chr21        13023253    48129895    0.270586
chr22        16410021    51304566    0.319855
chrX        4170000    155270560    0.026856
chrY        33720000    59373566    0.567930
chrM        0    16571    0.000000
5
7.2 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

Use Biopieces:

Per sequence:

Summary:

4
7.2 years ago by
Anima Mundi2.4k
Italy
Anima Mundi2.4k wrote:

This Python function could help you. As it is, it works as a script.py that calls the function and takes the input from the terminal. If you remove the raw_input statement and the function call (last line) you obtain the function per se. Hope this helps.

def Counter():

#Counter evaluates N nucleotides ratio.

initial_string = raw_input("Enter the string:\n")

print

string = str.upper(initial_string)

n_A = float(string.count("A"))
n_T = float(string.count("T"))
n_C = float(string.count("C"))
n_G = float(string.count("G"))
n_N = float(string.count("N"))

print "A occurred", int(n_A), "times"
print "T occurred", int(n_T), "times"
print "C occurred", int(n_C), "times"
print "G occurred", int(n_G), "times"
print "N occurred", int(n_N), "times"
print

tot_len_string_nucl = n_A + n_T + n_C + n_G + n_N #Counts just the nucleotides.
tot_len_string_all = float(len(string)) #Includes any "garbage" typing.

tot_differ = tot_len_string_all - tot_len_string_nucl

if tot_differ != 0.0:
print "Non-nucleotide typing are present in the sequence. They were ignored"
print "Total (nucleotide) length is", int(tot_len_string_nucl)
else:
print "Total length is", int(tot_len_string_nucl)

print

perc_A = 100*(n_A / tot_len_string_nucl)
perc_T = 100*(n_T / tot_len_string_nucl)
perc_C = 100*(n_C / tot_len_string_nucl)
perc_G = 100*(n_G / tot_len_string_nucl)
perc_N = 100*(n_N / tot_len_string_nucl)

print "%A = " + str(perc_A) + "%"
print "%T = " + str(perc_T) + "%"
print "%C = " + str(perc_C) + "%"
print "%G = " + str(perc_G) + "%"
print "%N = " + str(perc_N) + "%"
print

Counter()
4
7.2 years ago by
SES8.2k
Vancouver, BC
SES8.2k wrote:

Here is another BioPerl example that may be easier to understand. This will print the percentage of Ns also, which is probably more informative than the ratio.

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;

my \$usage = "\$0 infile\n";
my \$in = shift or die \$usage;

my \$seq_in = Bio::SeqIO->new(-file => \$in, -format => 'fasta');

my (\$n_count, \$total, \$seq_ct, \$n_ratio, \$n_ratio_t, \$n_perc, \$n_perc_t) = (0, 0, 0, 0, 0, 0, 0);

print "Seq_ID\tLength\tN_count\tN_ratio\tN_perc\n";

while(my \$seq = \$seq_in->next_seq) {
if (\$seq->length > 0) {
\$seq_ct++;
\$total += \$seq->length;
\$n_count = (\$seq->seq =~ tr/Nn//);
\$n_perc  = sprintf("%.2f",\$n_count/\$seq->length);
\$n_ratio = sprintf("%.2f",\$n_count/(\$seq->length - \$n_count));
\$n_ratio_t += \$n_ratio;
\$n_perc_t  += \$n_perc;
print join("\t",(\$seq->id,\$seq->length,\$n_count,\$n_ratio,\$n_perc)),"\n";
}
}

my \$len_ave     = sprintf("%.0f",\$total/\$seq_ct);
my \$n_ratio_ave = sprintf("%.2f",\$n_ratio_t/\$seq_ct);
my \$n_perc_ave  = sprintf("%.2f",\$n_perc_t/\$seq_ct);

print "\nAverage length, N-ratio and N-percent for \$in :   \$len_ave\t\$n_ratio_ave\t\$n_perc_ave\n";

Assuming your sequences are in a file called "N.fas" save the script as "bp_count_n.pl" (or whatever you want) then execute it:

perl bp_count_n.pl N.fas

and see the results (which you could redirect to a file).

Seq_ID    Length    N_count    N_ratio    N_perc
ptnt1    65    21    0.48    0.32
ptnt2    59    33    1.27    0.56
ptnt3    47    14    0.42    0.30

Average length, N-ratio and N-percent for N.fas : 57    0.72    0.39
4
7.2 years ago by
lh331k
United States
lh331k wrote:

With seqtk:

seqtk comp input.fa | awk '{x=\$3+\$4+\$5+\$6;y=\$2;print \$1,y-x,y,(y-x)/y}'

Identical output to Pierre's for GRCh37. "Seqtk comp" reads gzipped files and both fasta and fastq. It also reports some other statistics.

For long sequences, C can be several times faster than scripts and even faster than Bio*.

3
6.3 years ago by
Deutschland
David Langenberger8.9k wrote:

There is also a tool from UCSC: faCount

Example:

#seq    len    A    C    G    T    N    cpg
ptnt1    65    9    10    7    18    21    2
ptnt2    59    4    4    11    7    33    1
ptnt3    47    9    7    7    10    14    0
total    171    22    21    25    35    68    3

Your question (Wanted output would look like (name; length; length of n's; ratio n/ACTG):

faCount reads.fa | perl -ane 'next if(/\#/); print "\$F[0]\t\$F[1]\t\$F[6]\t".(\$F[6]/(\$F[1]-\$F[6]))."\n";'

ptnt1    65    21    0.477272727272727
ptnt2    59    33    1.26923076923077
ptnt3    47    14    0.424242424242424
total    171    68    0.660194174757282
2
7.2 years ago by
ngsgene350
United States
ngsgene350 wrote:

If you were to use perl without bio perl, using an if (\$l !~ />/) loop in the script can get you to your sequence after the fasta description and length() can give you the length. For counting n's you can split the sequence and increment a hash when an n comes while another hash for counting a/c/g/t

Store all this information in variables for you to print, for name you can use (\$l =~ />(\$1)/) and then grab \$1 and print it for the name, length from length(), length of n's can be determined from the hash, and you can calculate the ratio using both the hashes.

The key is to print the variables without over writing them.

2
6.4 years ago by
fo3c430
.eu
fo3c430 wrote:

Something simpler, in perl as requested, without bio- dependencies.

#!/usr/bin/perl

my (\$h, \$n, \$l);

open(I,\$ARGV[0]) or die(\$!);
while(<I>){
chomp;
if(/^>/){
\$h=substr(\$_,1);
}else{
\$n=(\$_=~tr/nN/nN/);
\$l=length(\$_);
print \$h,"\t",\$l,"\t",\$n,"\t",\$n/(\$l-\$n),"\n";
}
}
close(I);

To use it: \$ perl script_name.pl sequences.fasta