Question: Counting base and nucleotide frequency of multifasta file
0
gravatar for saadleeshehreen
10 months ago by
saadleeshehreen70 wrote:

Hi, I have a mulifasta with 2000 sequences. The file is like this.

>spacer_1
ATCCCGGGGGGTTTA...............
>spacer_2
TCAGGTTT.......
.
.

I want to count how many bases for each of them and what is the frequency of nucleotide (A,T,G,C) in each of the sequence. I tried this one, but it gave total base count whereas I want a count for each sequence.

grep -v ">" file.fasta | wc | awk '{print $3-$1}'

Any script for this purpose?

Cheers

ADD COMMENTlink modified 10 months ago by Firas0 • written 10 months ago by saadleeshehreen70
1

You can use bioawk (bioawk -c fastx) to get this done.

ADD REPLYlink modified 10 months ago • written 10 months ago by RamRS24k
2
gravatar for JC
10 months ago by
JC8.7k
Mexico
JC8.7k wrote:

It can be done with Perl:

#!/usr/bin/perl

use strict;
use warnings;

my %seqs;
$/ = "\n>"; # read fasta by sequence, not by lines

while (<>) {
    s/>//g;
    my ($seq_id, @seq) = split (/\n/, $_);
    my $seq = uc(join "", @seq); # rebuild sequence as a single string
    my $len = length $seq;
    my $numA = $seq =~ tr/A//; # removing A's from sequence returns total counts
    my $numC = $seq =~ tr/C//;
    my $numG = $seq =~ tr/G//;
    my $numT = $seq =~ tr/T//;
    print "$seq_id: Size=$len  A=$numA  C=$numC  G=$numG  T=$numT\n";
}

Testing it:

$ perl count.pl < seqs.fa
spacer_1: Size=15 A=2  C=3  G=6  T=4
spacer_2: Size=8 A=1  C=1  G=2  T=4
ADD COMMENTlink written 10 months ago by JC8.7k
0
gravatar for Firas
10 months ago by
Firas0
Firas0 wrote:

Using shell

while read line; do echo $line | grep -v '>' | grep -o "[ACGT]" | sort | uniq -c; \
echo $line | grep '>' ; done < file.fasta

The result:

>spacer_1
      2 A
      3 C
      6 G
      4 T
>spacer_2
      1 A
      1 C
      2 G
      4 T

Or use

while read line; do echo $line | grep -v '>' | grep -o "[ACGT]" | sort | uniq -c \
| paste - - - - ; echo $line | grep '>' | tr "\n" "\t" ; done < file.fasta

for a more convenient output

>spacer_1         2 A         3 C         6 G         4 T
>spacer_2         1 A         1 C         2 G         4 T
ADD COMMENTlink modified 10 months ago • written 10 months ago by Firas0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1155 users visited in the last hour