Question: Counting base and nucleotide frequency of multifasta file
0
gravatar for saadleeshehreen
21 months ago by
saadleeshehreen70 wrote:

Hi, I have a mulifasta with 2000 sequences. The file is like this.

>spacer_1
ATCCCGGGGGGTTTA...............
>spacer_2
TCAGGTTT.......
.
.

I want to count how many bases for each of them and what is the frequency of nucleotide (A,T,G,C) in each of the sequence. I tried this one, but it gave total base count whereas I want a count for each sequence.

grep -v ">" file.fasta | wc | awk '{print $3-$1}'

Any script for this purpose?

Cheers

ADD COMMENTlink modified 21 months ago by FX0 • written 21 months ago by saadleeshehreen70
1

You can use bioawk (bioawk -c fastx) to get this done.

ADD REPLYlink modified 21 months ago • written 21 months ago by RamRS30k
3
gravatar for JC
21 months ago by
JC11k
Mexico
JC11k wrote:

It can be done with Perl:

#!/usr/bin/perl

use strict;
use warnings;

my %seqs;
$/ = "\n>"; # read fasta by sequence, not by lines

while (<>) {
    s/>//g;
    my ($seq_id, @seq) = split (/\n/, $_);
    my $seq = uc(join "", @seq); # rebuild sequence as a single string
    my $len = length $seq;
    my $numA = $seq =~ tr/A//; # removing A's from sequence returns total counts
    my $numC = $seq =~ tr/C//;
    my $numG = $seq =~ tr/G//;
    my $numT = $seq =~ tr/T//;
    print "$seq_id: Size=$len  A=$numA  C=$numC  G=$numG  T=$numT\n";
}

Testing it:

$ perl count.pl < seqs.fa
spacer_1: Size=15 A=2  C=3  G=6  T=4
spacer_2: Size=8 A=1  C=1  G=2  T=4
ADD COMMENTlink written 21 months ago by JC11k
0
gravatar for FX
21 months ago by
FX0
FX0 wrote:

Using shell

while read line; do echo $line | grep -v '>' | grep -o "[ACGT]" | sort | uniq -c; \
echo $line | grep '>' ; done < file.fasta

The result:

>spacer_1
      2 A
      3 C
      6 G
      4 T
>spacer_2
      1 A
      1 C
      2 G
      4 T

Or use

while read line; do echo $line | grep -v '>' | grep -o "[ACGT]" | sort | uniq -c \
| paste - - - - ; echo $line | grep '>' | tr "\n" "\t" ; done < file.fasta

for a more convenient output

>spacer_1         2 A         3 C         6 G         4 T
>spacer_2         1 A         1 C         2 G         4 T
ADD COMMENTlink modified 21 months ago • written 21 months ago by FX0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 589 users visited in the last hour