Question: How can I calculate the number of sequences starting with different nucleotides using perl
0
gravatar for MAPK
5 days ago by
MAPK1.1k
United States
MAPK1.1k wrote:

I am not familiar with perl (I mostly work with R) and I was hoping someone would help me with this problem. I have this test.fasta file below. I need to get this table from this fasta file where I have sequence length starting from minimum sequence length to maximum sequence length. Then I want to get the counts for sequences with the start bases ( first base in 5') along with the number of sequences for that given length. For example, In the given fasta below, seq1, seq5 and seq6 are 12 bases long, so in the result table below I have length 12, number of sequences of 12 bases length = 3 and the sequences starting with A= 2 and sequences starting with T =1 and so forth. How do I get this done in perl? Thank you for your help.

test.fasta

>seq1
AATTGGTTTGTT
>seq2
AATTGTGGGTGGTTGT
>seq3
TGGTTTGGGTGGTAA
>seq4
TTGGGGTAAAAAAATTTAA
>seq5
TATTGGTTTGTT
>seq6
AAGTGGTTTGTT

Result I want (shown only for the sequence of length 12bases).

length number A   T   G   C
12      3     2   1    0  0
perl fasta • 140 views
ADD COMMENTlink modified 5 days ago by Neilfws47k • written 5 days ago by MAPK1.1k
3
gravatar for lieven.sterck
5 days ago by
lieven.sterck510
Belgium, Ghent, VIB
lieven.sterck510 wrote:

this script will do it:

#!/usr/bin/env perl
#
use strict;
use warnings;

my %counts;

while (<STDIN>){
        next if ($_ =~ /^>/ || $_ =~ /^\n/);
        chomp;
        my $len = length($_);
        my $fb = substr($_,0,1);
        $counts{$len}{'count'}++;
        if (!$counts{$len}{'A'}) {$counts{$len}{'A'} =0;}
        if (!$counts{$len}{'C'}) {$counts{$len}{'C'} =0;}
        if (!$counts{$len}{'G'}) {$counts{$len}{'G'} =0;}
        if (!$counts{$len}{'T'}) {$counts{$len}{'T'} =0;}
        $counts{$len}{$fb}++;

}

print "length\tnumber\tA\tT\tG\tC\n";
foreach my $c (sort { $a <=> $b } keys %counts){
        print "$c\t$counts{$c}{'count'}\t$counts{$c}{'A'}\t$counts{$c}{'T'}\t$counts{$c}{'G'}\t$counts{$c}{'C'}\n";
}

exit;

reads from STDIN outputs to STDOUT, execute as

perl script.pl < yourFile
ADD COMMENTlink modified 4 days ago • written 5 days ago by lieven.sterck510

Thank you so much. How do I get the sorted values in the table (shortest to longest sequence length)?

ADD REPLYlink written 5 days ago by MAPK1.1k
1

Look into unix sort. You can sort on columns.

ADD REPLYlink written 5 days ago by genomax42k
1

I updated the script such that it now prints the list from shortest to longest.

or as genomax says : use unix sort (will allow you to sort on whatever column you want)

ADD REPLYlink modified 4 days ago • written 4 days ago by lieven.sterck510
3
gravatar for Neilfws
5 days ago by
Neilfws47k
Sydney, Australia
Neilfws47k wrote:

As you mentioned that you normally work in R, I thought an R-based solution might be of interest.

Good packages for working with sequences are Bioconductor's Biostrings and the older seqinr.

Using seqinr, you can read in a fasta-formatted file like so:

library(seqinr)
myseqs <- read.fasta("test.fasta")

Then you can use functions from seqinr to get the sequence length and first character. I'd use the tidyverse packages dplyr and tidyr to manipulate the results. One issue is that both seqinr and dplyr have a count() function, so you need to specify dplyr::count() explicitly.

An example. In this case counts are not returned for C or G since no sequences in the test set start with those bases. You could fix that up later if desired, along with the NA values (zero may be preferred). I would also be inclined to leave the output in "long" format (_i.e._ don't use spread), but you may want it to be "wide" for a reason.

tibble(names = names(myseqs), 
       len = getLength(myseqs), 
       first = sapply(myseqs, function(x) getFrag.character(x, 1, 1))) %>% 
dplyr::count(len, first) %>% 
group_by(len) %>% 
mutate(total = sum(n)) %>% 
spread(first, n)

# A tibble: 4 x 4
# Groups:   len [4]
    len total     a     t
  <int> <int> <int> <int>
1    12     3     2     1
2    15     1    NA     1
3    16     1     1    NA
4    19     1    NA     1
ADD COMMENTlink written 5 days ago by Neilfws47k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 505 users visited in the last hour