I have three samples,they are small RNAs and these three or more samples were sequenced by Solexa high-thoughput sequencing.They are fasta format,the length of reads are 18-28nt.I want to statistics the number of identical sequence between different samples . their names are a.fa,b.fa,c.fa,
For example.enter link description here
a.fa
>1
ATCG
>2
ACTG
>3
GAAG
>4
ATCG
b.fa
>1
GAAG
>2
ATCG
c.fa
>1
ACTG
>2
AAG
The result is like below.
sequence a b c
ATCG 2 1 0
ACTG 1 0 1
GAAG 1 1 0
AAG 0 0 1
I don't think your question is answerable in it's current form - because you haven't really asked a question. What exactly is the question? Please offer more background information, maybe a little context so we can understand what you're trying to solve. You have a set number of sequences, and you have a process which places selections of those sequences into files - then you could ask the likelihood of any file getting some number of identical sequences. What is the pool that sequecnes are drawn from? Will your results matrix be a complete representation of all the sequences available?
Yes. Originally there was no mention of samples, and no description of any experiment, thus it was impossible to understand the any parameters for answering a question. Thanks 2011101101 for updating it, the question, and the answer, are obvious now :)
Using SQLITE3:
the following bash script creates a table containing a hash of the fasta sequence as well as a count of each fasta file for each sequence:
#!/bin/bash
echo "begin transaction;"
i=0;
echo "create table fasta(hash varchar(50) unique,sequence text"
i=0
for F in $@
do
((i++))
echo -n ",total${i} int default 0 "
done
echo ");"
i=0
for F in $@
do
((i++))
tr -d "\r " < $F | awk 'BEGIN {N=0;} /^>/{printf("\n");next;} {printf("%s",toupper($0));} END { printf("\n");}'| egrep -v -E '^$' | while read S
do
echo -n "$S" | sha1sum | awk -v S="${S}" -v i=${i} '{printf("insert or ignore into fasta(hash,sequence) values(\"%s\",\"%s\");\n",$1,S);printf("update fasta set total%s=total%s+1 where hash=\"%s\";\n",i,i,$1);}'
done
done
echo "select * from fasta;"
echo "end;"
Thank you.I have another question.If there are identical sequence in the one sample,can it calculate the number of the sequence.If
you don't understand my mean,I have edit my question.
If you take some time to understand Pierre's script, you'll see it generates exactly the output asked for in your question; so yes, it will count sequences per sample.
If you use Perl, this is comparitively simple (and much faster):
#!/usr/bin/env perl
use strict;
use warnings;
use autodie qw(open);
die "I need some fasta files!" unless @ARGV;
my @files = @ARGV;
my %motifs;
for my $file (@ARGV) {
open(my $in, '<', $file);
while (<$in>) {
chomp;
next if /^>/;
$motifs{$_}{$file}++;
}
close($in);
}
print join "\t", 'motif', @files, "\n";
for my $motif (sort keys %motifs) {
print "$motif";
for my $file (@files) {
my $cnt = 0;
$cnt = $motifs{$motif}{$file} if (defined $motifs{$motif}{$file});
print "\t$cnt";
}
print "\n";
}
#!/usr/bin/perl
use strict;
use warnings;
my %motifs;
my @files = @ARGV;
foreach my $file (@files) {
open F, "$file" or die;
while (<F>) {
next if (/>/);
chomp;
$motifs{$_}{$file}++;
}
close F;
}
print join "\t", 'motif', @files;
print "\n";
foreach my $motif (sort keys %motifs) {
print "$motif";
foreach my $file (@files) {
my $cnt = 0;
$cnt = $motifs{$motif}{$file} if (defined $motifs{$motif}{$file});
print "\t$cnt";
}
print "\n";
}
I don't think your question is answerable in it's current form - because you haven't really asked a question. What exactly is the question? Please offer more background information, maybe a little context so we can understand what you're trying to solve. You have a set number of sequences, and you have a process which places selections of those sequences into files - then you could ask the likelihood of any file getting some number of identical sequences. What is the pool that sequecnes are drawn from? Will your results matrix be a complete representation of all the sequences available?
Perhaps question has been edited since this comment? It seems reasonably clear to me; count sequences given multiple Fasta files as input.
Yes. Originally there was no mention of samples, and no description of any experiment, thus it was impossible to understand the any parameters for answering a question. Thanks 2011101101 for updating it, the question, and the answer, are obvious now :)
I have edited it.