Question

Statistics The Number Of Identical Sequence Between Different Samples

0

Entering edit mode

11.2 years ago

2011101101 ▴ 110

I have three samples,they are small RNAs and these three or more samples were sequenced by Solexa high-thoughput sequencing.They are fasta format,the length of reads are 18-28nt.I want to statistics the number of identical sequence between different samples . their names are a.fa,b.fa,c.fa, For example.enter link description here

a.fa

>1
ATCG
>2
ACTG
>3
GAAG
>4
ATCG

b.fa

>1
GAAG
>2
 ATCG

c.fa

>1
ACTG
>2
AAG

The result is like below.

sequence a b c
ATCG     2 1 0
ACTG     1 0 1
GAAG     1 1 0
AAG      0 0 1

rna sequence • 3.0k views

ADD COMMENT • link updated 11.2 years ago by SES 8.6k • written 11.2 years ago by 2011101101 ▴ 110

2

Entering edit mode

I don't think your question is answerable in it's current form - because you haven't really asked a question. What exactly is the question? Please offer more background information, maybe a little context so we can understand what you're trying to solve. You have a set number of sequences, and you have a process which places selections of those sequences into files - then you could ask the likelihood of any file getting some number of identical sequences. What is the pool that sequecnes are drawn from? Will your results matrix be a complete representation of all the sequences available?

ADD REPLY • link 11.2 years ago by seidel 11k

1

Entering edit mode

Perhaps question has been edited since this comment? It seems reasonably clear to me; count sequences given multiple Fasta files as input.

ADD REPLY • link 11.2 years ago by Neilfws 49k

0

Entering edit mode

Yes. Originally there was no mention of samples, and no description of any experiment, thus it was impossible to understand the any parameters for answering a question. Thanks 2011101101 for updating it, the question, and the answer, are obvious now :)

ADD REPLY • link 11.2 years ago by seidel 11k

0

Entering edit mode

I have edited it.

ADD REPLY • link 11.2 years ago by 2011101101 ▴ 110

score 5 · Answer 1 · 2013-02-18

Using SQLITE3: the following bash script creates a table containing a hash of the fasta sequence as well as a count of each fasta file for each sequence:

#!/bin/bash

echo "begin transaction;"

i=0;
echo "create table fasta(hash varchar(50) unique,sequence text"
i=0
for F in $@
do
    ((i++))
    echo -n ",total${i} int default 0 "
done
echo ");"


i=0
for F in $@
do
    ((i++))
    tr -d "\r " < $F |  awk 'BEGIN {N=0;} /^>/{printf("\n");next;} {printf("%s",toupper($0));} END { printf("\n");}'| egrep -v -E '^$' | while read S
    do
        echo -n "$S" | sha1sum  | awk -v S="${S}" -v i=${i} '{printf("insert or ignore into fasta(hash,sequence) values(\"%s\",\"%s\");\n",$1,S);printf("update fasta set total%s=total%s+1 where hash=\"%s\";\n",i,i,$1);}'
    done
done

echo "select * from fasta;"
echo "end;"

usage:

bash test.sh 1.fa  1.fa 2.fa | sqlite3 jeter.sqlite | head -n 3
3e61b123f4ca692f41e21c36d93beb02f35b8888|GCAATGTCACGTTGTCAACTCATTGCAAACAAAGACTGGACCAACATTGAACGGAGTGATTGGAAGACAATCCGGACAAGTTGCTGGATTTGACTACTCGGCTGCCAACAAGAATAAAGGAGTTGTATGGGACAGACAAACACTTTTCGACTATTTGGCTGATCCGAAAAAGTACATCCCCGGAACTAAGATGGTCTTCGCTGGTTTGAAGAAAGCTGACCAAACGAGCTGATCTCATCAAATTTATTGAAGTGGAAGCTGCCAAGAAACCATCGGCATAAGCCTCTACTAAATAAGAA|1|1|0
a23414c514d69c9e600e098aef1212361672a482|TCATGTTGGCTTCTCGGGGTTTTTATGGATTAATACATTTTCCAAACGATTCTTTGCGCCTTCTGTGGTGCCGCCTTCTCCGAAGGAACTGACGAAAAATGACGTGGATTTGCTGACAAATCCAGGCGAGGAATATTTGGACGGATTGATGAAATGGCACGGCGACGAGCGACCCGTGTTCAAAAGAGAGGACATTTATCGTTGGTCGGATAGTTTTCCAGAATATCGGCTAAGAATGATTTGTCTGAAAGACACGACAAGGGTCATTGCAGTCGGTCAATATTGTTACTTTGATGCTCTGAAAGAAAGGAGAGCAGCCATTGTTCTTCTTAGGATTGGGATGGACGGATCCTGAATATCGTAATCGGGCAGTTATGGAGCTTCAAGCTTCGATGGCGCTGGAGGAGAGGGATCGGTATCCGACTGCCAACGCGGCATCGCATCCAAATAAGTTCATGAAACGATTTTGGCACATATTCAACGGCCTCAAAGAGCACGAGGACAAAGGTCACAAGGCTGCCGCTGTTTCATACAAGAGCTTCTACGACCTCANAGACATGATCATTCCTGAAAATCTGGATGTCAGTGGTATTACTGTAAATGATGCACGAAAGGTGCCACAAAGAGATATAATCAACTACGATCAAACATTTCATCCATATCATCGAGAAATGGTTATAATTTCTCACATGTATGACAATGATGGGTTTGGAAAAGTGCGTATGATGAGGATGGAAATGTACTTGGAATTGTCTAGCGATGTCTTTANACCAACAAGACTGCACATTAGTCAATTATGCAGATAGCC|1|1|0
91178681f34b1bf66b911b4d80c896b33f2aa5a3|AATCTGTACATCTTCAATTGTGGTTCACTTCTTCTATCGTCTTGTTCGAGAAAACCACGGAGAAAAGGAGCAAGACCGTGGATTGAAAGACACCAAAGAAACCGCCAAGGATGTGCTGGGTTTTGTAAAAATGCTTGGAATAATCCTAGCTATGGTTGTAGGCTTTGCCTTGTTGGGGTTTGTCACGTTTTATCTCTATCAGTATGCGAG|1|1|0

score 4 · Answer 2 · 2013-02-18

4

Entering edit mode

11.2 years ago

SES 8.6k

If you use Perl, this is comparitively simple (and much faster):

#!/usr/bin/env perl

use strict;
use warnings;
use autodie qw(open);

die "I need some fasta files!" unless @ARGV;

my @files = @ARGV;
my %motifs;
for my $file (@ARGV) {
    open(my $in, '<', $file);
    while (<$in>) {
        chomp;
        next if /^>/;
        $motifs{$_}{$file}++;
    }
    close($in);
}


print join "\t", 'motif', @files, "\n";
for my $motif (sort keys %motifs) {
    print "$motif";
    for my $file (@files) {
        my $cnt = 0;
        $cnt = $motifs{$motif}{$file} if (defined $motifs{$motif}{$file});
        print "\t$cnt";
    }
    print "\n";
}

Just list your files after the script, i.e.

perl biostars63968.pl a.fa b.fa c.fa

This produces:

motif    a.fa    b.fa    c.fa    
AAG     0    0    1
ACTG    1    0    1
ATCG    2    1    0
GAAG    1    1    0

You can now import the results into a spreadsheet or some other program like R.

EDIT: Modified based on JC's comment so it will produce the desired table of counts, instead of just columns of motif counts per file.

ADD COMMENT • link 11.2 years ago by SES 8.6k

1

Entering edit mode

I will modify your code to output the table:

#!/usr/bin/perl

use strict;
use warnings;

my %motifs;
my @files = @ARGV;

foreach my $file (@files) {
    open F, "$file" or die;
    while (<F>) {
        next if (/>/);
        chomp;
        $motifs{$_}{$file}++;
    }
    close F;
}

print join "\t", 'motif', @files;
print "\n";
foreach my $motif (sort keys %motifs) {
    print "$motif";
    foreach my $file (@files) {
        my $cnt = 0;
        $cnt = $motifs{$motif}{$file} if (defined $motifs{$motif}{$file});
        print "\t$cnt";
    }
    print "\n";
}

ADD REPLY • link 11.2 years ago by JC 13k

0

Entering edit mode

Nice! I'll update my answer to output the table.

ADD REPLY • link 11.2 years ago by SES 8.6k

0

Entering edit mode

Yes,it's very perfect !Thank you very much .

ADD REPLY • link 11.2 years ago by 2011101101 ▴ 110