Statistics The Number Of Identical Sequence Between Different Samples
2
0
Entering edit mode
11.2 years ago
2011101101 ▴ 110

I have three samples,they are small RNAs and these three or more samples were sequenced by Solexa high-thoughput sequencing.They are fasta format,the length of reads are 18-28nt.I want to statistics the number of identical sequence between different samples . their names are a.fa,b.fa,c.fa, For example.enter link description here

a.fa

>1
ATCG
>2
ACTG
>3
GAAG
>4
ATCG

b.fa

>1
GAAG
>2
 ATCG

c.fa

>1
ACTG
>2
AAG

The result is like below.

sequence a b c
ATCG     2 1 0
ACTG     1 0 1
GAAG     1 1 0
AAG      0 0 1
rna sequence • 3.0k views
ADD COMMENT
2
Entering edit mode

I don't think your question is answerable in it's current form - because you haven't really asked a question. What exactly is the question? Please offer more background information, maybe a little context so we can understand what you're trying to solve. You have a set number of sequences, and you have a process which places selections of those sequences into files - then you could ask the likelihood of any file getting some number of identical sequences. What is the pool that sequecnes are drawn from? Will your results matrix be a complete representation of all the sequences available?

ADD REPLY
1
Entering edit mode

Perhaps question has been edited since this comment? It seems reasonably clear to me; count sequences given multiple Fasta files as input.

ADD REPLY
0
Entering edit mode

Yes. Originally there was no mention of samples, and no description of any experiment, thus it was impossible to understand the any parameters for answering a question. Thanks 2011101101 for updating it, the question, and the answer, are obvious now :)

ADD REPLY
0
Entering edit mode

I have edited it.

ADD REPLY
5
Entering edit mode
11.2 years ago

Using SQLITE3: the following bash script creates a table containing a hash of the fasta sequence as well as a count of each fasta file for each sequence:

#!/bin/bash

echo "begin transaction;"

i=0;
echo "create table fasta(hash varchar(50) unique,sequence text"
i=0
for F in $@
do
    ((i++))
    echo -n ",total${i} int default 0 "
done
echo ");"


i=0
for F in $@
do
    ((i++))
    tr -d "\r " < $F |  awk 'BEGIN {N=0;} /^>/{printf("\n");next;} {printf("%s",toupper($0));} END { printf("\n");}'| egrep -v -E '^$' | while read S
    do
        echo -n "$S" | sha1sum  | awk -v S="${S}" -v i=${i} '{printf("insert or ignore into fasta(hash,sequence) values(\"%s\",\"%s\");\n",$1,S);printf("update fasta set total%s=total%s+1 where hash=\"%s\";\n",i,i,$1);}'
    done
done

echo "select * from fasta;"
echo "end;"

usage:

bash test.sh 1.fa  1.fa 2.fa | sqlite3 jeter.sqlite | head -n 3
3e61b123f4ca692f41e21c36d93beb02f35b8888|GCAATGTCACGTTGTCAACTCATTGCAAACAAAGACTGGACCAACATTGAACGGAGTGATTGGAAGACAATCCGGACAAGTTGCTGGATTTGACTACTCGGCTGCCAACAAGAATAAAGGAGTTGTATGGGACAGACAAACACTTTTCGACTATTTGGCTGATCCGAAAAAGTACATCCCCGGAACTAAGATGGTCTTCGCTGGTTTGAAGAAAGCTGACCAAACGAGCTGATCTCATCAAATTTATTGAAGTGGAAGCTGCCAAGAAACCATCGGCATAAGCCTCTACTAAATAAGAA|1|1|0
a23414c514d69c9e600e098aef1212361672a482|TCATGTTGGCTTCTCGGGGTTTTTATGGATTAATACATTTTCCAAACGATTCTTTGCGCCTTCTGTGGTGCCGCCTTCTCCGAAGGAACTGACGAAAAATGACGTGGATTTGCTGACAAATCCAGGCGAGGAATATTTGGACGGATTGATGAAATGGCACGGCGACGAGCGACCCGTGTTCAAAAGAGAGGACATTTATCGTTGGTCGGATAGTTTTCCAGAATATCGGCTAAGAATGATTTGTCTGAAAGACACGACAAGGGTCATTGCAGTCGGTCAATATTGTTACTTTGATGCTCTGAAAGAAAGGAGAGCAGCCATTGTTCTTCTTAGGATTGGGATGGACGGATCCTGAATATCGTAATCGGGCAGTTATGGAGCTTCAAGCTTCGATGGCGCTGGAGGAGAGGGATCGGTATCCGACTGCCAACGCGGCATCGCATCCAAATAAGTTCATGAAACGATTTTGGCACATATTCAACGGCCTCAAAGAGCACGAGGACAAAGGTCACAAGGCTGCCGCTGTTTCATACAAGAGCTTCTACGACCTCANAGACATGATCATTCCTGAAAATCTGGATGTCAGTGGTATTACTGTAAATGATGCACGAAAGGTGCCACAAAGAGATATAATCAACTACGATCAAACATTTCATCCATATCATCGAGAAATGGTTATAATTTCTCACATGTATGACAATGATGGGTTTGGAAAAGTGCGTATGATGAGGATGGAAATGTACTTGGAATTGTCTAGCGATGTCTTTANACCAACAAGACTGCACATTAGTCAATTATGCAGATAGCC|1|1|0
91178681f34b1bf66b911b4d80c896b33f2aa5a3|AATCTGTACATCTTCAATTGTGGTTCACTTCTTCTATCGTCTTGTTCGAGAAAACCACGGAGAAAAGGAGCAAGACCGTGGATTGAAAGACACCAAAGAAACCGCCAAGGATGTGCTGGGTTTTGTAAAAATGCTTGGAATAATCCTAGCTATGGTTGTAGGCTTTGCCTTGTTGGGGTTTGTCACGTTTTATCTCTATCAGTATGCGAG|1|1|0
ADD COMMENT
0
Entering edit mode

Thank you.I have another question.If there are identical sequence in the one sample,can it calculate the number of the sequence.If
you don't understand my mean,I have edit my question.

ADD REPLY
1
Entering edit mode

If you take some time to understand Pierre's script, you'll see it generates exactly the output asked for in your question; so yes, it will count sequences per sample.

ADD REPLY
0
Entering edit mode

Thank you ,I will try

ADD REPLY
4
Entering edit mode
11.2 years ago
SES 8.6k

If you use Perl, this is comparitively simple (and much faster):

#!/usr/bin/env perl

use strict;
use warnings;
use autodie qw(open);

die "I need some fasta files!" unless @ARGV;

my @files = @ARGV;
my %motifs;
for my $file (@ARGV) {
    open(my $in, '<', $file);
    while (<$in>) {
        chomp;
        next if /^>/;
        $motifs{$_}{$file}++;
    }
    close($in);
}


print join "\t", 'motif', @files, "\n";
for my $motif (sort keys %motifs) {
    print "$motif";
    for my $file (@files) {
        my $cnt = 0;
        $cnt = $motifs{$motif}{$file} if (defined $motifs{$motif}{$file});
        print "\t$cnt";
    }
    print "\n";
}

Just list your files after the script, i.e.

perl biostars63968.pl a.fa b.fa c.fa

This produces:

motif    a.fa    b.fa    c.fa    
AAG     0    0    1
ACTG    1    0    1
ATCG    2    1    0
GAAG    1    1    0

You can now import the results into a spreadsheet or some other program like R.

EDIT: Modified based on JC's comment so it will produce the desired table of counts, instead of just columns of motif counts per file.

ADD COMMENT
1
Entering edit mode

I will modify your code to output the table:

#!/usr/bin/perl

use strict;
use warnings;

my %motifs;
my @files = @ARGV;

foreach my $file (@files) {
    open F, "$file" or die;
    while (<F>) {
        next if (/>/);
        chomp;
        $motifs{$_}{$file}++;
    }
    close F;
}

print join "\t", 'motif', @files;
print "\n";
foreach my $motif (sort keys %motifs) {
    print "$motif";
    foreach my $file (@files) {
        my $cnt = 0;
        $cnt = $motifs{$motif}{$file} if (defined $motifs{$motif}{$file});
        print "\t$cnt";
    }
    print "\n";
}
ADD REPLY
0
Entering edit mode

Nice! I'll update my answer to output the table.

ADD REPLY
0
Entering edit mode

Yes,it's very perfect !Thank you very much .

ADD REPLY

Login before adding your answer.

Traffic: 1867 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6