Question: Count Sequences In Multi Fasta File
1
gravatar for empyrean999
6.2 years ago by
empyrean999170
Canada
empyrean999170 wrote:

Hello,

I have 10 fasta files with sequenced reads information with read sizes from 15 - 35 . I have combined the reads and collapsed in to unique reads and filtered for sizes 18 - 26 bp long unique reads. Now i wanted to count each unique read appearance in all the fasta files and make a table with sample names as columns and reads as rows. I tried to use "grep -w "sequence name" file name " to count the tags but this seems to take long time. does anyone know how to do this faster?

Edited: Sorry for the confusion. Here is the input and output

This is the input file where it contains unique sequences. i have more than million such unique sequences.

Query:
>tag1
TCGGA
>tag2
TCTCA
>tag3
TCTCGC

These are multiple files. for example i am showing with 3 files. i have more than 20 such files. each file contains more than 10 million sequences each

File1:
>file1_id1
TCGGA
>file1_id1
TCGGAT
>file1_id2
TCTCA
>file1_id3
TCTCA

File2:
>file2_id1
TCTCA
>file2_id2
TCTCA
>file2_id3
TCTCACTA
>file2_id4
TCTCGC
>file2_id5
TCTCGCCTAT
>file2_id6
TCTCGC

File3:
>file1_id1
TCGGA
>file1_id1
TCGGAT
>file2_id4
TCTCGC
>file2_id5
TCTCGCCTAT
>file2_id6
TCTCGC

I need the following output. Search has to be exact for the count. output:

       sequence      file1      file2      file3
tag1    TCGGA        1        0        1
tag2    TCTCA        2        2        0        
tag3    TCTCGC        0        2        2
fasta perl unix python • 4.3k views
ADD COMMENTlink modified 6.2 years ago by Kenosis1.2k • written 6.2 years ago by empyrean999170
2
gravatar for Neilfws
6.2 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

EDIT: question edited after I posted this answer, so it may not be relevant.

It's not clear from your question what part of the fasta file you are trying to count and what criteria specify a "unique read appearance".

If we assume that reads can appear more than once in the file myfile.fa and that the same read (sequence) will have the same ID - for example:

>id1
acgt...
>id1
acgt....
>id2
ggga...

then you can extract and count IDs like this:

grep -o -E "^>\w+" myfile.fa | tr -d ">" | uniq -c

  2 id1
  1 id2
ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Neilfws48k
1
gravatar for Kenosis
6.2 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

Given your datasets--one query file and three seqnence files--the following produces your desired results:

use strict;
use warnings;
use File::Basename;
use Sort::Naturally;

local $/ = '>';
my ( %files, %query, %omit, @F );
$omit{$_}++ for ( $ARGV[0], basename $0);

while (<>) {
    chomp;
    $query{ $F[0] } = $F[1] if @F = split;
}

local $/ = "\n";
my @files = nsort @ARGV = grep !$omit{$_}, <*>;

while (<>) {
    next if /^>/;
    chomp;
    $files{$_}{$ARGV}++;
}

print join( "\t", 'tag', 'seq', @files ), "\n";

for my $key ( nsort keys %query ) {
    my @seqCount;
    push @seqCount, $files{ $query{$key} }->{$_} || 0 for @files;
    print join( "\t", $key, $query{$key}, @seqCount ), "\n";
}

Output on your datasets:

tag    seq    file1    file2    file3
tag1    TCGGA    1    0    1
tag2    TCTCA    2    2    0
tag3    TCTCGC    0    2    2

Place the script file in the same directory as all of your other files, and call it from the command line as follows:

perl script.pl queryFile

The script first processes the queryFile, creating a hash where the keys are the tags and the associated values are the sequences. Next, it reads all the file names in the directory, omitting (via grep) the query and script files. It builds a hash of hashes (HoH) to tally the sequences in the seq files. Finally, it prints the results.

Sort::Naturally is used to keep the alphanumeric tag and file names in proper order.

Hope this helps!

ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by Kenosis1.2k

Perfect ! this works really welly and easy to understand as well. Something new i learnt today. Thank you

ADD REPLYlink written 6.2 years ago by empyrean999170

You're most welcome!

ADD REPLYlink written 6.2 years ago by Kenosis1.2k
0
gravatar for QVINTVS_FABIVS_MAXIMVS
6.2 years ago by
USA SoCal
QVINTVS_FABIVS_MAXIMVS2.4k wrote:

You tagged Perl. You can use a hash and for the key you can concatenate the FASTA id and sequence with a unique delimiter.

$key = join "~", $id,$seq;
$seqHash{$key}++;

....

foreach $key (sort keys %seqHash){
@temp = split /~/, $key;
print $temp[0]."\t".$seqHash{$key};
}
ADD COMMENTlink written 6.2 years ago by QVINTVS_FABIVS_MAXIMVS2.4k

i dont want to combine id and sequence and search as the id is different in tags and other fasta files

ADD REPLYlink written 6.2 years ago by empyrean999170

$id == $sample as you mentioned above. If you already uniqued your seqs? So are you trying to count identical reads across samples? You would have to use the sequence as a key in Perl. But I'm sure there's a better way to do this like in Python

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by QVINTVS_FABIVS_MAXIMVS2.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 713 users visited in the last hour