Unique-Ify The Contents Of A Fasta-Formatted File And Associated Read Numbers
2
0
Entering edit mode
10.0 years ago
2011101101 ▴ 110

I have a file whose format is shown below:

>PH1_2044586      3
CAAGT
>SM1_2913489      2
CAAGT
>PH2_3543355      2
TTCTG
>WP2_579263      1
TTCTG


The result I want looks like this:

>1    5
CAAGT
>2    3
TTCTG

fasta format reads • 2.3k views
1
Entering edit mode
10.0 years ago

You can use groupBy from bedtools:

cat your.fa | paste - - | perl -ane 'print join("\t",@F)."\n";' | sort -k3 | groupBy -g 3 -c 2 -o sum | perl -ane '$a++;print ">$a\t$F[1]\n$F[0]\n";'

0
Entering edit mode

You are assuming that every sequence is very short; your paste command will break on longer, multi-line records. Also, there is no need to use cat, just run paste - - < your.fa | perl ....

0
Entering edit mode

You're right, it can be written much more cleanly. I just wanted to show one way of solving the problem, and I always start with a 'cat' or a 'more' and build up the pipe step by step, without cleaning up the previous call(s). And I assume the sequences are NGS reads, so they will not be of great length and 'paste' should work fine here. But if there are millions of reads, the 'sort' might run for a very long time. As I said, just one possibility: with millions of entries, a Perl hash needs a lot of main memory, and a 'sort' needs a lot of time. :)

0
Entering edit mode

A Perl hash won't use that much memory (an array will use more): storing 5 million 100 bp sequences used 1.5 GB of RAM on my machine, and if that is prohibitive you don't have to store things in memory at all. The downside is that disk I/O is a lot slower than memory. The memory usage should be low for this data set if these are the actual read lengths.

0
Entering edit mode

Well, current NGS experiments can have up to 200 million 100 bp sequences. But you are right. Here is the command-line version with a Perl hash:

perl -ane 'if(/>/){$c=$F[1];}else{$h{$F[0]}+=$c;}END{foreach(keys %h){$a++;print ">$a\t$h{$_}\n$_\n";}}' test.fa
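For comparison, the same hash-based tally can be sketched in Python (the function name collapse_fasta and the inline sample data are illustrative, not from the thread; for a real file you would pass a file handle such as open('test.fa') instead of the sample list):

```python
from collections import defaultdict

def collapse_fasta(lines):
    """Sum per-read counts for each unique sequence.

    Assumes the layout from the question: each header line is
    '>ID      COUNT' and is followed by exactly one sequence line.
    """
    counts = defaultdict(int)
    current = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = int(line.split()[1])  # second column = read count
        else:
            counts[line] += current
    return counts

if __name__ == "__main__":
    sample = [
        ">PH1_2044586      3", "CAAGT",
        ">SM1_2913489      2", "CAAGT",
        ">PH2_3543355      2", "TTCTG",
        ">WP2_579263      1", "TTCTG",
    ]
    # Print unique sequences, most abundant first, renumbered from 1
    for i, (seq, total) in enumerate(
            sorted(collapse_fasta(sample).items(),
                   key=lambda kv: kv[1], reverse=True), 1):
        print(f">{i}\t{total}\n{seq}")
```

On the sample data this reproduces the desired output from the question (CAAGT with count 5, then TTCTG with count 3). Like the Perl hash, it holds one entry per unique sequence in memory, so the same caveat about very large data sets applies.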

0
Entering edit mode

You are right about NGS data sets and memory usage; I would take a different approach in that case. We don't know anything about the data/question in this case, so I think any approach should be equally fine.

0
Entering edit mode

Thank you for your answer. My data is small RNA with about 700 million unique reads, but I have solved my question by using David Langenberger's command.

1
Entering edit mode
10.0 years ago
SES 8.5k

Here is a BioPerl version:

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::SeqIO;

my %seqs;
my $cts = 0;
my $seqio = Bio::SeqIO->new(-fh => \*DATA, -format => 'fasta');

# Sum the read counts (stored in the description field) per unique sequence
while (my $seq = $seqio->next_seq) {
    $seqs{$seq->seq} += $seq->desc;
}

# Print unique sequences, most abundant first
for my $mer (reverse sort { $seqs{$a} <=> $seqs{$b} } keys %seqs) {
    $cts++;
    print ">$cts  $seqs{$mer}\n$mer\n";
}

__DATA__
>PH1_2044586      3
CAAGT
>SM1_2913489      2
CAAGT
>PH2_3543355      2
TTCTG
>WP2_579263      1
TTCTG

Output:

$ perl biostars67954.pl
>1  5
CAAGT
>2  3
TTCTG

0
Entering edit mode

Thank you very much. I will learn BioPerl. Thank you again!