Question: randomly sample sequences of fasta file per sample name
0
gravatar for nmkn
2.3 years ago by
nmkn0
nmkn0 wrote:

I have a task that I'm sure has been done before but I can't find a simple solution. Given a fasta file, with multiple sequences per species/individual name, I want to randomly sample one sequence per species/individual. Not each species/individual will always have the same number of sequences.

I've found a few similar posts (https://www.biostars.org/p/18831/), but they randomly sample a subset based on a given percentage. I think I want to do something similar to this post, but that was never fully answered: Help with randomly sampling from a fasta file.

Here's my example:

>SpeciesA.seq0   CCACTTTA......
>SpeciesA.seq1   CCTCTTTA......
>SpeciesA.seq2   CCGCTTTA......
>SpeciesA.seq3   CCACTTTA......
>SpeciesB.seq0   GCCCTTTA......
>SpeciesB.seq1   GCCCTTTA.....
>SpeciesB.seq2   ACCCTTTA.....
>SpeciesB.seq3   GCCCTTTA.....
>SpeciesC.seq0   GCCCTTTA.....
>SpeciesC.seq1   GCCCTTTA.....

I want to randomly select one sequence per species/individual so the output will look like this:

>SpeciesA.seq3   CCACTTTA......
>SpeciesB.seq1   GCCCTTTA.....
>SpeciesC.seq1   GCCCTTTA.....

Ideally, the resulting sequences will be concatenated/pasted into a new file with the same name as the starting input fasta file (which corresponds to a gene name). I only have limited bash experience so help would be greatly appreciated!

random sequence fasta • 1.5k views
ADD COMMENTlink modified 2.3 years ago by cpad011213k • written 2.3 years ago by nmkn0

If an answer below was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.
Upvote|Bookmark|Accept

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by genomax85k
3
gravatar for JC
2.3 years ago by
JC10k
Mexico
JC10k wrote:

Perl:

#!/usr/bin/perl

use strict;
use warnings;

my %seqs =();

$/="\n>";
while (<>) {
    s/>//g;
    my ($name, @seq) = split (/\n/, $_);
    my ($sp, $id) = split (/\./, $name);
    push @{ $seqs{$sp} }, join "\n", ">$name", @seq;
}

foreach my $sp (keys %seqs) {
    print $seqs{$sp}[ int( rand @{ $seqs{$sp} }) ], "\n";
}

Examples:

perl rand.pl < seqs.fa
>SpeciesC.seq0
GCCCTTTA.....
>SpeciesA.seq0
CCACTTTA......
>SpeciesB.seq1
GCCCTTTA.....
$ perl rand.pl < seqs.fa
>SpeciesB.seq0
GCCCTTTA......
>SpeciesC.seq1
GCCCTTTA.....
>SpeciesA.seq1
CCTCTTTA......

Explanation: use a hash of arrays, each species collects their sequences in an array, then just iterate over species getting one random value of the array length.

ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by JC10k

Thank you very much! This perl option works well and I think will be the most efficient for many files. Just a follow up - how can I get this to work as a loop since I have hundreds of fasta files? Sorry if that wasn't clear in the original post. I'm having issues:

for file in *.fasta; do (perl rand.pl<"$file")>"$file_new"; done

Thanks

ADD REPLYlink written 2.3 years ago by nmkn0
1

that bash line should work, just define "file_new" or change the name:

for file in *fasta; do perl rand.pl < $file > ${file%.fasta}_subset.fasta; done
ADD REPLYlink written 2.3 years ago by JC10k

That solves it and it was very fast - thank you very much! And thank you everyone else for your answers too.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by nmkn0
0
gravatar for Pierre Lindenbaum
2.3 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:
  • shuffle your linearized fasta with shuf
  • convert the dot to tab with tr
  • use sort one the first column using --stable and -u (unique)
  • convert back the first tab to dot

     shuf linearized.tsv  | tr "." "\t" | sort -t $'\t' -k1,1 --stable -u | sed 's/\t/./'
    
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by Pierre Lindenbaum129k

Thank you, but I'm not sure if this does what I want. First, I have standard fasta files not .tsv files. Also, how does this randomly select one sequence per sample name and paste into a new file? I apologize if i just don’t understand.

ADD REPLYlink written 2.3 years ago by nmkn0

First, I have standard fasta files not .tsv files

linearize as described in the other posts. see https://gist.github.com/lindenb/2c0d4e11fd8a96d4c345

Also, how does this randomly select one sequence per sample name

this is the aim of the -u flag in sort

and paste into a new file?

do you want into a new file. use redirection: https://stackoverflow.com/questions/876239/how-can-i-redirect-and-append-both-stdout-and-stderr-to-a-file-with-bash

all in one:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' in.fa | shuf  | tr "." "\t" | sort -t $'\t' -k1,1 --stable -u | sed 's/\t/./' | tr "\t" "\n" > out.fa
ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by Pierre Lindenbaum129k
0
gravatar for cpad0112
2.3 years ago by
cpad011213k
India
cpad011213k wrote:

with seqkit and bash (Assuming that input is in fasta format, not in tsv as in OP and fasta records are linearized):

for i in $(seqkit seq -n test.fa | cut -f1 -d"." | uniq);
    do seqkit grep -rp $i test.fa  | seqkit seq -n | shuf -n1 | grep -A1 -f - test.fa;
done;

with seqkit, datamash and sed:

 $ seqkit fx2tab test.fa | sed 's/\./\t/' | datamash -sg1 rand 2,3 | sed 's/\t/\./' | seqkit tab2fx

with awk, sed and datamash:

$ awk 'BEGIN{RS=">"}{print $1"\t"$2}' test.fa |sed -e '1d;s/\./\t/g'| datamash -sg1 rand 2,3 | sed 's/\t/\./' |  awk 'BEGIN{RS="\n"}{print ">"$1"\n"$2}
ADD COMMENTlink modified 2.3 years ago • written 2.3 years ago by cpad011213k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1115 users visited in the last hour