Choosing Random Set Of Seqs From Larger Set
12
13
Entering edit mode
10.2 years ago
razor ▴ 190

Do you know any quick and dirty solution for choosing a random subset of sequences (with or without replacement) from a larger set in a fasta file?

Something like:

command -size 10 -in lotsofseqs.fasta -out sample.fasta
random fasta • 13k views
ADD COMMENT
23
Entering edit mode
10.2 years ago

linearize your sequences, generate a random order with shuf and print the 'N' first lines.

cat input.fasta |\
awk '/^>/ { if(i>0) printf("\n"); i++; printf("%s\t",$0); next;} {printf("%s",$0);} END { printf("\n");}' |\
shuf |\
head -n 2 |\
awk '{printf("%s\n%s\n",$1,$2)}'
ADD COMMENT
4
Entering edit mode

I know someone is going to give this shuf solution. :) A problem with it is that shuf loads all the data into memory, which makes it not suitable for very large files. If we only downsample a small file from large one, my solution is much better. In addition, if the input parameter is a fraction rather than a fixed number, hashing will be even better.

ADD REPLY
0
Entering edit mode

shuf uses reservoir sampling (bounded memory) since v8.22

ADD REPLY
1
Entering edit mode

shuf is not native to Macs.

ADD REPLY
1
Entering edit mode

maasha, if you want to install them, and use brew, then brew install xz coreutils will get you shuf (defined as gshuf)

ADD REPLY
1
Entering edit mode

I used it for one of my datasets and I've found that to make it more kind of "accurate" it would be better to add BEGIN{FS="\t"} in the last awk:

...

head -n 2 |\
awk 'BEGIN{FS="\t"}{printf("%s\n%s\n",$1,$2)}'

That way if the fasta header is longer, with at least one space, like:

>MC3.MG1.AS1.GP1.C1.G2 OG=COG1564|Taxon=bacteria|StartStop=11|ClusterSize=83
ATGAAAACAGAAAGCAAATATTGTCTCATTGTCACAGGTGGCTCGGTCGATCAGAATTTGCTCGCGACCGTTTATGA

after shuffling, it won't become:

>MC3.MG80.AS1.GP1.C14688.G3
OG=NOG80451|Taxon=bacteria|StartStop=11|ClusterSize=2
>MC3.MG235.AS1.GP1.C49859.G6...

Hope it helps

ADD REPLY
0
Entering edit mode

that's it! no dependencies, bio packages, just basic commands. (somehow i never used shuf before.)

ADD REPLY
0
Entering edit mode

In place of shuf, you can do this on a Mac:

perl -MList::Util -e 'print List::Util::shuffle <>'
ADD REPLY
0
Entering edit mode

A slight variation of the linearization with awk, explained in more detail: https://riptutorial.com/bioinformatics/example/14692/linearize-a-fasta-sequence-with-awk

DISCLAIMER: This code and its explanation is elaborated by someone else. I just add it as supplementary information to further understand the best answer provided to the question

ADD REPLY
7
Entering edit mode
10.2 years ago
lh3 33k

I used to provide an answer at seqanswers.

You need bioawk for it to work. With bioawk, you can:

awk -c fastx -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq.fa.gz

This solution is based on reservoir sampling. It is optimal in terms of both time and space complexity and works with an arbitrarily large input file. Most other solutions need to read all data into memory.

Of course you can also use perl/python to re-implement the awk line. I use bioawk only because it has a built-in fasta/fastq parser.

ADD COMMENT
1
Entering edit mode

This is without replacement.

ADD REPLY
1
Entering edit mode

No, this is not true. Reservoir sampling may seem too simplistic, but it guarantees uniform randomness. You can find the proof via google. It is theoretically the best solution for sampling a fixed number of records without replacement. You even do not need to know the size of reservior.

ADD REPLY
1
Entering edit mode

The first time I read about reservoir sampling is when I saw that seqanswers post. By intuition, I think there must be a solution working on a data stream - those who work on endless time series must be faced with such a situation all the time. Nonetheless, I did not know the answer, so I googled and found reservior sampling.

ADD REPLY
0
Entering edit mode

The thing about reservoir sampling is that there is 'decreasing probability', once the k first elements have been filled, that a new sequence will be retained, thus making sequences at the end of the file less likely to be taken and sequences from the beginning of the file more likely to be replaced. How close to random selection will the resulting list of sequences really be? I provide a solution that works with gigantic files, at the cost of having to scan the file twice and doing some extra trimming but ensuring that everybody gets equal chances.

ADD REPLY
0
Entering edit mode

I'll look it up, it is a very interesting solution.

ADD REPLY
0
Entering edit mode

I'm curious, did you learn about reservoir sampling in a book about algorithms or when searching for a solution to a specific problem you had? (Just trying to decide whether to spend time to read through a book on algo ;)

ADD REPLY
6
Entering edit mode
10.2 years ago
Neilfws 49k

This page has lots of useful sequence utilities, including a script named random_sequence_reads.pl (zip or tarball to download). It does sampling with/without replacement and seems to work, based on my quick test.

Note: the test.sh script samples all sequences from the input file which is confusing and not useful - edit it to change the parameters.

ADD COMMENT
3
Entering edit mode
10.2 years ago

This can be done with Biopieces like this:

read_fasta -i lotsofseqs.fasta | random_records -n 10 | write_fasta -o sample.fasta -x
ADD COMMENT
0
Entering edit mode

will definitely install this later, after installing all the prereqs, external tools, etc.

ADD REPLY
3
Entering edit mode
10.2 years ago

Here is something quick and dirty in python:

import sys,random
from Bio import SeqIO

num = int(sys.argv[2])

outArray = [record for record in SeqIO.parse(open(sys.argv[1],'r'),'fasta') if random.randint(0,1)]
while len(outArray)!= num:
    newArray = []
    for record in outArray:
        if len(newArray) == num:
            break
        if random.randint(0,1):
            newArray.append(record)
    outArray = newArray

for record in outArray:
    print ">" + record.id
    print record.seq

Use by:

python myScript.py in.fasta 10 > subset10.fa

It basically iteratively randomly takes a subset (around half) until it gets to the number of sequences you want. So if you have 1000 sequences, it will take randomly around 500 of that, then 250 of that, 125....so on the so forth until it gets to the number if you want.

This script will probably only work well if you have a large amount of sequences (1000+) and the subset you want is well below half of the total amount.

ADD COMMENT
0
Entering edit mode

This is not a random subsampling. It systematically is biased to sequences at the beginning of the file.

ADD REPLY
2
Entering edit mode
10.2 years ago
Malcolm.Cook ★ 1.3k

A little late to the fray, but anyway...

Here's a solution for sampling fasta without replacement.

It uses bash functions, and GNU coreutils' shuf command, but does not suffer the memory requirements of keeping the sequences in memory since it shuffles only the byte offsets into the file where the sequences begin.

function fastaSubsample () {
    ## PURPOSE: choose [HTML] random records from a (presumably
    ## well-formed) fasta file in [HTML].  The output sequences
    ## are in same order as input (due to the `sort -n`)
    n="$1"
    path="$2"
    shift 2
    fastaBreaks ${path} | shuf -n ${n} | sort -n | fastaRecordsAt ${path}
}

function fastaBreaks () {
    ## PURPOSE: output the byte positions of each fasta record on STDIN
    perl -ne 'BEGIN{$told=0}; print($told.qq{\n}) if m/^>/;$told=tell();' "$@"
}

function fastaRecordsAt () {
    ## PURPOSE: extract the fasta records from# [HTML] starting at
    ## each byte offset appearing on STDIN.
    path="$1"
    shift 1
    perl -anse 'BEGIN{open(PATH,$path)}; chomp;  $told = $_; seek(PATH,$told,0); print(scalar(readline(PATH))); while (($l=readline(PATH)) && ($l !~ m/^>/)) {print($l)} ' -- -path=$path
}
ADD COMMENT
2
Entering edit mode
9.9 years ago
jmun7616 ▴ 20

Here is another Python solution, useful for processing multiple file simultaneously (must be in same dir):

import random
import glob
inFiles = glob.glob('*.fas')
outFiles = []
num = int(raw_input("Enter number of random sequences to select:\n"))


for i in range(len(inFiles)):

        for k in range(3):
                fNames = []
                fSeqs = []
                outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'-'+str(k+1)+'.fasta', 'wt'))
                for line in open(inFiles[i]):
                        if (line[0] == '>'):
                                fNames.append(line)
                        else:
                                fSeqs.append(line)
                curr = (len(outFiles)-1)
                for j in range(num):
                        a = random.randint(0, (len(fNames)-1))
                        outFiles[curr].write(fNames.pop(a))
                        outFiles[curr].write(fSeqs.pop(a))
raw_input("Done.")

Basically you just enter how large the subsample should be, and it will give you three random subsamplings (without replacement). Of course if you only need one then only use one. If the file has less sequences than request this script will die.

I found this useful in a 16S tagged pyrosequencing project I've been working on.

ADD COMMENT
2
Entering edit mode
8.2 years ago

I offer one solution here to sample without replacement, which does not require any subsampling and uses roughly 24 bytes per FASTA sequence.

The tradeoff is that one has to read through the file twice, the first time to read in byte offsets and a second and subsequent time for each random sample. But it isn't necessary to read in any of the FASTA set into memory, just line offsets.

ADD COMMENT
1
1
Entering edit mode
3.5 years ago
mdtoney ▴ 20
import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# Use: python   scriptname.py   number_of_random_seq   infile.fasta   outfile.fasta

infile = sys.argv[2]                                #Name of the input file
seq = list(SeqIO.parse(infile,"fasta"))             #Create a list with all the sequence records
print "Input fasta file = ", infile

totseq = len(seq)                                   #Total number of sequences in the input file
print "Number of sequences in the original file = ", len(seq)

randseq = int(sys.argv[1])                          #Number of random sequences desired
print "Number of random sequences desired = ", randseq

if randseq > totseq:
  print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
  exit()

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

outrandseq = []
outlist = []
print "Randomly chosen output sequences:"

for i in range(randseq):
  choose = random.randint(1,totseq-1)               #Choose a random sequence record number
  for j in range(len(outrandseq)):                  #Test to see if the random sequence record number has already been chosen
    if choose == outrandseq[j]:
      choose = random.randint(1,totseq-1)           #Choose a new random sequence record number if the current one has already been chosen
  outrandseq.append(choose)
  print choose
  outseq = seq[choose]
  outlist.append(outseq)                            #Append seq record to output list

SeqIO.write(outlist, outfile, "fasta")              #Write the output list to the outfile

exit()
ADD COMMENT
0
Entering edit mode

OK, finally figured out how to make it look right.

ADD REPLY
0
Entering edit mode

I'm a part-time hack, so this can probably be done much better.

ADD REPLY
1
Entering edit mode
3.5 years ago
mdtoney ▴ 20
import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# I use this from small numbers of sequences (input file up to 10000 sequences) and it works fine. 
# For very large sequence sets it may be too slow -- I just have not tried.

# Use: python   scriptname.py   number_of_random_seq   infile.fasta   outfile.fasta

infile = sys.argv[2]                                #Name of the input file
seq = list(SeqIO.parse(infile,"fasta"))             #Create a list with all the sequence records
print "Input fasta file = ", infile

totseq = len(seq)                                   #Total number of sequences in the input file
print "Number of sequences in the original file = ", totseq

numrandseq = int(sys.argv[1])                       #Number of random sequences desired
print "Number of random sequences desired = ", numrandseq

if numrandseq > totseq:
  print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
  exit()

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

outrandseqset = []
i = 1
for i in range(numrandseq):                         #Create a list of random sequence record numbers for output
  choice = random.randint(1,totseq)
  outrandseqset.append(choice)

i = 1
j = 1
duplicate = 1
while duplicate:                                    #Make sure no sequences are duplicated in the list
    duplicate = 0
    for i in range(numrandseq):
      for j in range(i+1, numrandseq):
        if outrandseqset[i] == outrandseqset[j]:
            outrandseqset[j] = random.randint(1,totseq)
            duplicate = 1


i = 1
print "Randomly chosen output sequences:"
for i in range(numrandseq):
  print outrandseqset[i]

outlist = []
i = 1
for i in range(numrandseq):                         #Create the list of seq records to be written to the output file
  seqnum = outrandseqset[i]
  outseq = seq[seqnum]
  outlist.append(outseq)

SeqIO.write(outlist, outfile, "fasta")              #Write the output list to the outfile

exit()
ADD COMMENT
0
Entering edit mode

New version of my script with loop to make sure there are no duplicates in the output file.

ADD REPLY
0
Entering edit mode

Yea, I'm sure there is some naive/unnecessary stuff in here, but I don't want to hear your pedantics.

ADD REPLY
0
Entering edit mode

but I don't want to hear your pedantics

That is not the attitude of a learner or of someone whose contribution is welcome on open science forums.

ADD REPLY

Login before adding your answer.

Traffic: 1536 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6