Choosing Random Set Of Seqs From Larger Set
12
13
Entering edit mode
10.2 years ago
razor ▴ 190

Do you know any quick and dirty solution for choosing a random subset of sequences (with or without replacement) from a larger set in a fasta file?

Something like:

command -size 10 -in lotsofseqs.fasta -out sample.fasta

random fasta • 13k views
23
Entering edit mode
10.2 years ago

linearize your sequences, generate a random order with shuf and print the 'N' first lines.

cat input.fasta |\
awk '/^>/ { if(i>0) printf("\n"); i++; printf("%s\t",$0); next;} {printf("%s",$0);} END { printf("\n");}' |\
shuf |\
awk '{printf("%s\n%s\n",$1,$2)}'

4
Entering edit mode

I know someone is going to give this shuf solution. :) A problem with it is that shuf loads all the data into memory, which makes it not suitable for very large files. If we only downsample a small file from large one, my solution is much better. In addition, if the input parameter is a fraction rather than a fixed number, hashing will be even better.

0
Entering edit mode

shuf uses reservoir sampling (bounded memory) since v8.22

1
Entering edit mode

shuf is not native to Macs.

1
Entering edit mode

maasha, if you want to install them, and use brew, then brew install xz coreutils will get you shuf (defined as gshuf)

1
Entering edit mode

I used it for one of my datasets and I've found that to make it more kind of "accurate" it would be better to add BEGIN{FS="\t"} in the last awk:

...

head -n 2 |\
awk 'BEGIN{FS="\t"}{printf("%s\n%s\n",$1,$2)}'


That way if the fasta header is longer, with at least one space, like:

>MC3.MG1.AS1.GP1.C1.G2 OG=COG1564|Taxon=bacteria|StartStop=11|ClusterSize=83
ATGAAAACAGAAAGCAAATATTGTCTCATTGTCACAGGTGGCTCGGTCGATCAGAATTTGCTCGCGACCGTTTATGA


after shuffling, it won't become:

>MC3.MG80.AS1.GP1.C14688.G3
OG=NOG80451|Taxon=bacteria|StartStop=11|ClusterSize=2
>MC3.MG235.AS1.GP1.C49859.G6...


Hope it helps

0
Entering edit mode

that's it! no dependencies, bio packages, just basic commands. (somehow i never used shuf before.)

0
Entering edit mode

In place of shuf, you can do this on a Mac:

perl -MList::Util -e 'print List::Util::shuffle <>'

0
Entering edit mode

A slight variation of the linearization with awk, explained in more detail: https://riptutorial.com/bioinformatics/example/14692/linearize-a-fasta-sequence-with-awk

DISCLAIMER: This code and its explanation is elaborated by someone else. I just add it as supplementary information to further understand the best answer provided to the question

7
Entering edit mode
10.2 years ago
lh3 33k

You need bioawk for it to work. With bioawk, you can:

awk -c fastx -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq.fa.gz


This solution is based on reservoir sampling. It is optimal in terms of both time and space complexity and works with an arbitrarily large input file. Most other solutions need to read all data into memory.

Of course you can also use perl/python to re-implement the awk line. I use bioawk only because it has a built-in fasta/fastq parser.

1
Entering edit mode

This is without replacement.

1
Entering edit mode

No, this is not true. Reservoir sampling may seem too simplistic, but it guarantees uniform randomness. You can find the proof via google. It is theoretically the best solution for sampling a fixed number of records without replacement. You even do not need to know the size of reservior.

1
Entering edit mode

The first time I read about reservoir sampling is when I saw that seqanswers post. By intuition, I think there must be a solution working on a data stream - those who work on endless time series must be faced with such a situation all the time. Nonetheless, I did not know the answer, so I googled and found reservior sampling.

0
Entering edit mode

The thing about reservoir sampling is that there is 'decreasing probability', once the k first elements have been filled, that a new sequence will be retained, thus making sequences at the end of the file less likely to be taken and sequences from the beginning of the file more likely to be replaced. How close to random selection will the resulting list of sequences really be? I provide a solution that works with gigantic files, at the cost of having to scan the file twice and doing some extra trimming but ensuring that everybody gets equal chances.

0
Entering edit mode

I'll look it up, it is a very interesting solution.

0
Entering edit mode

I'm curious, did you learn about reservoir sampling in a book about algorithms or when searching for a solution to a specific problem you had? (Just trying to decide whether to spend time to read through a book on algo ;)

6
Entering edit mode
10.2 years ago
Neilfws 49k

This page has lots of useful sequence utilities, including a script named random_sequence_reads.pl (zip or tarball to download). It does sampling with/without replacement and seems to work, based on my quick test.

Note: the test.sh script samples all sequences from the input file which is confusing and not useful - edit it to change the parameters.

3
Entering edit mode
10.2 years ago

This can be done with Biopieces like this:

read_fasta -i lotsofseqs.fasta | random_records -n 10 | write_fasta -o sample.fasta -x

0
Entering edit mode

will definitely install this later, after installing all the prereqs, external tools, etc.

3
Entering edit mode
10.2 years ago

Here is something quick and dirty in python:

import sys,random
from Bio import SeqIO

num = int(sys.argv[2])

outArray = [record for record in SeqIO.parse(open(sys.argv[1],'r'),'fasta') if random.randint(0,1)]
while len(outArray)!= num:
newArray = []
for record in outArray:
if len(newArray) == num:
break
if random.randint(0,1):
newArray.append(record)
outArray = newArray

for record in outArray:
print ">" + record.id
print record.seq


Use by:

python myScript.py in.fasta 10 > subset10.fa


It basically iteratively randomly takes a subset (around half) until it gets to the number of sequences you want. So if you have 1000 sequences, it will take randomly around 500 of that, then 250 of that, 125....so on the so forth until it gets to the number if you want.

This script will probably only work well if you have a large amount of sequences (1000+) and the subset you want is well below half of the total amount.

0
Entering edit mode

This is not a random subsampling. It systematically is biased to sequences at the beginning of the file.

2
Entering edit mode
10.2 years ago
Malcolm.Cook ★ 1.3k

A little late to the fray, but anyway...

Here's a solution for sampling fasta without replacement.

It uses bash functions, and GNU coreutils' shuf command, but does not suffer the memory requirements of keeping the sequences in memory since it shuffles only the byte offsets into the file where the sequences begin.

function fastaSubsample () {
## PURPOSE: choose [HTML] random records from a (presumably
## well-formed) fasta file in [HTML].  The output sequences
## are in same order as input (due to the sort -n)
n="$1" path="$2"
shift 2
fastaBreaks ${path} | shuf -n${n} | sort -n | fastaRecordsAt ${path} } function fastaBreaks () { ## PURPOSE: output the byte positions of each fasta record on STDIN perl -ne 'BEGIN{$told=0}; print($told.qq{\n}) if m/^>/;$told=tell();' "$@" } function fastaRecordsAt () { ## PURPOSE: extract the fasta records from# [HTML] starting at ## each byte offset appearing on STDIN. path="$1"
shift 1
perl -anse 'BEGIN{open(PATH,$path)}; chomp;$told = $_; seek(PATH,$told,0); print(scalar(readline(PATH))); while (($l=readline(PATH)) && ($l !~ m/^>/)) {print($l)} ' -- -path=$path
}

2
Entering edit mode
9.9 years ago
jmun7616 ▴ 20

Here is another Python solution, useful for processing multiple file simultaneously (must be in same dir):

import random
import glob
inFiles = glob.glob('*.fas')
outFiles = []
num = int(raw_input("Enter number of random sequences to select:\n"))

for i in range(len(inFiles)):

for k in range(3):
fNames = []
fSeqs = []
outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'-'+str(k+1)+'.fasta', 'wt'))
for line in open(inFiles[i]):
if (line[0] == '>'):
fNames.append(line)
else:
fSeqs.append(line)
curr = (len(outFiles)-1)
for j in range(num):
a = random.randint(0, (len(fNames)-1))
outFiles[curr].write(fNames.pop(a))
outFiles[curr].write(fSeqs.pop(a))
raw_input("Done.")


Basically you just enter how large the subsample should be, and it will give you three random subsamplings (without replacement). Of course if you only need one then only use one. If the file has less sequences than request this script will die.

I found this useful in a 16S tagged pyrosequencing project I've been working on.

2
Entering edit mode
8.2 years ago

I offer one solution here to sample without replacement, which does not require any subsampling and uses roughly 24 bytes per FASTA sequence.

The tradeoff is that one has to read through the file twice, the first time to read in byte offsets and a second and subsequent time for each random sample. But it isn't necessary to read in any of the FASTA set into memory, just line offsets.

1
Entering edit mode
8.1 years ago

You can use reservoir sampling which is O(N) and does it in one pass (quite handy for millions of Illumina paired-end reads). I have written one-liners to do that in awk:

1
Entering edit mode
3.5 years ago
mdtoney ▴ 20
import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# Use: python   scriptname.py   number_of_random_seq   infile.fasta   outfile.fasta

infile = sys.argv[2]                                #Name of the input file
seq = list(SeqIO.parse(infile,"fasta"))             #Create a list with all the sequence records
print "Input fasta file = ", infile

totseq = len(seq)                                   #Total number of sequences in the input file
print "Number of sequences in the original file = ", len(seq)

randseq = int(sys.argv[1])                          #Number of random sequences desired
print "Number of random sequences desired = ", randseq

if randseq > totseq:
print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
exit()

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

outrandseq = []
outlist = []
print "Randomly chosen output sequences:"

for i in range(randseq):
choose = random.randint(1,totseq-1)               #Choose a random sequence record number
for j in range(len(outrandseq)):                  #Test to see if the random sequence record number has already been chosen
if choose == outrandseq[j]:
choose = random.randint(1,totseq-1)           #Choose a new random sequence record number if the current one has already been chosen
outrandseq.append(choose)
print choose
outseq = seq[choose]
outlist.append(outseq)                            #Append seq record to output list

SeqIO.write(outlist, outfile, "fasta")              #Write the output list to the outfile

exit()

0
Entering edit mode

OK, finally figured out how to make it look right.

0
Entering edit mode

I'm a part-time hack, so this can probably be done much better.

1
Entering edit mode
3.5 years ago
mdtoney ▴ 20
import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# I use this from small numbers of sequences (input file up to 10000 sequences) and it works fine.
# For very large sequence sets it may be too slow -- I just have not tried.

# Use: python   scriptname.py   number_of_random_seq   infile.fasta   outfile.fasta

infile = sys.argv[2]                                #Name of the input file
seq = list(SeqIO.parse(infile,"fasta"))             #Create a list with all the sequence records
print "Input fasta file = ", infile

totseq = len(seq)                                   #Total number of sequences in the input file
print "Number of sequences in the original file = ", totseq

numrandseq = int(sys.argv[1])                       #Number of random sequences desired
print "Number of random sequences desired = ", numrandseq

if numrandseq > totseq:
print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
exit()

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

outrandseqset = []
i = 1
for i in range(numrandseq):                         #Create a list of random sequence record numbers for output
choice = random.randint(1,totseq)
outrandseqset.append(choice)

i = 1
j = 1
duplicate = 1
while duplicate:                                    #Make sure no sequences are duplicated in the list
duplicate = 0
for i in range(numrandseq):
for j in range(i+1, numrandseq):
if outrandseqset[i] == outrandseqset[j]:
outrandseqset[j] = random.randint(1,totseq)
duplicate = 1

i = 1
print "Randomly chosen output sequences:"
for i in range(numrandseq):
print outrandseqset[i]

outlist = []
i = 1
for i in range(numrandseq):                         #Create the list of seq records to be written to the output file
seqnum = outrandseqset[i]
outseq = seq[seqnum]
outlist.append(outseq)

SeqIO.write(outlist, outfile, "fasta")              #Write the output list to the outfile

exit()

0
Entering edit mode

New version of my script with loop to make sure there are no duplicates in the output file.

0
Entering edit mode

Yea, I'm sure there is some naive/unnecessary stuff in here, but I don't want to hear your pedantics.

0
Entering edit mode

but I don't want to hear your pedantics

That is not the attitude of a learner or of someone whose contribution is welcome on open science forums.