Question: Choosing Random Set Of Seqs From Larger Set
7
gravatar for razor
6.7 years ago by
razor160
Barcelona
razor160 wrote:

Do you know any quick and dirty solution for choosing a random subset of sequences (with or without replacement) from a larger set in a fasta file?

Something like: command -size 10 -in lotsofseqs.fasta -out sample.fasta

fasta random • 8.5k views
ADD COMMENTlink modified 25 days ago by mdtoney0 • written 6.7 years ago by razor160
18
gravatar for Pierre Lindenbaum
6.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum115k wrote:

linearize your sequences, generate a random order with shuf and print the 'N' first lines.

cat input.fasta |\
awk '/^>/ { if(i>0) printf("\n"); i++; printf("%s\t",$0); next;} {printf("%s",$0);} END { printf("\n");}' |\
shuf |\
head -n 2 |\
awk '{printf("%s\n%s\n",$1,$2)}'
ADD COMMENTlink written 6.7 years ago by Pierre Lindenbaum115k
4

I know someone is going to give this shuf solution. :) A problem with it is that shuf loads all the data into memory, which makes it not suitable for very large files. If we only downsample a small file from large one, my solution is much better. In addition, if the input parameter is a fraction rather than a fixed number, hashing will be even better.

ADD REPLYlink written 6.7 years ago by lh331k

shuf uses reservoir sampling (bounded memory) since v8.22

ADD REPLYlink written 4.7 years ago by pixelbeat0
1

shuf is not native to Macs.

ADD REPLYlink written 6.7 years ago by Martin A Hansen3.0k
1

maasha, if you want to install them, and use brew, then brew install xz coreutils will get you shuf (defined as gshuf)

ADD REPLYlink written 6.7 years ago by Malcolm.Cook900
1

 I used it for one of my datasets and I've found that to make it more kind of "accurate" it would be better to add "BEGIN{FS="\t"}" in the last awk:

...

head -n 2 |\
awk 'BEGIN{FS="\t"}{printf("%s\n%s\n",$1,$2)}'

That way if the fasta header is longer, with at least one space, like:

>MC3.MG1.AS1.GP1.C1.G2 OG=COG1564|Taxon=bacteria|StartStop=11|ClusterSize=83
ATGAAAACAGAAAGCAAATATTGTCTCATTGTCACAGGTGGCTCGGTCGATCAGAATTTGCTCGCGACCGTTTATGA

after shuffling, it won't become:

>MC3.MG80.AS1.GP1.C14688.G3
OG=NOG80451|Taxon=bacteria|StartStop=11|ClusterSize=2
>MC3.MG235.AS1.GP1.C49859.G6...

Hope it helps

 

ADD REPLYlink written 4.5 years ago by xmarti620

that's it! no dependencies, bio packages, just basic commands. (somehow i never used shuf before.)

ADD REPLYlink written 6.7 years ago by razor160

In place of shuf, you can do this on a Mac:

perl -MList::Util -e 'print List::Util::shuffle <>'

ADD REPLYlink written 6.7 years ago by SES8.1k
7
gravatar for lh3
6.7 years ago by
lh331k
United States
lh331k wrote:

I used to provide an answer at seqanswers. You need bioawk for it to work. With bioawk, you can:

awk -c fastx -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq.fa.gz

This solution is based on reservoir sampling. It is optimal in terms of both time and space complexity and works with an arbitrarily large input file. Most other solutions need to read all data into memory.

Of course you can also use perl/python to re-implement the awk line. I use bioawk only because it has a built-in fasta/fastq parser.

ADD COMMENTlink written 6.7 years ago by lh331k
1

This is without replacement.

ADD REPLYlink written 6.7 years ago by lh331k
1

No, this is not true. Reservoir sampling may seem too simplistic, but it guarantees uniform randomness. You can find the proof via google. It is theoretically the best solution for sampling a fixed number of records without replacement. You even do not need to know the size of reservior.

ADD REPLYlink written 6.7 years ago by lh331k
1

The first time I read about reservoir sampling is when I saw that seqanswers post. By intuition, I think there must be a solution working on a data stream - those who work on endless time series must be faced with such a situation all the time. Nonetheless, I did not know the answer, so I googled and found reservior sampling.

ADD REPLYlink written 6.7 years ago by lh331k

The thing about reservoir sampling is that there is 'decreasing probability', once the k first elements have been filled, that a new sequence will be retained, thus making sequences at the end of the file less likely to be taken and sequences from the beginning of the file more likely to be replaced. How close to random selection will the resulting list of sequences really be? I provide a solution that works with gigantic files, at the cost of having to scan the file twice and doing some extra trimming but ensuring that everybody gets equal chances.

ADD REPLYlink written 6.7 years ago by Eric Normandeau10k

I'll look it up, it is a very interesting solution.

ADD REPLYlink written 6.7 years ago by Eric Normandeau10k

I'm curious, did you learn about reservoir sampling in a book about algorithms or when searching for a solution to a specific problem you had? (Just trying to decide whether to spend time to read through a book on algo ;)

ADD REPLYlink written 6.7 years ago by Eric Normandeau10k
5
gravatar for Neilfws
6.7 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

This page has lots of useful sequence utilities, including a script named random_sequence_reads.pl (zip or tarball to download). It does sampling with/without replacement and seems to work, based on my quick test.

Note: the test.sh script samples all sequences from the input file which is confusing and not useful - edit it to change the parameters.

ADD COMMENTlink written 6.7 years ago by Neilfws48k
3
gravatar for Martin A Hansen
6.7 years ago by
Martin A Hansen3.0k
Denmark
Martin A Hansen3.0k wrote:

This can be done with Biopieces like this:

read_fasta -i lotsofseqs.fasta | random_records -n 10 | write_fasta -o sample.fasta -x
ADD COMMENTlink written 6.7 years ago by Martin A Hansen3.0k

will definitely install this later, after installing all the prereqs, external tools, etc.

ADD REPLYlink written 6.7 years ago by razor160
2
gravatar for Damian Kao
6.7 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

Here is something quick and dirty in python:

import sys,random
from Bio import SeqIO

num = int(sys.argv[2])

outArray = [record for record in SeqIO.parse(open(sys.argv[1],'r'),'fasta') if random.randint(0,1)]
while len(outArray)!= num:
    newArray = []
    for record in outArray:
        if len(newArray) == num:
            break
        if random.randint(0,1):
            newArray.append(record)
    outArray = newArray

for record in outArray:
    print ">" + record.id
    print record.seq

Use by: python myScript.py in.fasta 10 > subset10.fa

It basically iteratively randomly takes a subset (around half) until it gets to the number of sequences you want. So if you have 1000 sequences, it will take randomly around 500 of that, then 250 of that, 125....so on the so forth until it gets to the number if you want.

This script will probably only work well if you have a large amount of sequences (1000+) and the subset you want is well below half of the total amount.

ADD COMMENTlink written 6.7 years ago by Damian Kao15k

This is not a random subsampling. It systematically is biased to sequences at the beginning of the file.

ADD REPLYlink written 5.5 years ago by xapple230
2
gravatar for Malcolm.Cook
6.7 years ago by
Malcolm.Cook900
kansas, usa
Malcolm.Cook900 wrote:

A little late to the fray, but anyway...

Here's a solution for sampling fasta without replacement.

It uses bash functions, and GNU coreutils' shuf command, but does not suffer the memory requirements of keeping the sequences in memory since it shuffles only the byte offsets into the file where the sequences begin.

function fastaSubsample () {
    ## PURPOSE: choose [HTML] random records from a (presumably
    ## well-formed) fasta file in [HTML].  The output sequences
    ## are in same order as input (due to the `sort -n`)
    n="$1"
    path="$2"
    shift 2
    fastaBreaks ${path} | shuf -n ${n} | sort -n | fastaRecordsAt ${path}
}

function fastaBreaks () {
    ## PURPOSE: output the byte positions of each fasta record on STDIN
    perl -ne 'BEGIN{$told=0}; print($told.qq{\n}) if m/^>/;$told=tell();' "$@"
}

function fastaRecordsAt () {
    ## PURPOSE: extract the fasta records from# [HTML] starting at
    ## each byte offset appearing on STDIN.
    path="$1"
    shift 1
    perl -anse 'BEGIN{open(PATH,$path)}; chomp;  $told = $_; seek(PATH,$told,0); print(scalar(readline(PATH))); while (($l=readline(PATH)) && ($l !~ m/^>/)) {print($l)} ' -- -path=$path
}
ADD COMMENTlink written 6.7 years ago by Malcolm.Cook900
1
gravatar for jmun7616
6.5 years ago by
jmun761610
jmun761610 wrote:

Here is another Python solution, useful for processing multiple file simultaneously (must be in same dir):

import random
import glob
inFiles = glob.glob('*.fas')
outFiles = []
num = int(raw_input("Enter number of random sequences to select:\n"))


for i in range(len(inFiles)):

        for k in range(3):
                fNames = []
                fSeqs = []
                outFiles.append(file(str(inFiles[i])+'_'+'Rand_'+str(num)+'-'+str(k+1)+'.fasta', 'wt'))
                for line in open(inFiles[i]):
                        if (line[0] == '>'):
                                fNames.append(line)
                        else:
                                fSeqs.append(line)
                curr = (len(outFiles)-1)
                for j in range(num):
                        a = random.randint(0, (len(fNames)-1))
                        outFiles[curr].write(fNames.pop(a))
                        outFiles[curr].write(fSeqs.pop(a))
raw_input("Done.")

Basically you just enter how large the subsample should be, and it will give you three random subsamplings (without replacement). Of course if you only need one then only use one. If the file has less sequences than request this script will die. I found this useful in a 16S tagged pyrosequencing project I've been working on.

ADD COMMENTlink modified 6.5 years ago • written 6.5 years ago by jmun761610
1
gravatar for Alex Reynolds
4.7 years ago by
Alex Reynolds26k
Seattle, WA USA
Alex Reynolds26k wrote:

I offer one solution Extracting Random Sequences From A Large Fasta File Using Bioperl to sample without replacement, which does not require any subsampling and uses roughly 24 bytes per FASTA sequence.

The tradeoff is that one has to read through the file at least twice, the first time to read in byte offsets and a second and subsequent time for each random sample. But it isn't necessary to read in any of the FASTA set into memory.

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by Alex Reynolds26k
1
0
gravatar for mdtoney
26 days ago by
mdtoney0
mdtoney0 wrote:

Here is some python I wrote that I have used with good success:

import sys,random from Bio import SeqIO from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.Alphabet import generic_protein

Use: python scriptname.py number_of_random_seq infile.fasta outfile.fasta

infile = sys.argv[2] #Name of the input file seq = list(SeqIO.parse(infile,"fasta")) #Create a list with all the sequence records print "Input fasta file = ", infile

totseq = len(seq) #Total number of sequences in the input file print "Number of sequences in the original file = ", len(seq)

randseq = int(sys.argv[1]) #Number of random sequences desired print "Number of random sequences desired = ", randseq

if randseq > totseq: print "The requested number of random sequences is greater that the total number of input sequences. Exiting." exit()

outfile = sys.argv[3] #Name of the output file print "Output fasta file = ", outfile

outrandseq = [] outlist = [] print "Randomly chosen output sequences:"

for i in range(randseq): choose = random.randint(1,totseq-1) #Choose a random sequence record number for j in range(len(outrandseq)): #Test to see if the random sequence record number has already been chosen if choose == outrandseq[j]: choose = random.randint(1,totseq-1) #Choose a new random sequence record number if the current one has already been chosen outrandseq.append(choose) print choose outseq = seq[choose] outlist.append(outseq) #Append seq record to output list

SeqIO.write(outlist, outfile, "fasta") #Write the output list to the outfile

exit()

ADD COMMENTlink written 26 days ago by mdtoney0

Wow, that didn't show up correctly!

ADD REPLYlink written 26 days ago by mdtoney0
0
gravatar for mdtoney
26 days ago by
mdtoney0
mdtoney0 wrote:
import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# Use: python   scriptname.py   number_of_random_seq   infile.fasta   outfile.fasta

infile = sys.argv[2]                                #Name of the input file
seq = list(SeqIO.parse(infile,"fasta"))             #Create a list with all the sequence records
print "Input fasta file = ", infile

totseq = len(seq)                                   #Total number of sequences in the input file
print "Number of sequences in the original file = ", len(seq)

randseq = int(sys.argv[1])                          #Number of random sequences desired
print "Number of random sequences desired = ", randseq

if randseq > totseq:
  print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
  exit()

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

outrandseq = []
outlist = []
print "Randomly chosen output sequences:"

for i in range(randseq):
  choose = random.randint(1,totseq-1)               #Choose a random sequence record number
  for j in range(len(outrandseq)):                  #Test to see if the random sequence record number has already been chosen
    if choose == outrandseq[j]:
      choose = random.randint(1,totseq-1)           #Choose a new random sequence record number if the current one has already been chosen
  outrandseq.append(choose)
  print choose
  outseq = seq[choose]
  outlist.append(outseq)                            #Append seq record to output list

SeqIO.write(outlist, outfile, "fasta")              #Write the output list to the outfile

exit()
ADD COMMENTlink written 26 days ago by mdtoney0

OK, finally figured out how to make it look right.

ADD REPLYlink written 26 days ago by mdtoney0

I'm a part-time hack, so this can probably be done much better.

ADD REPLYlink written 26 days ago by mdtoney0
0
gravatar for mdtoney
25 days ago by
mdtoney0
mdtoney0 wrote:
import sys,random
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

# I use this from small numbers of sequences (input file up to 10000 sequences) and it works fine. 
# For very large sequence sets it may be too slow -- I just have not tried.

# Use: python   scriptname.py   number_of_random_seq   infile.fasta   outfile.fasta

infile = sys.argv[2]                                #Name of the input file
seq = list(SeqIO.parse(infile,"fasta"))             #Create a list with all the sequence records
print "Input fasta file = ", infile

totseq = len(seq)                                   #Total number of sequences in the input file
print "Number of sequences in the original file = ", totseq

numrandseq = int(sys.argv[1])                       #Number of random sequences desired
print "Number of random sequences desired = ", numrandseq

if numrandseq > totseq:
  print "The requested number of random sequences is greater that the total number of input sequences. Exiting."
  exit()

outfile = sys.argv[3]                               #Name of the output file
print "Output fasta file = ", outfile

outrandseqset = []
i = 1
for i in range(numrandseq):                         #Create a list of random sequence record numbers for output
  choice = random.randint(1,totseq)
  outrandseqset.append(choice)

i = 1
j = 1
duplicate = 1
while duplicate:                                    #Make sure no sequences are duplicated in the list
    duplicate = 0
    for i in range(numrandseq):
      for j in range(i+1, numrandseq):
        if outrandseqset[i] == outrandseqset[j]:
            outrandseqset[j] = random.randint(1,totseq)
            duplicate = 1


i = 1
print "Randomly chosen output sequences:"
for i in range(numrandseq):
  print outrandseqset[i]

outlist = []
i = 1
for i in range(numrandseq):                         #Create the list of seq records to be written to the output file
  seqnum = outrandseqset[i]
  outseq = seq[seqnum]
  outlist.append(outseq)

SeqIO.write(outlist, outfile, "fasta")              #Write the output list to the outfile

exit()
ADD COMMENTlink written 25 days ago by mdtoney0

New version of my script with loop to make sure there are no duplicates in the output file.

ADD REPLYlink written 25 days ago by mdtoney0

Yea, I'm sure there is some naive/unnecessary stuff in here, but I don't want to hear your pedantics.

ADD REPLYlink written 25 days ago by mdtoney0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1253 users visited in the last hour