Question: Seqtk Job Killed

mad.cichlids wrote:

Hi all,

I was trying to use seqtk to randomly pick 32,000,000 sequences from my sequence pool.

I first combined all of the sequences from the two samples with:

zcat sub592_1.headcropped.fastq.gz sub592_2.headcropped.fastq.gz > sub592.fastq.gz

then verified the total number of reads in the file (plain cat works here because zcat wrote uncompressed data, despite the .gz extension) with:

cat sub592.fastq.gz | echo $((`wc -l`/4))    # indicates 39,965,078 reads

but when I use seqtk to draw 32 million reads from the pool, it does not do the job; instead it just says "Killed" at the end. Could you help point me in a direction as to where it could have gone wrong?

seqtk sample ~/Desktop/rnaseq/trimmed/sub592.fastq.gz 3.2e+7 > s592_32.fastq

Thank you very much for your help and have a great day!

Istvan Albert wrote:

It was killed because the process ended up using too much memory:

$ ~/src/seqtk/seqtk sample
Usage: seqtk sample [-s seed=11] <in.fa> <frac>|<number>

Warning: Large memory consumption for large <number>.
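
Since that warning applies to sampling by <number>, one workaround (a hedged suggestion; the output file name below is made up) is to sample by fraction instead, which should not need to hold the reads in memory. 32,000,000 out of 39,965,078 reads is a fraction of roughly 0.8:

seqtk sample -s11 ~/Desktop/rnaseq/trimmed/sub592.fastq.gz 0.8 > s592_sub.fastq

This keeps each read with probability 0.8, so the output will be close to, but not exactly, 32 million reads.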

Thank you so much, Istvan. Is there any way around it, or should I just find another machine to do the job? Best, B

— mad.cichlids

There have been some posts on this topic, but not all of the answers would work for selecting a large subset.

I think some of the linearization methods described in the post below should work, but each requires a shuffle or a sort, so it might take a while:

Selecting random pairs from fastq?
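
A minimal sketch of one such linearization approach (assuming GNU shuf is available; note that drawing 32 million records this way still holds them in memory, so it is not free either):

cat sub592.fastq.gz | paste - - - - | shuf -n 32000000 | tr '\t' '\n' > s592_32.fastq

Here paste - - - - joins each 4-line FASTQ record onto a single tab-separated line, shuf -n draws 32,000,000 of those lines at random, and tr restores the 4-line layout.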

An easier solution (in case you don't need exactly 32 million, just something close to that number) is to go through the file, generate a random number for each record, and print the record only when the number clears a threshold set at the level you need.

— Istvan Albert

Something as simple as this would do:

import sys
from random import random

stream = sys.stdin
write = sys.stdout.write

# a FASTQ record is exactly four lines: header, sequence,
# '+' separator and quality string
for header in stream:
    seq = next(stream)
    sep = next(stream)
    qual = next(stream)

    # keep the record roughly half the time
    if random() > 0.5:
        write(header)
        write(seq)
        write(sep)
        write(qual)

It outputs:

$ wc -l sample1.fq 
  400000 sample1.fq

$ python selector.py < sample1.fq | wc -l
  200076

Note how it splits at almost exactly 50%. If you need to subsample two files the same way (paired-end files), just set the random seed to the same value for both files.
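
A minimal sketch of that seeding idea (the file names, the 0.8 fraction, and the seed value are made up for illustration; it assumes both files list mates in the same order, as paired-end files normally do):

from random import random, seed

def subsample(in_path, out_path, fraction=0.8):
    # stream a FASTQ file and keep each 4-line record
    # with probability `fraction`
    with open(in_path) as stream, open(out_path, "w") as out:
        for header in stream:
            record = [header, next(stream), next(stream), next(stream)]
            if random() < fraction:
                out.writelines(record)

# re-seeding before each file replays the same sequence of random
# numbers, so the keep/drop decisions match and mates stay paired
seed(42)
subsample("sub592_1.fastq", "s592_1_sub.fastq")
seed(42)
subsample("sub592_2.fastq", "s592_2_sub.fastq")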

— Istvan Albert