Question: Seqtk Job Killed

mad.cichlids wrote:

Hi all,

I was trying to use seqtk to randomly pick 32,000,000 sequences from my sequence pool.

I first combined all of the sequences from the two samples with:

zcat sub592_1.headcropped.fastq.gz sub592_2.headcropped.fastq.gz > sub592.fastq.gz

then verified the total number of reads in the file (plain cat works here because zcat wrote uncompressed data, despite the .gz extension) with:

cat sub592.fastq.gz | echo $((`wc -l`/4))    # indicates 39,965,078 reads

but when I use seqtk to draw 32 million reads from the pool, it does not do the job; instead it just says "Killed" at the end. Could you help point me in a direction as to where it could have gone wrong?

seqtk sample ~/Desktop/rnaseq/trimmed/sub592.fastq.gz 3.2e+7 > s592_32.fastq

Thank you very much for your help and have a great day!

Istvan Albert wrote:

It was killed because the process ended up using too much memory:

$ ~/src/seqtk/seqtk sample
Usage: seqtk sample [-s seed=11] <in.fa> <frac>|<number>

Warning: Large memory consumption for large <number>.
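
Since that warning applies to sampling by <number>, one workaround (a hedged suggestion; the output file name below is made up) is to sample by fraction instead, which should not need to hold the reads in memory. 32,000,000 out of 39,965,078 reads is a fraction of roughly 0.8:

seqtk sample -s11 ~/Desktop/rnaseq/trimmed/sub592.fastq.gz 0.8 > s592_sub.fastq

This keeps each read with probability 0.8, so the output will be close to, but not exactly, 32 million reads.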

Thank you so much, Istvan. Is there any way around it, or should I just find another machine to do the job? Best, B

— mad.cichlids

There have been some posts on this topic, but not all of the answers would work for selecting a large subset.

I think some of the linearization methods described in the post below should work, but each requires a shuffle or a sort, so it might take a while:

Selecting random pairs from fastq?
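
A minimal sketch of one such linearization approach (assuming GNU shuf is available; note that drawing 32 million records this way still holds them in memory, so it is not free either):

cat sub592.fastq.gz | paste - - - - | shuf -n 32000000 | tr '\t' '\n' > s592_32.fastq

Here paste - - - - joins each 4-line FASTQ record onto a single tab-separated line, shuf -n draws 32,000,000 of those lines at random, and tr restores the 4-line layout.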

An easier solution (in case you don't need exactly 32 million, just something close to that number) is to go through the file, generate a random number for each record, and print the record only when the number clears a threshold set at the level you need.

— Istvan Albert

Something as simple as this would do:

import sys
from random import random

stream = sys.stdin
write = sys.stdout.write

# a FASTQ record is exactly four lines: header, sequence,
# '+' separator and quality string
for header in stream:
    seq = next(stream)
    sep = next(stream)
    qual = next(stream)

    # keep the record roughly half the time
    if random() > 0.5:
        write(header)
        write(seq)
        write(sep)
        write(qual)

It outputs:

$ wc -l sample1.fq 
  400000 sample1.fq

$ python selector.py < sample1.fq | wc -l
  200076

Note how it splits at almost exactly 50%. If you need to subsample two files the same way (paired-end files), just set the random seed to the same value for both files.
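
A minimal sketch of that seeding idea (the file names, the 0.8 fraction, and the seed value are made up for illustration; it assumes both files list mates in the same order, as paired-end files normally do):

from random import random, seed

def subsample(in_path, out_path, fraction=0.8):
    # stream a FASTQ file and keep each 4-line record
    # with probability `fraction`
    with open(in_path) as stream, open(out_path, "w") as out:
        for header in stream:
            record = [header, next(stream), next(stream), next(stream)]
            if random() < fraction:
                out.writelines(record)

# re-seeding before each file replays the same sequence of random
# numbers, so the keep/drop decisions match and mates stay paired
seed(42)
subsample("sub592_1.fastq", "s592_1_sub.fastq")
seed(42)
subsample("sub592_2.fastq", "s592_2_sub.fastq")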

— Istvan Albert