Seqtk Job Killed
1
0
Entering edit mode
10.6 years ago
mad.cichlids ▴ 140

Hi, All

I was trying to use seqtk to randomly pickup 32,000,000 sequences from my sequence pool.

I first combined all of sequences from two samples with :

zcat sub592_1.headcropped.fastq.gz sub592_2.headcropped.fastq.gz > sub592.fastq.gz

then verified that the total number of reads in my file by :

`cat sub592.fastq.gz | echo $((`wc -l`/4))`   #indicate I have  39,965,078

but when I am using seqtk to draw 32million reads from the pool, it did not do the job for me, instead, it just said “killed” at the end, could you help me point to a direction where could it go wrong?

seqtk sample ~/Desktop/rnaseq/trimmed/sub592.fastq.gz 3.2e+7 > s592_32.fastq

Thank you very much for your help and have a great day !

rnaseq • 3.7k views
ADD COMMENT
1
Entering edit mode
10.6 years ago

it was killed because the process ended up using too much memory:

$ ~/src/seqtk/seqtk sample
Usage: seqtk sample [-s seed=11] <in.fa> <frac>|<number>

Warning: Large memory consumption for large <number>.
ADD COMMENT
0
Entering edit mode

Thank you so much, Istvan, is there anyway around it? or just find another machine to do the job. Best, B

ADD REPLY
0
Entering edit mode

there have been some posts on this topic but not all of the answers would work for selecting a large subset.

I think some of the linearization methods as described below should work but each requires a shuffle or a sort so might take a while probably:

Selecting random pairs from fastq?

An easier solution (in case you don't need exactly 32 million only something close to that number) is to go through the file and generate a random number for each record and print it out if larger than a threshold (at the level that you need)

ADD REPLY
0
Entering edit mode

something as simple as this would do:

import sys
from random import random

stream = sys.stdin
write = sys.stdout.write

for id in stream:
    seq = stream.next()
    tmp = stream.next()
    qual = stream.next()

    if random() > 0.5:
        write(id)
        write(seq)
        write(tmp)
        write(qual)

it outputs:

$ wc -l sample1.fq 
  400000 sample1.fq

$ python selector.py < sample1.fq | wc -l
  200076

see how it almost splits at 50%. If you need to split two files the same way (paired end files) just set the random seed to be the same for both files.

ADD REPLY

Login before adding your answer.

Traffic: 1884 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6