Resampling Fastq Sequences Without Replacement
11.1 years ago
prtimilsina ▴ 80

Hello, I want to extract a random sample (without replacement) of 7.5 million fastq sequences from Illumina paired-end data that contains about 30 million sequences in each of the two read files. I want to extract the same sequences from both files (e.g., if I extracted sequence no. 31 from read 1, I would want to extract sequence no. 31 from read 2 as well). How can I do this? Is there a script or module I can use? Any help would be appreciated. Thanks

fastq paired-end
11.1 years ago
lh3 33k

The general technique for sampling exactly k elements without replacement from a data stream is reservoir sampling. It is actually very simple. Here is an awk one-liner to randomly choose k lines from a text file:

awk -v k=10 '{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=$0}END{for(z in a)print a[z]}' file.txt

It keeps at most k lines in memory rather than the entire file. For fasta, you can use bioawk as the parser:

bioawk -c fastx -v k=10 'BEGIN{srand(100)}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq1.fa.gz > seq1-sub.fa
bioawk -c fastx -v k=10 'BEGIN{srand(100)}{y=x++<k?x-1:int(rand()*x);if(y<k)a[y]=">"$name"\n"$seq}END{for(z in a)print a[z]}' seq2.fa.gz > seq2-sub.fa

Remember to use an identical random seed (the srand() value above) so that the sampled reads remain paired. It is also possible to parse fasta/q with bare awk instead of relying on bioawk.
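To illustrate the idea, here is a minimal Python sketch of reservoir sampling (not lh3's code; FASTQ parsing is omitted, and each element stands for one whole record):

```python
import random

def reservoir_sample(stream, k, seed):
    """Keep exactly k items from a stream of unknown length,
    each with equal probability, holding only k items in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(i + 1)    # pick a slot in [0, i]
            if j < k:
                reservoir[j] = item     # item survives with prob k/(i+1)
    return reservoir
```

Because the random decisions depend only on the position in the stream and the seed, running this over read 1 and read 2 with the same seed selects the same record positions, so mates stay paired.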

A simpler and more efficient solution is to use seqtk, as I answered in the older post:

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq

To sample 7.5 million sequences, you need about 7.5M × 100bp × 2 = 1.5GB of memory, plus space for read names, which should be fine in your case.

When you need to sample even more sequences, you can keep only the record indices in memory and extract the sequences in a second pass, or use the sorting approach from Selecting Random Pairs From Fastq?, which is slower and needs working disk space but does not require much RAM.
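The two-pass idea can be sketched like this (hypothetical helper names; in practice each "record" would be a whole 4-line FASTQ entry):

```python
import random

def choose_indices(n_total, k, seed):
    # Pass 1 has counted n_total records; pick k distinct positions.
    # A set of 7.5M integers is far smaller than the sequences themselves.
    rng = random.Random(seed)
    return set(rng.sample(range(n_total), k))

def extract(records, chosen):
    # Pass 2: stream the file again, emitting only the chosen records.
    for i, rec in enumerate(records):
        if i in chosen:
            yield rec
```

Running both passes on read 1 and read 2 with the same seed and the same n_total picks identical positions in both files.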

EDIT: actually, in your case I would simply sample 25% (= 7.5/30) of the reads. You will not get exactly 7.5 million, but the solution is much simpler and good enough in most cases. It is trivial to do in a variety of ways, or simply with seqtk:

seqtk sample -s100 read1.fq 0.25 > sub1.fq
seqtk sample -s100 read2.fq 0.25 > sub2.fq
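Sampling a fraction amounts to a seeded per-record Bernoulli draw, which a short sketch makes clear (assumed behavior for illustration, not seqtk's actual RNG):

```python
import random

def fraction_sample(records, p, seed):
    # Keep each record independently with probability p; the same seed
    # reproduces the same keep/drop decisions, so read1/read2 stay in sync.
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < p]
```

The output size is only approximately p times the input, which is why this gives "about" rather than exactly 7.5 million reads.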

Thank you so much. I really appreciate that :)


Thanks Heng, I learned about bioawk and seqtk, which I had never heard of before!


OK, I just figured out that you're the seqtk developer...


... and I added the "bio" part in bioawk.

11.1 years ago
Josh Herr 5.8k

In addition to the 'Selecting Random Pairs From Fastq?' question, another question, 'Choosing Random Set Of Seqs From Larger Set', has helped me in the past.


Thank you, Josh.


My pleasure! Best of luck to you!


Thank you Ananta.
