Question

How to reduce reads in fastq, bam, bed files in a random manner?

1

Entering edit mode

10.1 years ago

trakhtenberg ▴ 160

Hello, what command could I use to extract every 3rd or every 4th pair of reads from two fastq files corresponding to pair-end reads (file read1 and file read2)? I want to use it for getting smaller raw data files, which would be representative of the larger files, since they would be reduced randomly.
Also, the same question regarding how to randomly reduce bam and bed files, such as accepted_hits.bam and junctions.bed generated by tophat?

Thank you,
Ephraim Trakhtenberg

RNA-Seq • 8.3k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by trakhtenberg ▴ 160

1

Entering edit mode

The approach you mention is not technically random, but this task is a pretty common one. There has been plenty of good discussion of this site about sampling large files, for example, Selecting Random Pairs From Fastq? (and see the 'similar posts' section to the right of that page for more posts on the topic). Hope that helps.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by SES 8.6k

Ram · Answer 1 · 2014-08-20

1

Entering edit mode

10.1 years ago

Alex Reynolds 35k

To uniformly sample a third of your paired reads (without replacement) from a paired fastq file(*), via bash and sample:

$ fq_count=`calc "$(wc -l < all_reads.fastq) / 8"`
$ samples=`calc "floor($fq_count / 3)"`
$ sample --lines-per-offset=8 --sample-size=${samples} all_reads.fastq > random_sample.fastq

(* - A paired fastq file would be one interleaved file that stores paired reads consecutively, one pair after its mate.)

Similarly, to sample a BED file:

$ sample --sample-size=12345 foo.bed > random_sample.bed

Or if you have sufficient memory:

$ shuf foo.bed | head -n12345 > random_sample.bed

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 10.1 years ago by Alex Reynolds 35k

1

Entering edit mode

What is calc? Is that a newer builtin, or something else? Google did not turn up anything so I'm curious.

ADD REPLY • link 10.1 years ago by SES 8.6k

1

Entering edit mode

The calc tool is not a builtin, but I am using bash notation for shell commands. Source is available here, though it is probably available in convenient package form with your favorite package manager.

ADD REPLY • link 10.1 years ago by Alex Reynolds 35k

1

Entering edit mode

A minor point, recent versions of shuf will do memory-efficient reservoir sampling with the -n argument, so you don't need to use head.

ADD REPLY • link 10.1 years ago by matted 7.8k

1

Entering edit mode

I think expr $fq_count / 3 will give you the floor by default, so you could replace all "calc" with "expr" in this case.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Matt Shirley 10k

Ram · Answer 2 · 2014-08-21

1

Entering edit mode

10.1 years ago

Cytosine ▴ 460

I believe a truly random approach is already implemented in samtools for bam file extraction.

The option -s enables you to subsample a bam file.

-s FLOAT  Fraction of templates/pairs to subsample; the inte‐
                           ger part is treated as the seed for the random num‐
                           ber generator [-1]

samtools view -s 0.25 -b mymapping.bam > random_25%_of_mymapping.bam

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 10.1 years ago by Cytosine ▴ 460

1

Entering edit mode

I would change the 0 to something more interesting though, like

samtools view -s 23241.25 -b mymapping.bam > random_25%_of_mymapping.bam

For the RNG to subsample a bit more randomly.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 10.1 years ago by John 13k

0

Entering edit mode

I had 167115 reads in the input BAM, so 25% is 41779: using 0.25 gets 42134; using 23241.25 gets 41567 (which is closer to 41779 but still 212 off). I would appreciate a clarification on how could I produce output closer to the specified percent, and a brief explanation on why replacing 0 with 23241 made the output closer to 25%, and how could it increase randomness? Also, just to confirm, in both approaches if reads are paired, they are randomly reduced also in pairs (just to make sure the reads are not selected randomly regardless of their pairing). Thank you.

ADD REPLY • link 10.1 years ago by trakhtenberg ▴ 160