Question: How to reduce reads in fastq, bam, bed files in a random manner?
1
gravatar for trakhtenberg
4.5 years ago by
trakhtenberg150
United States
trakhtenberg150 wrote:

Hello, what command could I use to extract every 3rd or every 4th pair of reads from two fastq files corresponding to pair-end reads (file read1 and file read2)? I want to use it for getting smaller raw data files, which would be representative of the larger files, since they would be reduced randomly.
Also, the same question regarding how to randomly reduce bam and bed files, such as accepted_hits.bam and junctions.bed generated by tophat?

Thank you,
Ephraim Trakhtenberg

rna-seq • 4.3k views
ADD COMMENTlink modified 4.5 years ago by Alex Reynolds27k • written 4.5 years ago by trakhtenberg150
1

The approach you mention is not technically random, but this task is a pretty common one. There has been plenty of good discussion of this site about sampling large files, for example, Selecting Random Pairs From Fastq? (and see the 'similar posts' section to the right of that page for more posts on the topic). Hope that helps.

ADD REPLYlink written 4.5 years ago by SES8.1k
1
gravatar for Alex Reynolds
4.5 years ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

To uniformly sample a third of your paired reads (without replacement) from a paired fastq file(*), via bash and sample:

$ fq_count=`calc "$(wc -l < all_reads.fastq) / 8"`
$ samples=`calc "floor($fq_count / 3)"`
$ sample --lines-per-offset=8 --sample-size=${samples} all_reads.fastq > random_sample.fastq

(* - A paired fastq file would be one interleaved file that stores paired reads consecutively, one pair after its mate.) 

Similarly, to sample a BED file:

$ sample --sample-size=12345 foo.bed > random_sample.bed

Or if you have sufficient memory:

$ shuf foo.bed | head -n12345 > random_sample.bed
ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Alex Reynolds27k
1

What is calc? Is that a newer builtin, or something else? Google did not turn up anything so I'm curious.

ADD REPLYlink written 4.5 years ago by SES8.1k
1

The calc tool is not a builtin, but I am using bash notation for shell commands. Source is available here, though it is probably available in convenient package form with your favorite package manager.

ADD REPLYlink written 4.5 years ago by Alex Reynolds27k
1

A minor point, recent versions of shuf will do memory-efficient reservoir sampling with the -n argument, so you don't need to use head.

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by matted7.0k
1

I think expr $fq_count / 3 will give you the floor by default, so you could replace all "calc" with "expr" in this case.

 

ADD REPLYlink written 4.5 years ago by Matt Shirley8.8k
1
gravatar for Cytosine
4.5 years ago by
Cytosine440
Ljubljana, Slovenia
Cytosine440 wrote:

I believe a truly random approach is already implemented in samtools for bam file extraction.

The option -s enables you to subsample a bam file.

 

-s FLOAT  Fraction of templates/pairs to subsample; the inte‐
                           ger part is treated as the seed for the random num‐
                           ber generator [-1]

 

samtools view -s 0.25 -b mymapping.bam > random_25%_of_mymapping.bam

ADD COMMENTlink written 4.5 years ago by Cytosine440
1

I would change the 0 to something more interesting though, like
samtools view -s 23241.25 -b mymapping.bam > random_25%_of_mymapping.bam
For the RNG to subsample a bit more randomly.

 

ADD REPLYlink written 4.5 years ago by John12k

I had 167115 reads in the input BAM, so 25% is 41779: using 0.25 gets 42134; using 23241.25 gets 41567 (which is closer to 41779 but still 212 off). I would appreciate a clarification on how could I produce output closer to the specified percent, and a brief explanation on why replacing 0 with 23241 made the output closer to 25%, and how could it increase randomness? Also, just to confirm, in both approaches if reads are paired, they are randomly reduced also in pairs (just to make sure the reads are not selected randomly regardless of their pairing). Thank you.

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by trakhtenberg150
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1400 users visited in the last hour