Question

Downsample BAM file to specific amount of reads

2

Entering edit mode

2.7 years ago

kstangline ▴ 80

I'm attempting to downsample a bam file I have to 10 Million reads using seqtk.

For example:

seqtk sample -s100 myfile.bam 10000000 > myfile_10m.bam

However, when I run this, there are 0 lines in my bam file. This doesn't seem to be correct.

I'm curious if there is another way to downsample a BAM file to 10 Million reads?

bam • 11k views

ADD COMMENT • link updated 4 weeks ago by GenoMax 141k • written 2.7 years ago by kstangline ▴ 80

1

Entering edit mode

seqtk (https://github.com/lh3/seqtk) expects FASTA or FASTQ as input.

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

1

Entering edit mode

reformat.sh from BBMap suite.

reformat.sh in=your.bam out=sampled.bam options

Use one of the sampling options below:

reads=-1                Set to a positive number to only process this many INPUT reads (or pairs), then quit.
skipreads=-1            Skip (discard) this many INPUT reads before processing the rest.
samplerate=1            Randomly output only this fraction of reads; 1 means sampling is disabled.
sampleseed=-1           Set to a positive number to use that prng seed for sampling (allowing deterministic sampling).
samplereadstarget=0     (srt) Exact number of OUTPUT reads (or pairs) desired.
samplebasestarget=0     (sbt) Exact number of OUTPUT bases desired.
                        Important: srt/sbt flags should not be used with stdin, samplerate, qtrim, minlength, or minavgquality.
upsample=f              Allow srt/sbt to upsample (duplicate reads) when the target is greater than input.

Along with any of processing parameters you want:

mappedonly=f            Toss unmapped reads.
unmappedonly=f          Toss mapped reads.
pairedonly=f            Toss reads that are not mapped as proper pairs.
unpairedonly=f          Toss reads that are mapped as proper pairs.
primaryonly=f           Toss secondary alignments.  Set this to true for sam to fastq conversion.
minmapq=-1              If non-negative, toss reads with mapq under this.
maxmapq=-1              If non-negative, toss reads with mapq over this.

ADD REPLY • link 2.7 years ago by GenoMax 141k

0

Entering edit mode

Reduce read count as in select

a). Any 10M reads or

b).only high quality top 10M reads?

ADD REPLY • link 2.7 years ago by Ankit ▴ 500

score 6 · Answer 1 · 2021-08-21

6

Entering edit mode

2.7 years ago

seidel 11k

You can use samtools to randomly sample a bam file:

samtools view -b -s 0.1 foo.bam > sampled.bam

The -s option randomly samples the file and returns the specified fraction of reads. Thus, you would have to calculate what fraction of your file accounts for 10M reads (e.g. if you have 20M, your fraction would be 0.5). The -b option ensures the output is binary (bam format).

ADD COMMENT • link 2.7 years ago by seidel 11k

2

Entering edit mode

For calculation:

reads=10000000
bam=your.bam
fraction=$(samtools idxstats $bam | cut -f3 | awk -v ct=$reads 'BEGIN {total=0} {total += $1} END {print ct/total}')
samtools view -b -s ${fraction} foo.bam > sampled.bam

Note that a) this may not give exactly 10mio reads due to some rounding errors, and b) this will give 10mio reads which in a paired-end bam file equals 5mio "pairs", so for 10mio pairs it would need to be 20000000 for reads.

Does anyone know whether samtools is smart enough to keeps both mates of a pair, I actually do not know?

ADD REPLY • link 2.7 years ago by ATpoint 81k

2

Entering edit mode

From the docs (-s is the shortcut for subsample):

--subsample FLOAT

Output only a proportion of the input alignments, as specified by 0.0 ≤ FLOAT ≤ 1.0, which gives the fraction of templates/pairs to be kept. This subsampling acts in the same way on all of the alignment records in the same template or read pair, so it never keeps a read but not its mate.

ADD REPLY • link 2.7 years ago by seidel 11k

0

Entering edit mode

Hi, thank you for your methods! I have a bam file of paired end reads, how should I consider the read number for paired end? I want to extract 20M out of 30M from a bam file, but when I count that bam file with samtools view |wc -l, the lines are 60M.

ADD REPLY • link 4 weeks ago by baijiangshan9726 • 0

0

Entering edit mode

Each library fragment is sequenced from both ends leading to x2 the number of reads. So decide if you want to get N fragments = 2 x N reads.If there are 30 M reads total representing 15M fragments in your file then you will want 10M fragments if you wish to end up with 20M reads.

ADD REPLY • link 4 weeks ago by GenoMax 141k

score 0 · Answer 2 · 2021-08-23

Another option is with Genozip (disclosure: I am the author).

genozip myfile.bam
genocat myfile.bam.genozip --downsample 100 --output myfile_10m.bam

You can even select which of the "shards" you want (0 to 99 in this case), eg:

genocat myfile.bam.genozip --downsample 100,3 --sam | my-pipeline-tools

As a side benefit, your BAM files will be 2-3X smaller.

Documentation: https://genozip.com

Paper: https://www.researchgate.net/publication/349347156_Genozip_-_A_Universal_Extensible_Genomic_Data_Compressor