Question: Sampling with replacement from a bam file
0
gravatar for harleen.sn
9 months ago by
harleen.sn10
United States
harleen.sn10 wrote:

Hi,

I would like to randomly sample with replacement from bam files. One way to do that would be to convert the bam files to sam format and use python's np.random.choice. Is there a more direct way to do this that doesn't require bam to sam conversion? Something like samtools view -s but with replacement?

Thank you for your help!

ADD COMMENTlink modified 9 months ago by Alex Reynolds27k • written 9 months ago by harleen.sn10
2

You got me curious, why do you want to sample with replacement?

ADD REPLYlink written 9 months ago by h.mon24k

I am exploring my data and wanted to measure the shot noise or variance in counts (number of reads per genomic region) upon resampling with replacement. As expected, it followed a Poisson distribution, which was nice to see. I can now add this variance to the biological variance for downstream analyses.

ADD REPLYlink written 8 months ago by harleen.sn10

Could you explain what do you mean by replacement?

ADD REPLYlink written 9 months ago by venu6.0k
1

By sampling with replacement, I mean that one could draw the same read more than once by random chance, as opposed to sampling without replacement, where a read can only be drawn once from a population, which is what I think samtools view -s does.

For example, if I wanted to randomly sample 10 numbers between 1-10, without replacement I would get 1,2,3,4,5,6,7,8,9,10 whereas with replacement I might get, 1,1,2,4,4,5,7,7,8,9.

ADD REPLYlink written 9 months ago by harleen.sn10

Something like this

ADD REPLYlink written 9 months ago by bioExplorer3.7k
3
gravatar for Devon Ryan
9 months ago by
Devon Ryan88k
Freiburg, Germany
Devon Ryan88k wrote:

In python:

import pysam
import numpy as np

bam = pysam.AlignmentFile("something.bam")
total = bam.mapped
selected = sorted(np.random.choice(total, size=1000))  # sample 1000 reads, with replacement
idxSelected = 0
idxBam = 0
for read in bam:
    if read.is_unmapped:
        continue
    if idxSelected > len(selected):
        break
    while idxBam = selected[idxSelected]:
        ... do something ...
        idxSelected += 1
    idxBam += 1
ADD COMMENTlink written 9 months ago by Devon Ryan88k

Thanks, Devon! I was able to build on your code with minor modifications. Using pysam was a good idea. I just had to remember to keep my bam index files in the same directory!

ADD REPLYlink written 9 months ago by harleen.sn10
3
gravatar for Alex Reynolds
9 months ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

Assuming that you can't find a tool that works natively with BAM without converting to SAM, I have a tool called sample that uses reservoir sampling to pull an ordered or randomly-shuffled uniformly random sample from an input text file delimited by newline characters. It can sample with or without replacement.

$ ./sample --help
sample
  version: 1.0.2
  author:  Alex Reynolds

Usage: sample [--sample-size=n] [--lines-per-offset=n] [--sample-without-replacement | --sample-with-replacement] [--shuffle | --preserve-order] [--hybrid | --mmap | --cstdio] [--rng-seed=n] <newline-delimited-file>

...
  --sample-with-replacement     | -r      Sample with replacement (optional)
...

You could run sample on a SAM file to sample with replacement from that file.

ADD COMMENTlink written 9 months ago by Alex Reynolds27k
1
gravatar for Alex Reynolds
9 months ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

For fun, here's another generic approach that uses seq, shuf, and sed to sample a text file with replacement:

$ LINES=`wc -l reads.sam | awk '{print $1}'`
$ N=1234
$ for LINE in `seq ${LINES} | shuf -r -n ${N}`; do sed "${LINE}q;d" reads.sam >> sample.sam; done

The seq command generates a list of line numbers, shuf -r samples N of those line numbers with replacement, and sed prints out the line, which is appended to the output.

Jumping around the input file like this might be kinda slow. To get better performance, maybe pipe the list of line numbers to sort -n and then sed on that, to read out an in-order sample:

$ for LINE in `seq ${LINES} | shuf -r -n ${N} | sort -n`; do sed "${LINE}q;d" reads.sam >> sample.sam; done
ADD COMMENTlink modified 9 months ago • written 9 months ago by Alex Reynolds27k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2113 users visited in the last hour