Question: Downsampling By Having A Ceiling On Number Of Reads At Each Site
gravatar for caina89
6.9 years ago by
caina8930 wrote:

Does anyone know how to downsample a high coverage bam file such that it has a ceiling on number of reads at each position? Most of the downsampling tools (Picard, GATK walkers) downsamples the original bam file to a certain fraction of coverage, but that's not suited to my particular purposes. I found the --number option in the GATKCommandLine but for some strange reason it always stops working after about a hundred bases. If anyone has used the --number option in GATK successfully please let me know so I can investigate that more, if not please does anyone have any suggestion? Thanks!

bam coverage • 4.0k views
ADD COMMENTlink modified 6.9 years ago by Chris Miller21k • written 6.9 years ago by caina8930

I'm not sure why the GATK downsample doesn't meet this requirement. If you are using it when variant calling for instance it is downsampling to whatever you specify, the coverage at a position is the number of reads.

ADD REPLYlink written 6.9 years ago by DG7.2k

Hi, thanks for answering! I understand what you're saying, but I'm not using GATK or other variant calling tools to call variants. These reads don't map very well to the reference sequence and I intend to use the reads in the bam files for denovo assembly to get fasta files that can be used in multiple sequence alignment, which would then reveal the "variants" or differences between samples. One problem with that is the coverage (number of reads) per site is different among different samples and also across different sites. As the assemblers may have better information at sites with greater read depths and act different at sites with lower read depths, and there is a systematic difference between groups of samples I have in terms of mean whole sequence read depth, I am worried any variants I get from the multiple sequence alignment may be a result of assembly artefacts. Therefore the hope to downsample everyone's bam files by capping everyone's read depth at every site at a particular number.. Hope I am making sense? Do you have any suggestions how to do this better?

ADD REPLYlink written 6.9 years ago by caina8930
gravatar for Chris Miller
6.9 years ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:

Seems like there ought to be a better way to do it, but the work flow I can come up with off the top of my head is:

1) calculate depth in windows across the genome. (lots of info on doing this in other biostar questions)

2) use samtools to grab all reads from windows below your threshold, stick them in a SAM file.

samtools view -R goodregions.bed file.bam >newfile.sam

3) for each window that exceeded your threshold, calculate the fraction of the reads you want to keep, then downsample those regions accordingly:

samtools view  file.bam 1:1-10000 | sed -n '0~100p' >>newfile.sam

4) sort your sam, convert to bam, and index it.

ADD COMMENTlink modified 6.9 years ago • written 6.9 years ago by Chris Miller21k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1102 users visited in the last hour