Question: Get Random Alignments From Sam File
6.0 years ago by komal.rathi (Children's Hospital of Philadelphia, Philadelphia, PA)
komal.rathi wrote:

Hi all,

I have a .sam file and I want to extract random alignments from it.

I googled a bit and there are quite a few tools that have commands to "downsample a sam file" like Picard and GATK. I couldn't really find if these tools would allow me to extract a certain percentage or certain number of alignments from the SAM file.

I also found an alternative and have been able to get the desired results from my sam file. Link: sample sam file

But I would like to do the same thing using Picard or GATK (or any other tool for that matter) because I am "curious" whether there are any tools that can reproduce the results using a single line of code. Any suggestions? I am just asking for suggestions and not for complete answers. I am just inquisitive and even a little hint would be helpful.

For people who think this is cross-posting: I posted this on www.seqanswers.com as well but received no response, which is why I am posting it here. This is my post on seqanswers.com

Thanks in advance!

gatk picard sam • 3.3k views

Matt Shirley posted an answer here A: Get random alignments from SAM file

Reply written 6.0 years ago by Pierre Lindenbaum • modified 7 months ago by RamRS

Thanks. Deedee hit it on the head as well.

Reply written 6.0 years ago by Matt Shirley • modified 7 months ago by RamRS
6.0 years ago by Dan D (Tennessee)
Dan D wrote:

Picard's DownsampleSam lets you extract a percentage of the reads. Let's say you want to extract 20% of your reads:

java -Xmx2g -jar DownsampleSam.jar INPUT=myBigSam.sam OUTPUT=mySmallSam.sam PROBABILITY=0.2
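To illustrate what sampling with a probability means, here is a minimal Python sketch. It is not Picard's implementation; the function name `bernoulli_downsample`, the fixed seed, and the toy reads are all made up for the example. Each alignment line is kept independently with the given probability, so the number of reads kept lands close to, but rarely exactly on, the expected count.

```python
import random


def bernoulli_downsample(lines, probability, seed=1):
    """Keep every header line; keep each alignment line with the given probability.

    Hypothetical helper for illustration only, not Picard's code.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    kept = []
    for line in lines:
        if line.startswith("@"):
            kept.append(line)  # SAM header lines always pass through
        elif rng.random() < probability:
            kept.append(line)  # independent coin flip per alignment
    return kept


# Toy input: one header line plus 1000 made-up alignment records.
header = ["@HD\tVN:1.0"]
alignments = [
    f"read{i}\t0\tchr1\t{i + 1}\t60\t4M\t*\t0\t0\tACGT\tIIII" for i in range(1000)
]

sampled = bernoulli_downsample(header + alignments, probability=0.2)
n_kept = sum(1 for line in sampled if not line.startswith("@"))
print(n_kept)  # close to 200, but not guaranteed to be exactly 200
```

This toy version treats every line independently, so it does nothing special for paired-end mates.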

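If you want a certain number of reads, it would be pretty straightforward to write a script. Alternatively, count the alignments in the file, divide your desired number of reads by that count, and use the result as the value of the PROBABILITY parameter. Note that wc -l myBigSam.sam counts the header lines too, so exclude them, for example with grep -vc '^@' myBigSam.sam or samtools view -c myBigSam.sam. Because the sampling is probabilistic, the output will contain approximately, not exactly, the requested number of reads.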
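The arithmetic for turning a target read count into a PROBABILITY value can be sketched in Python. The helper names `count_alignments` and `downsample_probability` and the throwaway demo file are assumptions made for this example. Header lines (those starting with `@`) are skipped when counting, since they are not alignments.

```python
import os
import tempfile


def count_alignments(sam_path):
    """Count alignment records in a SAM file, skipping '@' header lines."""
    with open(sam_path) as handle:
        return sum(1 for line in handle if not line.startswith("@"))


def downsample_probability(desired_reads, total_reads):
    """Fraction of reads to keep so that about desired_reads remain."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    # Cap at 1.0: you cannot keep more reads than the file contains.
    return min(1.0, desired_reads / total_reads)


# Demo on a tiny throwaway SAM file: 2 header lines and 50 made-up alignments.
with tempfile.NamedTemporaryFile("w", suffix=".sam", delete=False) as tmp:
    tmp.write("@HD\tVN:1.0\n@SQ\tSN:chr1\tLN:1000\n")
    for i in range(50):
        tmp.write(f"read{i}\t0\tchr1\t{i + 1}\t60\t4M\t*\t0\t0\tACGT\tIIII\n")
    demo_path = tmp.name

total = count_alignments(demo_path)
probability = downsample_probability(10, total)
os.remove(demo_path)
print(total, probability)
```

The printed probability is the value you would pass as PROBABILITY. The output of the downsampling run will still only approximate the target count.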


Thanks for the prompt answer. I was just exploring Picard, and when I set the probability to 0.5 I expected exactly half of the reads to come up in the output, but it gave me "nearly half" of the reads, which is what got me confused. Thanks again!

Reply written 6.0 years ago by komal.rathi

Yes, "random" and "probabilistic" sampling are not usually compatible with "exactly". If you want exact, semi-random downsampling you would need to do something like:

  1. Chunk n reads together.
  2. Choose k reads from the chunk, where k = sample(n).
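
One way to read the two steps above is sketched below in Python. This is an interpretation, not necessarily what the comment author had in mind: the function name `chunked_sample` and its parameters are invented for the example, and k is taken to be a fixed number of reads drawn without replacement from every chunk of n. The kept fraction is then exactly k/n whenever the total number of reads is a multiple of n.

```python
import random
from itertools import islice


def chunked_sample(reads, chunk_size, keep_per_chunk, seed=1):
    """Yield exactly keep_per_chunk randomly chosen reads from each full chunk.

    Hypothetical helper for illustration. A final partial chunk contributes
    at most keep_per_chunk reads, so the fraction is exact only when the
    number of reads is a multiple of chunk_size.
    """
    if not 0 <= keep_per_chunk <= chunk_size:
        raise ValueError("keep_per_chunk must be between 0 and chunk_size")
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    iterator = iter(reads)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        k = min(keep_per_chunk, len(chunk))
        # Pick k distinct positions, then emit them in their original order.
        for index in sorted(rng.sample(range(len(chunk)), k)):
            yield chunk[index]


# Toy input: 1000 made-up read names, keeping 2 out of every 10.
reads = [f"read{i}" for i in range(1000)]
sampled = list(chunked_sample(reads, chunk_size=10, keep_per_chunk=2))
print(len(sampled))  # 200
```

The trade-off is that the sample is only random within each chunk, which is why the comment calls it semi-random.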
Reply written 6.0 years ago by Matt Shirley • modified 7 months ago by RamRS