Get Random Alignments From Sam File
9.2 years ago
komal.rathi ★ 4.0k

Hi all,

I have a .sam file and I want to extract random alignments from it.

I googled a bit and there are quite a few tools that have commands to "downsample a sam file" like Picard and GATK. I couldn't really find if these tools would allow me to extract a certain percentage or certain number of alignments from the SAM file.

I also found an alternative and have been able to get the desired results from my sam file. Link: sample sam file

But I would like to do the same thing using Picard or GATK (or any other tool for that matter) because I am "curious" whether there are any tools that can reproduce the results using a single line of code. Any suggestions? I am just asking for suggestions and not for complete answers. I am just inquisitive and even a little hint would be helpful.

For people who think this is a cross-posting: I posted this on www.seqanswers.com as well but received no response, which is why I am posting it here. This is my post on seqanswers.com

sam picard gatk • 4.0k views

Thanks. Deedee hit it on the head as well.

9.2 years ago
Dan D 7.3k

Picard's DownsampleSam allows you to extract a percentage. Let's say you want to extract 20% of your reads:

java -Xmx2g -jar DownsampleSam.jar INPUT=myBigSam.sam OUTPUT=mySmallSam.sam PROBABILITY=0.2


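If you want a certain number of reads, it would be pretty straightforward to write a script. Alternatively, you can run wc -l myBigSam.sam on the command line to get the total number of lines (this count includes any header lines starting with @, so subtract those first), divide your desired number of reads by that number, and use the result as the value of your PROBABILITY parameter. Because the sampling is probabilistic, the output will contain approximately, not exactly, that many reads.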
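For the "write a script" route, here is a minimal sketch in Python using reservoir sampling, which picks exactly the requested number of alignment lines in one pass without loading the whole file. The function names `sample_sam_lines` and `downsample_file` are made up for illustration and are not part of Picard or any other tool. It assumes header lines start with @ and treats every other line as one alignment, so mates of a pair may be split up.

```python
import random

def sample_sam_lines(lines, n_reads, seed=None):
    """Return (header, sampled): all header lines, plus exactly n_reads
    alignment lines chosen uniformly at random (or all of them if the
    input has fewer than n_reads alignments)."""
    rng = random.Random(seed)
    header, reservoir = [], []
    seen = 0  # number of alignment lines read so far
    for line in lines:
        if line.startswith("@"):  # SAM header line
            header.append(line)
            continue
        seen += 1
        if len(reservoir) < n_reads:
            # Fill the reservoir with the first n_reads alignments.
            reservoir.append((seen, line))
        else:
            # Replace an existing entry with probability n_reads / seen.
            j = rng.randrange(seen)
            if j < n_reads:
                reservoir[j] = (seen, line)
    reservoir.sort()  # restore the original file order
    return header, [line for _, line in reservoir]

def downsample_file(in_path, out_path, n_reads, seed=None):
    """Hypothetical wrapper: read a SAM file, write header + sampled reads."""
    with open(in_path) as src:
        header, sampled = sample_sam_lines(src, n_reads, seed)
    with open(out_path, "w") as dst:
        dst.writelines(header)
        dst.writelines(sampled)
```

Something like `downsample_file("myBigSam.sam", "mySmallSam.sam", 100000)` would then write the header plus 100000 randomly chosen alignments, provided the input has at least that many.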


Thanks for the prompt answer. I was just exploring Picard, and when I set my probability to 0.5 I was expecting half of the reads to come up in the output, but it gave me "nearly half" of the reads, which is what got me confused. Thanks again!
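To see why a probability of 0.5 gives "nearly half", here is a small simulation in Python. It assumes each read is kept independently with the given probability, which is a simplified model of probability-based downsampling and not Picard's actual code; `bernoulli_downsample` is a made-up name.

```python
import random

def bernoulli_downsample(reads, probability, seed=None):
    """Keep each read independently with the given probability."""
    rng = random.Random(seed)
    return [read for read in reads if rng.random() < probability]

# 10,000 placeholder reads; the count kept varies from run to run.
reads = [f"read{i}" for i in range(10_000)]
for seed in range(3):
    kept = bernoulli_downsample(reads, 0.5, seed=seed)
    print(f"seed={seed}: kept {len(kept)} of {len(reads)} reads")
```

Each run keeps a count close to 5,000 but rarely exactly 5,000, because the number kept is itself a random quantity.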


Yes, "random" and "probabilistic" sampling are not usually compatible with "exactly". If you want exact, semi-random downsampling you would need to do something like:

1. Chunk n reads together.
2. Choose read k from each chunk, where k = sample(n) is a random position within the chunk.
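The two steps above could be sketched in Python roughly as follows. This reads step 2 as "pick one read at a random position k in each chunk", which is an interpretation rather than something spelled out in the post, and `chunked_downsample` is a made-up name. Any trailing chunk with fewer than n reads is dropped, so the output has exactly floor(total / n) reads.

```python
import random

def chunked_downsample(reads, n, seed=None):
    """Keep one randomly chosen read from every full chunk of n
    consecutive reads; a trailing partial chunk is discarded."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    kept, chunk = [], []
    for read in reads:
        chunk.append(read)
        if len(chunk) == n:
            k = rng.randrange(n)  # random position within the chunk
            kept.append(chunk[k])
            chunk = []
    return kept
```

With n = 5 this keeps exactly one read in five (20%), but the choice is only semi-random: two reads from the same chunk can never both be selected.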