Question: Split BAM file into two subsamples
0
2.0 years ago by
colin.kern150
United States
colin.kern150 wrote:

I want to create two random subsamples from a BAM file. I don't want to just run "samtools view -s 0.5" twice, because I want to put half of the reads into one file and the other half into a second file, without replacement. Is there an easy way to do this? I've considered using samtools to output the alignments in SAM format and piping them to the shuf and split commands, but I think I'd need to do some extra work to keep the paired-end reads together. Does anyone have any ideas?

alignment • 1.5k views
modified 10 weeks ago by brian.l.hill0 • written 2.0 years ago by colin.kern150

"since reads are four lines " uh ?

written 2.0 years ago by Pierre Lindenbaum104k

Never mind, I was thinking of FASTQ.

written 2.0 years ago by colin.kern150
3
2.0 years ago by
Devon Ryan76k
Freiburg, Germany
Devon Ryan76k wrote:

Assuming single-end reads, something like this should work:

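# Copy the header into both output files
samtools view -H foo.bam > f1.sam
cp f1.sam f2.sam
# Append odd-numbered records to f1.sam and even-numbered records to f2.sam
samtools view foo.bam | awk '{if(NR%2){print >> "f1.sam"} else {print >> "f2.sam"}}'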

I'm sure there's a one line version of that, but this is simple enough.

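For paired-end reads in a name-sorted BAM, just change NR%2 to (NR-1)%4<2, which sends lines 1-2 to the first file, lines 3-4 to the second, and so on. This assumes exactly two records per read name (no singletons or secondary alignments). Again, there's likely a bug in there, but it's something to start with.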

modified 2.0 years ago • written 2.0 years ago by Devon Ryan76k
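
A note on the paired-end case above: counting lines only keeps mates together when every read name has exactly two records. A minimal sketch of an alternative, assuming name-sorted input, is to switch output files whenever the read name in column 1 changes, so singletons and extra alignments stay with their name. The function name split_by_qname and its two arguments are made up for illustration; it expects headerless SAM records on stdin.

```shell
# Sketch: split name-sorted SAM records (no header) read from stdin into two
# files, switching files each time the read name (column 1) changes.
# split_by_qname and its arguments are hypothetical names, not a samtools feature.
split_by_qname() {
    awk -F'\t' -v out1="$1" -v out2="$2" '
        $1 != prev { n++; prev = $1 }          # new read name starts a new group
        { print >> ((n % 2) ? out1 : out2) }   # odd groups to out1, even to out2
    '
}

# Assumed usage (file names are placeholders):
#   samtools sort -n -o foo.namesorted.bam foo.bam
#   samtools view -H foo.namesorted.bam > f1.sam; cp f1.sam f2.sam
#   samtools view foo.namesorted.bam | split_by_qname f1.sam f2.sam
```

Like the line-counting version, this alternates rather than randomizes, so it relies on the name order being unrelated to genomic position.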

Won't this work?

wc -l file.sam # line_number/4, say we've got 1000

sed -n 1,500p file.sam > first_half.sam

sed -n 501,1000p file.sam > second_half.sam
written 2.0 years ago by venu4.8k
1

"file.sam" would contain a header, which I assume should be kept completely. Aside from that, sure, one could use sed instead. There are many ways to skin this proverbial cat. I'd personally use python with pysam, since then it'd be simple to handle singletons and such.

written 2.0 years ago by Devon Ryan76k

Yeah, I would've taken care of it if I had thought of the header. I just wanted to confirm that my approach to this problem was correct. Thank you.

written 2.0 years ago by venu4.8k

Does this do a random sample? If I'm understanding how it works, it just alternates putting reads into each file, which if the bam is sorted wouldn't really be random.

written 2.0 years ago by colin.kern150

You would name-sort the BAM file first; the results would then be random enough.

written 2.0 years ago by Devon Ryan76k
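
If alternating over name-sorted reads is not random enough, one possible variation is to draw a random file for each read name the first time it is seen and remember the choice, so mates follow each other regardless of sort order. This is only a sketch: split_random, its arguments, and the default seed are hypothetical, the two halves will be roughly rather than exactly equal in size, and the awk array grows with the number of distinct read names.

```shell
# Sketch: assign each read name (column 1) to one of two files at random and
# send every later record with the same name to the same file.
# Reads headerless SAM records from stdin; split_random is a hypothetical name.
split_random() {
    awk -F'\t' -v out1="$1" -v out2="$2" -v seed="${3:-42}" '
        BEGIN { srand(seed) }                                      # fixed seed for repeatable runs
        !($1 in dest) { dest[$1] = (rand() < 0.5) ? out1 : out2 }  # first sighting: pick a file
        { print >> (dest[$1]) }                                    # append to the remembered file
    '
}

# Assumed usage (file names are placeholders):
#   samtools view -H foo.bam > f1.sam; cp f1.sam f2.sam
#   samtools view foo.bam | split_random f1.sam f2.sam 123
```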
0
2.0 years ago by
runnerbio88100
runnerbio88100 wrote:

Not the best way, but a fast one: extract the names of all reads in the SAM file into a new file, shuffle them with sort -R, and then use grep (or an intersection) to pull the alignments for the first n shuffled names out of the SAM file.

You first need to convert the BAM into SAM.

Not sure if I explained myself well enough.

written 2.0 years ago by runnerbio88100
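
The name-shuffling idea above could look roughly like the following. It is a sketch under assumptions: the input is a headerless SAM file on disk, shuf from GNU coreutils is available (used here in place of sort -R), and split_by_shuffled_names with its three arguments is a hypothetical name. Half of the distinct read names (rounded up) go to the first file with all of their records, the rest to the second.

```shell
# Sketch: shuffle the distinct read names, keep the first half as the "picked"
# set, then route each SAM record by whether its name was picked.
# Arguments: headerless SAM file, first output file, second output file.
split_by_shuffled_names() {
    sam="$1"; out1="$2"; out2="$3"
    names=$(mktemp)
    cut -f1 "$sam" | sort -u | shuf > "$names"       # distinct names in random order
    half=$(( ($(wc -l < "$names") + 1) / 2 ))        # half of the names, rounded up
    head -n "$half" "$names" |
        awk -F'\t' -v out1="$out1" -v out2="$out2" '
            NR == FNR { pick[$1]; next }             # stdin: remember the picked names
            { print >> (($1 in pick) ? out1 : out2) }  # SAM: route record by its name
        ' - "$sam"
    rm -f "$names"
}

# Assumed usage (file names are placeholders):
#   samtools view foo.bam > body.sam
#   samtools view -H foo.bam > f1.sam; cp f1.sam f2.sam
#   split_by_shuffled_names body.sam f1.sam f2.sam
```

Because the split is made on read names rather than lines, mates stay together and the input does not need to be sorted.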
0
10 weeks ago by
brian.l.hill0 wrote:

I have created a Python script to do this:

Script to split input BAM file randomly (without replacement) into two BAM files

Hopefully this is useful.

written 10 weeks ago by brian.l.hill0