Question: Split BAM file into two subsamples
colin.kern (United States) wrote, 2.8 years ago:

I want to create two random subsamples from a BAM file. I don't want to just run "samtools view -s 0.5" twice, because I want half of the reads in one file and the other half in the other, without replacement. Is there an easy way to do this? I've considered using samtools to output the alignments in SAM format and piping them to the shuf and split commands, but I think I'd need some extra steps to keep paired-end reads together. Does anyone have any ideas?

alignment • 2.2k views
modified 11 months ago by brian.l.hill • written 2.8 years ago by colin.kern

"since reads are four lines" uh?

written 2.8 years ago by Pierre Lindenbaum

Never mind, I was thinking of FASTQ.

written 2.8 years ago by colin.kern

Devon Ryan (Freiburg, Germany) wrote, 2.8 years ago:

Assuming single-end reads, something like this should work:

samtools view -H foo.bam > f1.sam
cp f1.sam f2.sam
samtools view foo.bam | awk '{if(NR%2){print >> "f1.sam"} else {print >> "f2.sam"}}'

I'm sure there's a one-line version of that, but this is simple enough.

For paired-end reads in a name-sorted file, change NR%2 to (NR-1)%4<2 so that each consecutive pair of lines lands in the same file. Again, there's likely a bug in there, but it's something to start with.
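
As a quick check on the paired-end modulo rule, the sketch below tabulates which 1-based line numbers (awk's NR) a rule sends to the first file. It assumes name-sorted input with exactly two lines per pair, and the helper name first_file_lines is made up for illustration. With the NR-1 shift, mates on lines 1-2 and 5-6 stay together. Without the shift, NR%4<2 selects lines 1, 4, 5 and 8, which separates mates.

```python
def first_file_lines(rule, n_lines):
    """Return the 1-based line numbers that `rule` sends to the first file."""
    return [nr for nr in range(1, n_lines + 1) if rule(nr)]


# awk's NR starts at 1, so the range above mirrors that.
shifted = first_file_lines(lambda nr: (nr - 1) % 4 < 2, 8)
unshifted = first_file_lines(lambda nr: nr % 4 < 2, 8)

print(shifted)    # [1, 2, 5, 6]  -> mates (1,2) and (5,6) kept together
print(unshifted)  # [1, 4, 5, 8]  -> line 1 is separated from its mate on line 2
```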

modified 2.8 years ago • written 2.8 years ago by Devon Ryan

Won't this work?

wc -l file.sam # line_number/4, say we've got 1000

sed -n 1,500p file.sam > first_half.sam

sed -n 501,1000p file.sam > second_half.sam
written 2.8 years ago by venu

"file.sam" would contain a header, which I assume should be kept completely. Aside from that, sure, one could use sed instead. There are many ways to skin this proverbial cat. I'd personally use Python with pysam, since then it'd be simple to handle singletons and such.
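
To illustrate the singleton point without depending on pysam, here is a minimal standard-library sketch that works on SAM text lines instead. It assumes the input is name-sorted, so all records of a template are adjacent, and it alternates whole templates rather than single lines. The function names and the truncated demo records are made up for illustration.

```python
from itertools import groupby


def qname(record):
    """Return the read name (first tab-separated SAM field)."""
    return record.split("\t", 1)[0]


def split_name_sorted(sam_lines):
    """Alternate whole templates between two outputs, copying the header to both."""
    lines = [line.rstrip("\n") for line in sam_lines]
    header = [line for line in lines if line.startswith("@")]
    records = [line for line in lines if line and not line.startswith("@")]
    outputs = (list(header), list(header))
    # groupby only merges *adjacent* records, hence the name-sorted assumption.
    for index, (_name, group) in enumerate(groupby(records, key=qname)):
        outputs[index % 2].extend(group)
    return outputs


# Truncated records (first four SAM fields only), purely for demonstration.
demo = [
    "@HD\tVN:1.6\tSO:queryname",
    "r1\t99\tchr1\t100", "r1\t147\tchr1\t300",   # pair
    "r2\t0\tchr1\t500",                          # single record
    "r3\t99\tchr2\t100", "r3\t147\tchr2\t250",   # pair
]
first, second = split_name_sorted(demo)
print([qname(r) for r in first[1:]])   # ['r1', 'r1', 'r3', 'r3']
print([qname(r) for r in second[1:]])  # ['r2']
```

Because the unit is the template rather than the line, a lone record does not shift every later pair out of step, which is the weakness of a fixed modulo on line numbers.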

written 2.8 years ago by Devon Ryan

Yeah, I would have handled the header if I had thought of it. I just wanted to confirm that my approach to this problem was correct. Thank you.

written 2.8 years ago by venu

Does this produce a random sample? If I understand how it works, it just alternates reads between the two files, which wouldn't really be random if the BAM is sorted.

written 2.8 years ago by colin.kern

You would name-sort the BAM file first; the results would then be random enough.
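
If name sorting is not convenient, a different technique (not the one described above) is to pick the output from a hash of the read name. The sketch below is a minimal version under that assumption: mates share a name, so they always land in the same file whatever the sort order, and the split is roughly rather than exactly half. The function names and the seed parameter are made up for illustration. hashlib is used because Python's built-in hash() of strings changes between runs.

```python
import hashlib


def bucket(read_name, seed=0):
    """Map a read name to 0 or 1 via a seeded MD5 digest (stable across runs)."""
    digest = hashlib.md5(f"{seed}:{read_name}".encode()).digest()
    return digest[0] & 1


def split_by_name_hash(sam_lines, seed=0):
    """Return two lists of SAM lines; header lines are copied to both."""
    halves = ([], [])
    for line in sam_lines:
        if line.startswith("@"):
            halves[0].append(line)
            halves[1].append(line)
        else:
            name = line.split("\t", 1)[0]
            halves[bucket(name, seed)].append(line)
    return halves
```

In practice the lines could come from "samtools view -h foo.bam" on standard input, with each list written out as a SAM file. Changing the seed gives a different but reproducible split.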

written 2.8 years ago by Devon Ryan

Folder40g wrote, 2.8 years ago:

Not the best way, but a fast one: extract all the read names from the SAM file into a new file, shuffle them with sort -R, and then use grep (or an intersection) to pull the records for the first n shuffled names back out of the SAM file.

You first need to convert the BAM into SAM.

 

Not sure if I explained myself well enough.
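
One possible reading of the idea above, as a minimal Python sketch rather than the grep pipeline itself: collect the unique read names, shuffle them, give the first half of the names to one output and the rest to the other. It assumes the SAM text fits in memory, and the function name and the seed parameter are made up for illustration.

```python
import random


def split_by_shuffled_names(sam_lines, seed=None):
    """Shuffle unique read names; first half of names -> output 1, rest -> output 2."""
    lines = list(sam_lines)
    header = [line for line in lines if line.startswith("@")]
    records = [line for line in lines if not line.startswith("@")]
    # Sort before shuffling so a given seed always yields the same split.
    names = sorted({rec.split("\t", 1)[0] for rec in records})
    random.Random(seed).shuffle(names)
    chosen = set(names[: len(names) // 2])
    first = [rec for rec in records if rec.split("\t", 1)[0] in chosen]
    second = [rec for rec in records if rec.split("\t", 1)[0] not in chosen]
    return header + first, header + second
```

Unlike a per-read coin flip, this gives exactly half of the read names to each output (up to one name when the count is odd), and both mates of a pair follow their shared name.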

written 2.8 years ago by Folder40g

brian.l.hill wrote, 11 months ago:

I have created a Python script to do this:

Script to split input BAM file randomly (without replacement) into two BAM files

Hopefully this is useful.

written 11 months ago by brian.l.hill