Question

Split BAM file into two subsamples

1

Entering edit mode

8.2 years ago

colin.kern ★ 1.1k

I want to create two random subsamples from a BAM file. I don't want to just do "samtools view -s 0.5" twice because I want to put half of the reads into one file and half in the other, without replacement. Is there an easy way to do this? I've considered using samtools to output the alignments in SAM format and pipe to the shuf and split commands, I think I'd need to do some extra things to keep the paired-end reads together. Anyone have any ideas?

alignment • 8.9k views

ADD COMMENT • link updated 2.8 years ago by xiaoleiusc ▴ 140 • written 8.2 years ago by colin.kern ★ 1.1k

0

Entering edit mode

"since reads are four lines " uh ?

ADD REPLY • link 8.2 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Nevermind, I was thining of fastq

ADD REPLY • link 8.2 years ago by colin.kern ★ 1.1k

0

Entering edit mode

Hi, Colin,

I have the same question. I wonder if you already have a solution to solve this problem?

Regards,

Xiao

ADD REPLY • link 2.8 years ago by xiaoleiusc ▴ 140

1

Entering edit mode

6.4 years ago

brian.l.hill ▴ 10

I have created a Python script to do this:

Script to split input BAM file randomly (without replacement) into two BAM files

Hopefully this is useful.

ADD COMMENT • link 6.4 years ago by brian.l.hill ▴ 10

0

Entering edit mode

Hi, Brian, Thanks for sharing the script. I tried it but I have the error as below: python split_bam.py -b 1-A5_1.bam -o1 1-A5_1_part1.bam -o2 1-A5_1_part2.bam

File "split_bam.py", line 18 print "ERROR: BAM file does not exist!" ^ SyntaxError: Missing parentheses in call to 'print'. Did you mean print("ERROR: BAM file does not exist!")?

I would appreciate it if you could give me some inputs. My bam file is from pair-end reads alignment.

Regards,

Xiao

ADD REPLY • link 2.8 years ago by xiaoleiusc ▴ 140

0

Entering edit mode

8.2 years ago

Folder40g ▴ 190

Not the best way but a fast one could be grep all names of reads in sam file to a new file. With sort -R option shuffle them and then grep or intersect the n numbers names shuffled in this file created with the bam file.

You first need to convert the bam into sam.

Not sure if I explained myself well enough.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by Folder40g ▴ 190

Ram · Accepted Answer · 2016-02-05

4

Entering edit mode

8.2 years ago

Devon Ryan 104k

Assuming single-end reads, something like this should work:

samtools view -H foo.bam > f1.sam
cp f1.sam f2.sam
samtools view foo.bam | awk '{if(NR%2){print >> "f1.sam"} else {print >> "f2.sam"}}'

I'm sure there's a one line version of that, but this is simple enough.

For paired-end reads, just change NR%2 to NR%4<2. Again, there's likely a bug in there, but it's something to start with.

ADD COMMENT • link updated 5.0 years ago by Ram 43k • written 8.2 years ago by Devon Ryan 104k

0

Entering edit mode

Won't this work?

wc -l file.sam # line_number/4, say we've got 1000
sed -n 1,500p file.sam > first_half.sam
sed -n 501,1000p file.sam > second_half.sam

ADD REPLY • link updated 5.0 years ago by Ram 43k • written 8.2 years ago by venu 7.1k

1

Entering edit mode

"file.sam" would contain a header, which I assume should be kept completely. Aside from that, sure, one could use sed instead. There are many ways to skin this proverbial cat. I'd personally use python with pysam, since then it'd be simple to handle singletons and such.

ADD REPLY • link 8.2 years ago by Devon Ryan 104k

0

Entering edit mode

Yeah, I would've taken care of it if I thought of header. I just wanted to confirm what I was trying to solve this problem was correct. Thank you.

ADD REPLY • link updated 4.3 years ago by Ram 43k • written 8.2 years ago by venu 7.1k

0

Entering edit mode

Does this do a random sample? If I'm understanding how it works, it just alternates putting reads into each file, which if the bam is sorted wouldn't really be random.

ADD REPLY • link 8.2 years ago by colin.kern ★ 1.1k

0

Entering edit mode

You would name sort the BAM file first, the results would then be random enough.

ADD REPLY • link 8.2 years ago by Devon Ryan 104k