Question: Split BAM file into two subsamples
colin.kern (130), United States, wrote 19 months ago:

I want to create two random subsamples from a BAM file. I don't want to just run "samtools view -s 0.5" twice, because I want to put half of the reads into one file and the other half into the other, without replacement. Is there an easy way to do this? I've considered using samtools to output the alignments in SAM format and piping them to the shuf and split commands, but I think I'd need to do some extra things to keep the paired-end reads together. Does anyone have any ideas?

alignment • 1.1k views
modified 19 months ago by Devon Ryan (70k) • written 19 months ago by colin.kern (130)

"since reads are four lines " uh ?

written 19 months ago by Pierre Lindenbaum (98k)

Never mind, I was thinking of FASTQ.

written 19 months ago by colin.kern (130)

Devon Ryan (70k), Freiburg, Germany, wrote 19 months ago:

Assuming single-end reads, something like this should work:

samtools view -H foo.bam > f1.sam
cp f1.sam f2.sam
samtools view foo.bam | awk '{if(NR%2){print >> "f1.sam"} else {print >> "f2.sam"}}'

I'm sure there's a one line version of that, but this is simple enough.

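For paired-end reads, name-sort the BAM first so that mates sit on consecutive lines, then change NR%2 to (NR-1)%4<2. (Plain NR%4<2 would put lines 1 and 4 of every four into f1.sam and so split the pairs.) This assumes exactly two records per read, with no singletons or secondary/supplementary alignments, so there may well still be a bug in there, but it's something to start with.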
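
As a quick check of the modulo arithmetic for the paired-end case, here is a small Python sketch (not part of the original answer) that lists which file each of the first eight alignment lines would go to under NR%4<2 and under the shifted test (NR-1)%4<2. It assumes name-sorted input in which lines 1-2 form the first pair, lines 3-4 the second, and so on; target_file is a hypothetical helper name.

```python
def target_file(nr, expr):
    """Return 1 or 2: the output file the awk test would pick for 1-based line nr."""
    if expr == "NR%4<2":
        return 1 if nr % 4 < 2 else 2
    if expr == "(NR-1)%4<2":
        return 1 if (nr - 1) % 4 < 2 else 2
    raise ValueError(f"unknown expression: {expr}")


# Assumption: lines 1-2 are mates of one pair, lines 3-4 of the next, etc.
for expr in ("NR%4<2", "(NR-1)%4<2"):
    assignment = [target_file(nr, expr) for nr in range(1, 9)]
    print(expr, assignment)
```

This prints `NR%4<2 [1, 2, 2, 1, 1, 2, 2, 1]` and `(NR-1)%4<2 [1, 1, 2, 2, 1, 1, 2, 2]`: only the shifted test keeps each pair of consecutive lines in the same file.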

modified 19 months ago • written 19 months ago by Devon Ryan (70k)

Won't this work?

wc -l file.sam # total line count; say we've got 1000
sed -n 1,500p file.sam > first_half.sam
sed -n 501,1000p file.sam > second_half.sam

written 19 months ago by venu (4.3k)

"file.sam" would contain a header, which I assume should be kept completely. Aside from that, sure, one could use sed instead. There are many ways to skin this proverbial cat. I'd personally use Python with pysam, since then it'd be simple to handle singletons and such.
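
A rough sketch of that grouping idea in Python, offered as an illustration rather than the poster's own code. To stay dependency-free it works on SAM text lines with the standard library instead of pysam (with pysam one would iterate over the BAM directly). It assumes the records are name-sorted, e.g. by samtools sort -n, and split_name_sorted is a hypothetical helper name. Records sharing a QNAME (a pair, a singleton, or a read with extra alignments) always land in the same output, and the header is copied to both.

```python
from itertools import groupby


def split_name_sorted(sam_lines):
    """Split SAM text lines into two lists, keeping each QNAME's records together.

    Assumes alignment records are name-sorted. Header lines (starting with '@')
    are copied to both outputs.
    """
    header = [ln for ln in sam_lines if ln.startswith("@")]
    records = [ln for ln in sam_lines if not ln.startswith("@")]
    out1, out2 = list(header), list(header)
    # QNAME is the first tab-separated column of an alignment record.
    groups = groupby(records, key=lambda ln: ln.split("\t", 1)[0])
    for i, (_, group) in enumerate(groups):
        # Alternate whole groups, not single lines, between the two outputs.
        (out1 if i % 2 == 0 else out2).extend(group)
    return out1, out2


# Toy input: r1 and r3 are pairs, r2 is a singleton (fields truncated for brevity).
example = ["@HD\tVN:1.6", "r1\t99", "r1\t147", "r2\t4", "r3\t99", "r3\t147"]
first, second = split_name_sorted(example)
print(first)
print(second)
```

Alternating groups is only as random as the name order, so this is a sketch of the bookkeeping, not of the sampling itself.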

written 19 months ago by Devon Ryan (70k)

Yeah, I would have taken care of it if I had thought of the header. I just wanted to confirm that my approach to this problem was correct. Thank you.

written 19 months ago by venu (4.3k)

Does this do a random sample? If I'm understanding how it works, it just alternates putting reads into each file, which wouldn't really be random if the BAM is sorted.

written 19 months ago by colin.kern (130)

You would name-sort the BAM file first (e.g. with samtools sort -n); the results would then be random enough.

written 19 months ago by Devon Ryan (70k)

runnerbio88 (70) wrote 19 months ago:

Not the best way, but a fast one: extract all the read names from the SAM file into a new file, shuffle them with sort -R, and then take the first n shuffled names and use grep (or an intersection) to pull the matching records out of the SAM file.

You first need to convert the BAM into SAM.

Not sure if I explained myself well enough.
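
One way to read the shuffle-the-names idea, sketched in Python on SAM text lines (an illustration under assumptions, not the poster's code): collect the unique read names, shuffle them, assign the first half of the names to one output and the rest to the other, and copy the header to both. The function name random_halves and the seed parameter are hypothetical, and the input is assumed to be small enough to hold in memory.

```python
import random


def random_halves(sam_lines, seed=0):
    """Randomly split SAM text lines into two lists by read name, without replacement.

    All records sharing a QNAME go to the same output, so mates stay together.
    Header lines (starting with '@') are copied to both outputs.
    """
    header = [ln for ln in sam_lines if ln.startswith("@")]
    records = [ln for ln in sam_lines if not ln.startswith("@")]
    # Sort the unique names first so the shuffle is reproducible for a given seed.
    names = sorted({ln.split("\t", 1)[0] for ln in records})
    random.Random(seed).shuffle(names)
    first_half = set(names[: len(names) // 2])
    out1, out2 = list(header), list(header)
    for ln in records:
        qname = ln.split("\t", 1)[0]
        (out1 if qname in first_half else out2).append(ln)
    return out1, out2


# Toy input: four read pairs (fields truncated for brevity).
example = ["@HD\tVN:1.6"] + [f"r{i}\t{flag}" for i in range(4) for flag in (77, 141)]
half_a, half_b = random_halves(example, seed=1)
print(len(half_a), len(half_b))
```

Because the split is made over names and not over lines, this works whatever the sort order of the input and keeps the original record order within each output.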

written 19 months ago by runnerbio88 (70)