Question: remove a percent of reads from a BAM file
2
gravatar for sacha
2.6 years ago by
sacha1.7k
France
sacha1.7k wrote:

I would like to remove x% of reads from a BAM file at a specific region . For instance : remove 50% of reads in chr4:1-1000000

What's the most efficient approach ?

bam remove • 833 views
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by sacha1.7k
0
gravatar for Zaag
2.6 years ago by
Zaag720
Amsterdam
Zaag720 wrote:

This should work:

 java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -o output.bam \
   -dfrac 0.5 \
   -L chr4:1-1000000

https://software.broadinstitute.org/gatk/documentation/tooldocs/org_broadinstitute_gatk_tools_walkers_readutils_PrintReads.php

ADD COMMENTlink written 2.6 years ago by Zaag720

I prefer something using samtools / bedtools ! But thanks, I will try it

ADD REPLYlink written 2.6 years ago by sacha1.7k

Maybe something like this:

samtools view -sb 0.5 in.bam chr4:1-1000000 > out.bam
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Zaag720

Those commands keeps reads . I want to exclude them. I though to remove all reads in chr4:1-1000000, and then merge with your output. Unfortunally, I didn't find how to remove easily reads in the region chr4:1-1000000

ADD REPLYlink written 2.6 years ago by sacha1.7k
1

This removes all the reads in the specified region (-XL option) to include use L, don't know a samtools equivalent:

 java -jar GenomeAnalysisTK.jar \
   -T PrintReads \
   -R reference.fasta \
   -I input.bam \
   -o output.bam \
   -XL chr4:1-1000000
ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Zaag720

Thanks ! it seems working. But I wonder why the reference is required.

ADD REPLYlink written 2.6 years ago by sacha1.7k
0
gravatar for stam
2.6 years ago by
stam0
stam0 wrote:

Similarly, you can do this with Picard Tools

java -jar ~/tools/picard.jar DownsampleSam I=Input.bam o=Output.bam PROBABILITY=0.5

Or use use samtools view

samtools view -s 0.5 Input.bam > Output.bam

Picardtools has the advantage that it keeps or discards read pairs as a whole. E.g. if one mate is removed/kept, the other is as well.

ADD COMMENTlink written 2.6 years ago by stam0

Seems those commands keeps reads. I want to exclude reads from a region chr4:1-1000000.

ADD REPLYlink written 2.6 years ago by sacha1.7k

Randomly drawing 50% of the reads from a region is the same as randomly discarding 50% of the reads, isn't it?

Both samtools and picard allow you to specify the region as well.

ADD REPLYlink written 2.6 years ago by stam0

Attached a picture of what I want :

https://snag.gy/Hyb4Cu.jpg

ADD REPLYlink written 2.6 years ago by sacha1.7k
0
gravatar for sacha
2.6 years ago by
sacha1.7k
France
sacha1.7k wrote:

To remove reads from a bam in a particular region, I only found the python way using pysam. Here is the pseudo-code

for read in bamfile:
     if read.pos > region_start and (read.pos + read.length) < region_end:
          do_nothing()
     else:
        output.write(read)
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by sacha1.7k

My version of pysam doesn't have read.length - what is it? If you meant len(read.seq), strictly speaking you should parse the CIGAR string as that takes into consideration insertions/deletions/skips/softclipped bases/etc. Also, obviously, read pairs will be all screwed up as stam says, but perhaps thats OK for your application.

ADD REPLYlink written 2.6 years ago by John12k
1

I m using : read.query_length

ADD REPLYlink written 2.6 years ago by sacha1.7k
1

Yeah, I mean, you can do that by all means, but it won't be as accurate as parsing the CIGAR strings. To be honest, for your tests it'll probably be fine. But this is what i'm talking about:

>>> b
<pysam.calignedsegment.AlignedSegment object at 0x109a18f58>
>>> print b
SOLEXA1_0001:4:87:13822:5416#0  16  0   3299480 0   8M16308N28M -1  -1  36  GAAACTAACACTCTCTCTCTCTCTCTCTCTCTCTCT    array('B', [34, 34, 34, 34, 34, 34, 34, 34, 34, 33, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34, 34])    [('NM', 2), ('XS', '-'), ('NH', 5), ('CC', 'chr10'), ('CP', 85523284)]
>>> print b.query_length
36

So this is a read where, according to the CIGAR, the first 8bp map, then theres a 16,308bp gap, then 28bp map. To get the end position you want the POS + 8 + 16308 + 28, and not POS + 36. However, since reads like this can be rare, you might want to just add code that checks if the read has a weird CIGAR and ask you manually to accept/reject the read. Might be faster in the long run.

I also don't know how pysam handles soft-clipped reads (and for the record, this is exactly why i dislike pysam - it causes more ambiguity than it gives back in abstraction). Perhaps query_length, whatever that is, takes into account soft-clipped reads. Perhaps not. Who knows. But if it does not and you have soft-clipped adapters, you'll have to parse CIGAR strings for sure.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by John12k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 786 users visited in the last hour