Question: How to efficiently remove a list of reads from a BAM file?
2
4.2 years ago by
Tao370
Tao370 wrote:

Hi Guys,

I have a BAM file and a large list of reads. I want to remove the reads in that list from the BAM file. I could convert the BAM to SAM, use a Python script to remove the unwanted reads, and then convert the SAM back to BAM. But I am wondering whether there is a more efficient way (faster, easier, and more memory-efficient) to achieve this?

Any advice is appreciated!

Tao

rna-seq sam samtools bam • 7.1k views
modified 4.2 years ago by Brian Bushnell17k • written 4.2 years ago by Tao370
7
4.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum127k wrote:

picard FilterSamReads

READ_LIST_FILE (File) Read List File containing reads that will be included or excluded from the OUTPUT SAM or BAM file. Default value: null.
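
As a rough sketch of how this option might be used, the function below wraps a FilterSamReads call that drops the listed reads. The function name `remove_reads_picard`, the `PICARD_JAR` variable, and the file names are hypothetical placeholders; the read list is assumed to hold one read name per line, and the exact argument syntax may differ between Picard versions.

```shell
# Hypothetical wrapper; PICARD_JAR and all file names are placeholders.
remove_reads_picard() {
    in_bam=$1      # input BAM
    read_list=$2   # text file, one read name per line (assumed format)
    out_bam=$3     # output BAM without the listed reads
    java -jar "${PICARD_JAR:-picard.jar}" FilterSamReads \
        INPUT="$in_bam" \
        OUTPUT="$out_bam" \
        READ_LIST_FILE="$read_list" \
        FILTER=excludeReadList
}

# Example call (not executed here):
# remove_reads_picard mapped.bam names.txt filtered.bam
```

Using `FILTER=includeReadList` instead would keep only the listed reads.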

modified 3 months ago by RamRS26k • written 4.2 years ago by Pierre Lindenbaum127k

Exciting! That's what I'm looking for!

Many Thanks! Pierre.

modified 3 months ago by RamRS26k • written 4.2 years ago by Tao370
2
4.2 years ago by
Vivek2.4k
Denmark
Vivek2.4k wrote:
samtools view -h sample1.bam | grep -vf read_ids_to_remove.txt | samtools view -bS -o sample1_filter.bam -

http://genometoolbox.blogspot.com/2013/06/remove-list-of-reads-from-bam-file.html
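
A possible refinement of the grep step, offered as a sketch rather than as part of the original answer: `-F` treats the read names as fixed strings instead of regular expressions (usually much faster with long lists), and `-w` requires whole-word matches, so `r1` does not also remove `r10`. The function name `drop_reads_grep` is hypothetical. Note that grep still searches the whole line, not just the read-name column, and that `-w` treats characters such as `.` or `:` as word boundaries.

```shell
# Hypothetical helper: reads SAM text on stdin, writes filtered SAM text to stdout.
# $1 is a file with one read name per line.
drop_reads_grep() {
    # -v: invert match, -w: whole words only, -F: fixed strings, -f: patterns from file
    grep -v -w -F -f "$1"
}

# Example pipeline (not executed here):
# samtools view -h sample1.bam | drop_reads_grep read_ids_to_remove.txt \
#     | samtools view -b -o sample1_filter.bam -
```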

modified 3 months ago by RamRS26k • written 4.2 years ago by Vivek2.4k
3

Hi Vivek,

Thanks for your nice reply. I also found this method, but unfortunately it is rather inefficient. I think this is because grep searches each whole line: for every read ID (pattern), grep has to confirm that nothing anywhere in the record matches it. So it gets much slower when the list of unwanted reads is big. In my script, I only need to check that the first column (the read ID) is not among the unwanted reads. But anyway, your method is a good choice for handling a small read list. Thanks!

Tao
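
The first-column idea described in this reply can be sketched with awk, which loads the unwanted names into a hash table and then looks up only column 1 of each SAM record. This is an illustration, not Tao's actual script; the function name `drop_reads_awk` is hypothetical, and the whole name list is held in memory.

```shell
# Hypothetical helper: reads SAM text on stdin, writes filtered SAM text to stdout.
# $1 is a file with one read name per line.
drop_reads_awk() {
    awk -F '\t' '
        # While reading the name list (first file argument), remember each name.
        FILENAME == ARGV[1] { drop[$1]; next }
        # For the SAM stream, keep header lines and records whose name is not listed.
        /^@/ || !($1 in drop)
    ' "$1" -
}

# Example pipeline (not executed here):
# samtools view -h sample1.bam | drop_reads_awk read_ids_to_remove.txt \
#     | samtools view -b -o sample1_filter.bam -
```

Because each record costs a single hash lookup, the running time should depend only weakly on the length of the list.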

modified 3 months ago by RamRS26k • written 4.2 years ago by Tao370
1

This can be made faster by adding the -@ parameter to samtools view. On a multicore machine, newer versions of samtools can use additional threads to speed things up.
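
For illustration, the pipeline from the answer above with `-@` added to both samtools calls might look like the sketch below. The function name `drop_reads_threaded` and the default of 4 threads are arbitrary assumptions; the extra threads help with BAM compression and decompression, so the grep step in the middle is not sped up by them.

```shell
# Hypothetical wrapper; file names and the default thread count are placeholders.
drop_reads_threaded() {
    in_bam=$1
    read_list=$2
    out_bam=$3
    threads=${4:-4}   # additional samtools threads (assumed default)
    samtools view -@ "$threads" -h "$in_bam" \
        | grep -v -f "$read_list" \
        | samtools view -@ "$threads" -b -o "$out_bam" -
}

# Example call (not executed here):
# drop_reads_threaded sample1.bam read_ids_to_remove.txt sample1_filter.bam 8
```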

modified 3 months ago by RamRS26k • written 4.2 years ago by Amitm1.9k

That's a good idea!

written 4.2 years ago by Tao370
1
4.2 years ago by
Walnut Creek, USA
Brian Bushnell17k wrote:

The fastest and easiest solution is probably to use BBMap + samtools:

filterbyname.sh in=mapped.bam out=filtered.bam names=names.txt include=false

Samtools needs to be in the path. The memory usage depends on the number of names; the speed doesn't (well, not much).

modified 3 months ago by RamRS26k • written 4.2 years ago by Brian Bushnell17k