Question: How to efficiently remove a list of reads from BAM file?
3.2 years ago, Tao wrote:

Hi Guys,

I have a BAM file and a large list of read names. I want to remove the reads in that list from the BAM file. I could convert the BAM to SAM, remove the unwanted reads with a Python script, and then convert the SAM back to BAM. Is there a more efficient way to do this, that is, one that is faster, easier, and more memory-efficient?

Any advice is appreciated!

Tao

rna-seq sam samtools bam • 5.3k views
modified 3.2 years ago by Brian Bushnell • written 3.2 years ago by Tao
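
A minimal sketch of a streaming version of the SAM-plus-Python approach described in the question: rather than writing an intermediate SAM file, the script reads SAM text on stdin and writes the kept lines to stdout, so it can sit between two samtools view calls. The function names (load_names, filter_sam), the script name and the file names are illustrative assumptions, and the read list is assumed to hold one read name per line.

```python
import sys


def load_names(path):
    """Read one read name per line into a set (blank lines are skipped)."""
    with open(path) as handle:
        return {line.strip() for line in handle if line.strip()}


def filter_sam(lines, unwanted):
    """Yield header lines and every alignment whose QNAME is not in `unwanted`."""
    for line in lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        # QNAME is the first tab-separated column of a SAM record.
        qname = line.split("\t", 1)[0]
        if qname not in unwanted:
            yield line


# Only runs as a command-line filter when given exactly one argument:
# the path to the read-name list (hypothetical usage, see below).
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.stdout.writelines(filter_sam(sys.stdin, load_names(sys.argv[1])))
```

Assuming the script is saved as filter_reads.py, it could be used as: samtools view -h in.bam | python filter_reads.py names.txt | samtools view -b -o out.bam -
The whole name list is held in memory as a set, so memory use grows with the size of the list, while each record costs one set lookup.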
3.2 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

picard FilterSamReads : http://broadinstitute.github.io/picard/command-line-overview.html#FilterSamReads

 

READ_LIST_FILE (File)    Read List File containing reads that will be included or excluded from the OUTPUT SAM or BAM file. Default value: null.
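
For reference, a hedged sketch of how such a FilterSamReads run might be launched from Python. The helper names and the picard.jar location are assumptions for illustration; INPUT, OUTPUT, READ_LIST_FILE and FILTER=excludeReadList follow Picard's documented key=value syntax, but check the documentation for the Picard version you have installed.

```python
import subprocess


def picard_filter_command(bam_in, bam_out, read_list, picard_jar="picard.jar"):
    """Build the argument list for a FilterSamReads run that drops listed reads.

    `picard_jar` is an assumed path; point it at your own Picard jar.
    """
    return [
        "java", "-jar", picard_jar, "FilterSamReads",
        f"INPUT={bam_in}",
        f"OUTPUT={bam_out}",
        f"READ_LIST_FILE={read_list}",
        # excludeReadList keeps everything except the reads named in the list.
        "FILTER=excludeReadList",
    ]


def run_picard_filter(bam_in, bam_out, read_list, picard_jar="picard.jar"):
    """Run Picard; raises CalledProcessError if it exits with a non-zero status."""
    command = picard_filter_command(bam_in, bam_out, read_list, picard_jar)
    subprocess.run(command, check=True)
```

Calling run_picard_filter("mapped.bam", "filtered.bam", "names.txt") would then write a BAM without the listed reads, provided Java and the jar are available.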

Exciting! That's what I'm looking for!

Many Thanks! Pierre.

 

written 3.2 years ago by Tao
3.2 years ago, Vivek (Denmark) wrote:

samtools view -h sample1.bam | grep -vf read_ids_to_remove.txt | samtools view -bS -o sample1_filter.bam -

http://genometoolbox.blogspot.com/2013/06/remove-list-of-reads-from-bam-file.html


Hi Vivek,

Thanks for your nice reply. I also found this method, but unfortunately it is rather inefficient. I think that is because grep searches the whole line: for each read ID, grep has to check that no part of the SAM record contains that ID (pattern), so it gets much slower with a big list of unwanted reads. In my script, I only need to check that the first column (the read ID) of each record is not in the list. Still, your method is a good choice for a small read list. Thanks!

Tao

written 3.2 years ago by Tao
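
The point in the reply above about checking only the first column can be sketched as follows. Whole-line matching also has a correctness side effect: a list entry such as read1 matches inside read10. The function names and toy records are made up for illustration, and the substring version only approximates what grep does (without -F, grep treats each list entry as a regular expression, and its internal matching strategy differs from this loop).

```python
def drop_by_substring(lines, names):
    """Approximate `grep -v -f`: drop a line if any name occurs anywhere in it."""
    return [line for line in lines if not any(name in line for name in names)]


def drop_by_qname(lines, names):
    """Drop a line only if its first tab-separated column equals a listed name."""
    unwanted = set(names)
    return [line for line in lines if line.split("\t", 1)[0] not in unwanted]


# Toy SAM-like records (only the first few columns, for illustration).
records = [
    "read1\t0\tchr1\t100",
    "read10\t0\tchr1\t250",
    "read2\t0\tchr2\t75",
]

print(drop_by_substring(records, ["read1"]))  # read10 is dropped as well
print(drop_by_qname(records, ["read1"]))      # only read1 is dropped
```

In this sketch the substring version tests every name against every line, while the first-column version does a single set lookup per line.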

This can be made faster by adding the -@ option to samtools view. On a multicore machine, it lets samtools use additional threads (implemented in newer versions).

written 3.2 years ago by Amitm

That's a good idea!

written 3.2 years ago by Tao
3.2 years ago, Brian Bushnell (Walnut Creek, USA) wrote:

The fastest and easiest solution is probably to use BBMap + samtools:

filterbyname.sh in=mapped.bam out=filtered.bam names=names.txt include=false

Samtools needs to be in the path.  The memory usage depends on the number of names; the speed doesn't (well, not much).
