Question

Extracting Subsets Of Reads From A Bam File

8

Entering edit mode

11.8 years ago

ijon.tichy ▴ 80

Hi, I have a large paired-end dataset in the BAM format and a list of read IDs which belong to a single mate of a pair. What I want to do is to extract their second mates from the whole dataset. Could you please advise me some efficient ways to do this like using, let's say, Bio-SamTools or something like that? Something memory- and time-efficient. Thanks!

bam samtools sequencing • 19k views

ADD COMMENT • link updated 19 months ago by Ram 43k • written 11.8 years ago by ijon.tichy ▴ 80

0

Entering edit mode

how many reads-IDs do you have, does it fit in memory ?

ADD REPLY • link 11.8 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Just now it's only about 200. But for other datasets it could be much more, 10^3-10^5.

ADD REPLY • link 11.8 years ago by ijon.tichy ▴ 80

5

Entering edit mode

9.4 years ago

Fabio Marroni ★ 3.0k

I just ran in the same need, and I solved it using picard-tools:

java -jar FilterSamReads.jar INPUT=input.sam FILTER=includeReadList READ_LIST_FILE=reads_list.txt OUTPUT=selected_polym.sam

You can also give a look to this post.

ADD COMMENT • link updated 4.5 years ago by Ram 43k • written 9.4 years ago by Fabio Marroni ★ 3.0k

1

Entering edit mode

11.8 years ago

Geparada ★ 1.5k

If you're a python programmer, take a look to Pysam. There is a function called "mate" that can do what you're looking for.

ADD COMMENT • link 11.8 years ago by Geparada ★ 1.5k

0

Entering edit mode

11.8 years ago

swbarnes2 14k

samtools view | grep -f

would work, though I can't vouch for how fast it will be.

It might be faster to grep out those reads from the fastq, and just realign them.

ADD COMMENT • link updated 4.5 years ago by Ram 43k • written 11.8 years ago by swbarnes2 14k

0

Entering edit mode

That's exactly what I want to avoid. My datasets are very large and 'grep' takes a lot of time.

ADD REPLY • link 11.8 years ago by ijon.tichy ▴ 80

0

Entering edit mode

3.0 years ago

jueritation • 0

Same as previous answers but another option to run Picard script with GATK. It works for me.

gatk FilterSamReads \
-I input.bam\
-O output.bam \
--READ_LIST_FILE read_names.txt \
--FILTER includeReadList

ADD COMMENT • link 3.0 years ago by jueritation • 0

Ram · Accepted Answer · 2012-07-03

8

Entering edit mode

11.8 years ago

Pierre Lindenbaum 161k

2018: use picard https://broadinstitute.github.io/picard/command-line-overview.html#FilterSamReads

The following C++ code should filters only print the reads contained in your file (I've added to my variation toolkit http://code.google.com/p/variationtoolkit )

http://code.google.com/p/variationtoolkit/source/browse/trunk/src/bamgrepreads.cpp

Compilation:

g++ -O3 -Wall -I  path/to/samtool -L path/to/samtool bamgrepreads.cpp -lbam -lz

Usage:

./a.out -R file_containing_the_reads_name.txt (stdin|bam1 bam2...)

Options:

 -f INT   required flag
 -F INT   filtering flag
 -R FILE reads file
 -e  only one match per name (goes faster)

ADD COMMENT • link 6.1 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Thank you! It works great!

ADD REPLY • link 11.8 years ago by ijon.tichy ▴ 80

0

Entering edit mode

Does this take advantage of the queryname sorting order of a .bam file, or does it simply search through all the reads? Thanks!

ADD REPLY • link 11.0 years ago by Isaac Joseph ▴ 150

1

Entering edit mode

no, it searches 'all' the reads in the bam.

ADD REPLY • link 11.0 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

Hi, I am trying to use this script, but somehow I can not compile it. seems the samtools I tried is not suitable for this script. Any suggestion that may be helpful?

Here are two error messages of my efforts.

shaoy@Darwin ~/rd/script $ g++ -O3 -Wall -I ~/rd/software/samtools-0.1.19 -L ~/rd/software/samtools-0.1.19 bamgrepreads.cpp -lbam -lz In file included from /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/../../../../include/c++/4.7.2/ext/hash_set:61:0, from bamgrepreads.cpp:15: /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/../../../../include/c++/4.7.2/backward/backward_warning.h:33:2: warning: #warning This file includes at least one deprecated or antiquated header which may be removed without further notice at a future date. Please use a non-deprecated interface with equivalent functionality instead. For a listing of replacement headers and interfaces, consult the file backward_warning.h. To disable this warning use -Wno-deprecated. [-Wcpp] /home/shaoy/rd/software/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt': /home/shaoy/rd/software/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create' /home/shaoy/rd/software/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy': /home/shaoy/rd/software/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join' collect2: error: ld returned 1 exit status

shaoy@Darwin ~/rd/script $ g++ -O3 -Wall -I ~/rd/software/samtools-1.0 -L ~/rd/software/samtools-1.0 bamgrepreads.cpp -lbam -lz In file included from /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/../../../../include/c++/4.7.2/ext/hash_set:61:0, from bamgrepreads.cpp:15: /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/../../../../include/c++/4.7.2/backward/backward_warning.h:33:2: warning: #warning This file includes at least one deprecated or antiquated header which may be removed without further notice at a future date. Please use a non-deprecated interface with equivalent functionality instead. For a listing of replacement headers and interfaces, consult the file backward_warning.h. To disable this warning use -Wno-deprecated. [-Wcpp] In file included from bamgrepreads.cpp:41:0: /home/shaoy/rd/software/samtools-1.0/bam.h:48:25: fatal error: htslib/bgzf.h: No such file or directory compilation terminated.

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.6 years ago by shaoyicharlie ▴ 10

1

Entering edit mode

Whoa that's an old post. This C++ package is now obsolete, you'd better have a look at: https://github.com/lindenb/jvarkit/wiki/SamGrep

ADD REPLY • link updated 19 months ago by Ram 43k • written 8.6 years ago by Pierre Lindenbaum 161k

0

Entering edit mode

and later: picard contains a tool to filter the reads using a list of read names.

ADD REPLY • link 8.2 years ago by Pierre Lindenbaum 161k