Question: Extracting Subsets Of Reads From A Bam File
7
gravatar for ijon.tichy
6.8 years ago by
ijon.tichy70
ijon.tichy70 wrote:

Hi, I have a large paired-end dataset in the BAM format and a list of read IDs which belong to a single mate of a pair. What I want to do is to extract their second mates from the whole dataset. Could you please advise me some efficient ways to do this like using, let's say, Bio-SamTools or something like that? Something memory- and time-efficient. Thanks!

bam samtools sequencing • 12k views
ADD COMMENTlink modified 4.4 years ago by Fabio Marroni2.1k • written 6.8 years ago by ijon.tichy70

how many reads-IDs do you have, does it fit in memory ?

ADD REPLYlink written 6.8 years ago by Pierre Lindenbaum119k

Just now it's only about 200. But for other datasets it could be much more, 10^3-10^5.

ADD REPLYlink written 6.8 years ago by ijon.tichy70
7
gravatar for Pierre Lindenbaum
6.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

2018: use picard https://broadinstitute.github.io/picard/command-line-overview.html#FilterSamReads


The following C++ code should filters only print the reads contained in your file (I've added to my variation toolkit http://code.google.com/p/variationtoolkit )

http://code.google.com/p/variationtoolkit/source/browse/trunk/src/bamgrepreads.cpp

Compilation:

g++ -O3 -Wall -I  path/to/samtool -L path/to/samtool bamgrepreads.cpp -lbam -lz

Usage:

./a.out -R file_containing_the_reads_name.txt (stdin|bam1 bam2...)

Options:

 -f INT   required flag
 -F INT   filtering flag
 -R FILE reads file
 -e  only one match per name (goes faster)
ADD COMMENTlink modified 12 months ago • written 6.8 years ago by Pierre Lindenbaum119k

Thank you! It works great!

ADD REPLYlink written 6.8 years ago by ijon.tichy70

Does this take advantage of the queryname sorting order of a .bam file, or does it simply search through all the reads? Thanks!

ADD REPLYlink written 6.0 years ago by Isaac Joseph90
1

no, it searches 'all' the reads in the bam.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Pierre Lindenbaum119k

Hi, I am trying to use this script, but somehow I can not compile it. seems the samtools I tried is not suitable for this script. Any suggestion that may be helpful?

Here are two error messages of my efforts.


shaoy@Darwin ~/rd/script $ g++ -O3 -Wall -I ~/rd/software/samtools-0.1.19 -L ~/rd/software/samtools-0.1.19 bamgrepreads.cpp -lbam -lz In file included from /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/../../../../include/c++/4.7.2/ext/hash_set:61:0, from bamgrepreads.cpp:15: /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/../../../../include/c++/4.7.2/backward/backward_warning.h:33:2: warning: #warning This file includes at least one deprecated or antiquated header which may be removed without further notice at a future date. Please use a non-deprecated interface with equivalent functionality instead. For a listing of replacement headers and interfaces, consult the file backward_warning.h. To disable this warning use -Wno-deprecated. [-Wcpp] /home/shaoy/rd/software/samtools-0.1.19/libbam.a(bgzf.o): In function `bgzf_mt': /home/shaoy/rd/software/samtools-0.1.19/bgzf.c:445: undefined reference to `pthread_create' /home/shaoy/rd/software/samtools-0.1.19/libbam.a(bgzf.o): In function `mt_destroy': /home/shaoy/rd/software/samtools-0.1.19/bgzf.c:458: undefined reference to `pthread_join' collect2: error: ld returned 1 exit status


shaoy@Darwin ~/rd/script $ g++ -O3 -Wall -I ~/rd/software/samtools-1.0 -L ~/rd/software/samtools-1.0 bamgrepreads.cpp -lbam -lz In file included from /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/../../../../include/c++/4.7.2/ext/hash_set:61:0, from bamgrepreads.cpp:15: /usr/local/lib/gcc/x86_64-unknown-linux-gnu/4.7.2/../../../../include/c++/4.7.2/backward/backward_warning.h:33:2: warning: #warning This file includes at least one deprecated or antiquated header which may be removed without further notice at a future date. Please use a non-deprecated interface with equivalent functionality instead. For a listing of replacement headers and interfaces, consult the file backward_warning.h. To disable this warning use -Wno-deprecated. [-Wcpp] In file included from bamgrepreads.cpp:41:0: /home/shaoy/rd/software/samtools-1.0/bam.h:48:25: fatal error: htslib/bgzf.h: No such file or directory compilation terminated.

ADD REPLYlink written 3.6 years ago by shaoyicharlie0
1

whoa that's an old post. This  C++ package is now obsolete, you'd better have a look at : https://github.com/lindenb/jvarkit/wiki/SamGrep

ADD REPLYlink written 3.6 years ago by Pierre Lindenbaum119k

and later: picard contains a tool to filter the reads using a list of read names.

ADD REPLYlink written 3.2 years ago by Pierre Lindenbaum119k
4
gravatar for Fabio Marroni
4.4 years ago by
Fabio Marroni2.1k
Italy
Fabio Marroni2.1k wrote:

I just ran in the same need, and I solved it using picard-tools:

java -jar FilterSamReads.jar INPUT=input.sam FILTER=includeReadList READ_LIST_FILE=reads_list.txt OUTPUT=selected_polym.sam

You can also give a look to this post: http://sourceforge.net/p/samtools/mailman/message/31724848/

 

ADD COMMENTlink written 4.4 years ago by Fabio Marroni2.1k
1
gravatar for Geparada
6.8 years ago by
Geparada1.4k
Cambridge
Geparada1.4k wrote:

If you're a python programmer, take a look to Pysam. There is a function called "mate" that can do what you're looking for.

ADD COMMENTlink modified 6.8 years ago • written 6.8 years ago by Geparada1.4k
0
gravatar for swbarnes2
6.8 years ago by
swbarnes25.3k
United States
swbarnes25.3k wrote:

samtools view | grep -f would work, though I can't vouch for how fast it will be.

It might be faster to grep out those reads from the fastq, and just realign them.

ADD COMMENTlink written 6.8 years ago by swbarnes25.3k

That's exactly what I want to avoid. My datasets are very large and 'grep' takes a lot of time.

ADD REPLYlink written 6.8 years ago by ijon.tichy70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1202 users visited in the last hour