Hi, I have a large paired-end dataset in BAM format and a list of read IDs, each of which belongs to one mate of a pair. I want to extract the corresponding second mates from the whole dataset. Could you suggest an efficient way to do this, for example with Bio-SamTools or something similar? It needs to be memory- and time-efficient. Thanks!
The following C++ program should print only the reads whose names are listed in your file (I've added it to my variation toolkit, http://code.google.com/p/variationtoolkit ):
g++ -O3 -Wall -I path/to/samtool -L path/to/samtool bamgrepreads.cpp -lbam -lz
./a.out -R file_containing_the_reads_name.txt (stdin|bam1 bam2...)
Options:
  -f INT   required flag
  -F INT   filtering flag
  -R FILE  reads file
  -e       only one match per name (goes faster)
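To make the idea concrete: both mates of a pair share the same read name in a BAM file, so selecting "the other mate" comes down to matching on the name and then on the FLAG bits (0x40 marks the first mate, 0x80 the second). Below is a minimal Python sketch of that logic working on SAM text. It is not the code of bamgrepreads; the function name `filter_sam` and its `required`/`excluded` parameters are my own, and I am assuming they behave like the -f/-F options above.

```python
def filter_sam(sam_lines, names, required=0, excluded=0):
    """Yield SAM header lines plus the records whose read name is in `names`,
    whose FLAG has every bit of `required` set and no bit of `excluded` set.

    `names` should be a set so that each lookup is O(1); memory use is then
    proportional to the ID list, not to the size of the BAM.
    """
    for line in sam_lines:
        if line.startswith("@"):          # header lines are passed through
            yield line
            continue
        fields = line.split("\t", 2)      # only QNAME and FLAG are needed
        if len(fields) < 3:
            continue                      # skip malformed or empty lines
        qname, flag = fields[0], int(fields[1])
        if (qname in names
                and (flag & required) == required
                and (flag & excluded) == 0):
            yield line


# Tiny in-memory demo (made-up records): keep the second mates (0x80) of r1.
sam = [
    "@HD\tVN:1.6\tSO:unsorted\n",
    "r1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\n",    # first mate
    "r1\t147\tchr1\t200\t60\t4M\t=\t100\t-104\tACGT\tIIII\n",  # second mate
    "r2\t83\tchr1\t400\t60\t4M\t=\t300\t-104\tACGT\tIIII\n",   # first mate
    "r2\t163\tchr1\t300\t60\t4M\t=\t400\t104\tACGT\tIIII\n",   # second mate
]
for record in filter_sam(sam, {"r1"}, required=0x80):
    print(record, end="")
```

On a real file you could feed it `sys.stdin` with something like `samtools view -h in.bam` piped in, and pipe the result back through `samtools view -b -` to get a BAM again. Use `required=0x40` instead if your IDs come from the second mates.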
I just ran into the same need, and I solved it using picard-tools:
java -jar FilterSamReads.jar INPUT=input.sam FILTER=includeReadList READ_LIST_FILE=reads_list.txt OUTPUT=selected_polym.sam
You can also take a look at this post: http://sourceforge.net/p/samtools/mailman/message/31724848/