I have bam files for reads mapped to reference genome. I want to find all reads that correspond to the sequences that are not presented in the reference genome. As far as I understand, these are all reads that are either not mapped to the reference or mapped not concordantly.
I found here http://broadinstitute.github.io/picard/explain-flags.html that my FLAG should be 12 (read unmapped, mate unmapped), could someone please tell me whether this is correct?
all reads that correspond to the sequences that are not presented in the reference genome ... (read unmapped, mate unmapped)
Are not the same. It is possible to have a read pair in which only one read maps to the reference. What would you like to do with those?
If you just want unmapped reads then flag 4 (read unmapped) will extract all the reads not aligned to the reference which sounds like your initial requirement.
NB: If you want to extract all reads pair in which at least one of the reads contains sequence not found in the reference then it's more complicated as not only as you need to extract the read pairs in which only one read in the pair maps, you'd also need to extract partially aligned ("soft clipped") reads and insertions (although, in general, you can't tell whether a small insertion is reference sequence due to the short length) based on the CIGAR string of the alignment.
You are right, I want all read pair in which at least one of the reads is not mapped. Maybe the easiest way is to generate SAM file and filter the reads with flags 83,89,147,163, that correspond to situations when both reads are mapped? Is it correct strategy?
I am looking for long sequences only, that are not presented in genome, so I believe it is ok not to take CIGAR into account.
You are right, I want all read pair in which at least one of the reads is not mapped. Maybe the easiest way is to generate SAM file and filter the reads with flags 83,89,147,163, that correspond to situations when both reads are mapped? Is it correct strategy? I am looking for long sequences only, that are not presented in genome, so I believe it is ok not to take CIGAR into account.
In that case, something like the following should work:
It extracts the reads with only the read mapped, then only then mate mapped, then neither of them mapped and merges then back into a single output.
Edit:
Is a simpler solution. It selects all reads in which both are mapped, discards the output, then writes the remaining reads to unmapped.bam