Hi. I have two questions that might or might not be related.
I want to extract all, and only, the unmapped reads from a paired-end BAM file.
I type
samtools view -f 4 myFile.bam
According to samtools, -f 4 should 'only output alignments' where the query itself is unmapped.
However, the beginning of the output looks like this:
HWI-ST300_0110:7:21:1528:160264#0 117 chr10 67604 0 * = 67604 0 <SEQ> <QUAL>
HWI-ST300_0110:7:7:10624:6076#0 117 chr10 78098 0 * = 78098 0 <SEQ> <QUAL>
HWI-ST300_0110:7:5:15368:185558#0 69 chr10 78778 0 * = 78778 0 <SEQ> <QUAL>
These do not look unmapped to me. It is the mate that is unmapped, not the query, right? The reads without any position also appear later in the output. With -f 12, I correctly get only those where both the query and its mate are unmapped. Why is this happening?
The second question is about sequence names. The FASTQ files I was given have names like:
@HWI-ST300:130:B08M9ABXX:2:1101:1137:1993 2:Y:0:
with the space where it is shown.
That means both reads of a pair end up with the very same name in the resulting BAM files (HWI-ST300:130:B08M9ABXX:2:1101:1170:1992). Can this be an issue? What does the FASTQ format specify about sequence names? According to this paper, spaces don't seem to be an issue, yet bwa trims the name at the space.
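To illustrate why both mates collapse to one name, here is a minimal shell sketch that cuts a FASTQ header at the first space, which is the behaviour described above for bwa. The function name `read_name` and the mate-1 header (ending in `1:Y:0:`) are assumptions made up for the example; only the mate-2 header comes from the files above. Mates sharing one name is how SAM represents a pair, and the first/second-in-pair FLAG bits (64 and 128) tell them apart.

```shell
#!/bin/sh
# Hypothetical helper: return the read name that remains when everything
# from the first space onward in a FASTQ header line is dropped.
# Only a plain space is handled here, not tabs.
read_name() {
    header=${1#@}            # strip the leading '@' of the FASTQ header
    echo "${header%% *}"     # remove the longest suffix starting at a space
}

# Mate 2 header as shown in the post; the mate 1 header is an assumed counterpart.
mate1='@HWI-ST300:130:B08M9ABXX:2:1101:1137:1993 1:Y:0:'
mate2='@HWI-ST300:130:B08M9ABXX:2:1101:1137:1993 2:Y:0:'

echo "mate 1 name: $(read_name "$mate1")"
echo "mate 2 name: $(read_name "$mate2")"
```

Both lines print the same name, so the pair can only be told apart by the FLAG column, not by the name.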
Thanks
Thanks! Very useful! So they are "officially" unmapped, despite having a mapping position. Confusing. 69 means: read paired, read unmapped, first in pair. Even more confusing. Anyway, I guess I will filter the others with -F 4 to make sure I don't end up with the same sequence in two different files after splitting.
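The flag arithmetic can be checked with a small shell sketch. It decodes the FLAG bits relevant here (bit values as defined in the SAM specification) and mimics how `samtools view -f` and `-F` select records: `-f` keeps a record only if all the given bits are set, `-F` only if none of them is. The function names `decode_flag` and `passes_filter` are hypothetical, and flag 77 is an invented example of a pair where both reads are unmapped; 117 and 69 are the flags from the output above. As far as I understand the SAM specification, the chr10 coordinates on these records are the mate's: it recommends giving an unmapped read with a mapped mate the same RNAME and POS as that mate.

```shell
#!/bin/sh
# Hypothetical helper: list the names of the FLAG bits set in $1.
decode_flag() {
    flag=$1
    out=""
    [ $(( flag & 1 )) -ne 0 ]   && out="$out paired"
    [ $(( flag & 4 )) -ne 0 ]   && out="$out read_unmapped"
    [ $(( flag & 8 )) -ne 0 ]   && out="$out mate_unmapped"
    [ $(( flag & 16 )) -ne 0 ]  && out="$out read_reverse"
    [ $(( flag & 32 )) -ne 0 ]  && out="$out mate_reverse"
    [ $(( flag & 64 )) -ne 0 ]  && out="$out first_in_pair"
    [ $(( flag & 128 )) -ne 0 ] && out="$out second_in_pair"
    echo "${out# }"             # drop the leading space
}

# Hypothetical helper mimicking 'samtools view -f REQ -F EXCL':
# succeeds (exit status 0) when all REQ bits are set and no EXCL bit is set.
passes_filter() {
    flag=$1; req=$2; excl=$3
    [ $(( flag & req )) -eq "$req" ] && [ $(( flag & excl )) -eq 0 ]
}

# 117 and 69 are from the post; 77 is an assumed "both unmapped" example.
for f in 117 69 77; do
    echo "$f: $(decode_flag "$f")"
    if passes_filter "$f" 4 0;  then echo "  kept by -f 4";  fi
    if passes_filter "$f" 12 0; then echo "  kept by -f 12"; fi
    if passes_filter "$f" 0 4;  then echo "  kept by -F 4";  fi
done
```

Under these rules 117 and 69 are kept by -f 4 because bit 4 is set, but not by -f 12 because bit 8 is not. Since -f 4 and -F 4 are complementary, splitting on them puts every record in exactly one of the two files.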
I think there is no standard designation for what 'unmapped' should mean; I usually take it as an indication that the measurement is unreliable.