Hello, I have 40,637 short sequences (probes) in a fastq file named "seq.fq".
- First, I mapped them against a reference genome (hg19) using BWA ("bwa aln ...").
- Then, I converted the alignments from suffix-array coordinates into chromosomal coordinates ("bwa samse ...") and wrote the results to a SAM file named "seq_aln.sam".
Finally, I counted the number of occurrences for each flag:
$ grep -v "^@" seq_aln.sam | awk -F"\t" 'BEGIN{print "flag\toccurrences"} {a[$2]++} END{for(i in a)print i"\t"a[i]}'
flag	occurrences
4	3083
0	19039
16	18515
According to this page, the "4" flag means that the short sequence doesn't map to the reference genome, and the "16" flag means that the short sequence maps to the reverse strand of the reference genome.
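As far as I understand, the flag is a bit field, so each value can be checked bit by bit. Here is a minimal sketch of how I would list the bits set in a given flag. The function name "decode_flag" is my own (it is not part of BWA or samtools), and I am assuming the bit values from the SAM specification: 0x1 for paired, 0x4 for unmapped, 0x10 for reverse strand. Only these three bits are checked.

```shell
# Hypothetical helper (not part of BWA or samtools): list which SAM flag bits are set.
# Assumed bit values, taken from the SAM specification:
#   0x1 (1) = read is paired, 0x4 (4) = read is unmapped, 0x10 (16) = reverse strand.
decode_flag() {
    flag=$1
    out=""
    if [ $(( flag & 1 )) -ne 0 ]; then out="$out paired"; fi
    if [ $(( flag & 4 )) -ne 0 ]; then out="$out unmapped"; fi
    if [ $(( flag & 16 )) -ne 0 ]; then out="$out reverse"; fi
    # None of the three checked bits is set.
    if [ -z "$out" ]; then out=" none"; fi
    echo "$flag:$out"
}

# Decode the three flag values found in seq_aln.sam.
for f in 0 4 16; do
    decode_flag "$f"
done
```

For my three flag values this prints "0: none", "4: unmapped" and "16: reverse".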
But, what does the "0" flag mean? According to this forum page, it means "the read is not paired and mapped, forward strand", which is unclear to me... Does it mean "it is not paired but it maps on forward strand"? Or "it is neither paired nor maps on forward strand"? Or "it is neither paired nor maps on any strand"?
In the end, does all this mean that I can work with only 18,515 short sequences out of 40,637?
Thanks for your help!