Below is my bam file.
GAII05_0002:1:113:7822:3886#0 1187 Chr3 11699950 60 51M = 11700332 433 AAAAAAAATGTAAAACTGCTAAATCTCTCCTCTCTAAAGAACTCGTCCCCG CCCCCCBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCBAAB??@ACBBCCCD PQ:i:21 SM:i:37 UQ:i:0 MQ:i:37 XQ:i:0 RG:Z:H100223_GAII05_0002
GAII05_0002:1:40:13457:15230#0 163 Chr3 11699950 60 51M = 11700332 433 AAAAAAAATGTAAAACTGCTAAATCTCTCCTCTCTAAAGAACTCGTCCCCG CCCCCCBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC PQ:i:21 SM:i:37 UQ:i:0 MQ:i:37 XQ:i:0 RG:Z:H100223_GAII05_0002
GAII05_0002:1:109:7632:9781#0 147 Chr3 11699952 60 51M = 11699616 -387 AAAAAATGTAAAACTGCTAAATCTCTCCNCTCTAAAGAACTCGTCCCCGTC CCCCCCACCCCCCCCCCCCCC3;:;7??&AACCCCCCCCCCCCCCCCCCCC PQ:i:33 SM:i:37 UQ:i:0 MQ:i:37 XQ:i:0 RG:Z:H100223_GAII05_0002
I have the following questions:
1. The first two reads have the same values except for the quality scores in ASCII. Does that mean these two are PCR duplicates? Can I say that they are the same reads?
2. In the fourth column, I have the start position of my read (11699950), and the next read's start position is 11699952. Why is that, given that my first read is around 50 bp long? Shouldn't it be 11699950+50? Can you please explain this numbering system?
3. How are the reads arranged? Can I say that Read 1(Position: 11699950)----------Read 2(Position: 11699952)-----------------Read 3(Position: 11699953) etc.
The SAM flags (1187, 163, 147) say your first read is a PCR or optical duplicate, but not the other two. If I remember correctly, for paired-end reads all but the best-quality pair are marked as duplicates if they have identical 5' positions.
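To check the flag values above by hand, here is a minimal Python sketch that splits a flag into its bits. The bit meanings follow the SAM specification; the names `SAM_FLAG_BITS` and `decode_flag` are made up for this illustration and are not part of samtools or any library.

```python
# Bit values and their meanings as defined in the SAM specification.
# The short labels are my own wording, chosen for readability.
SAM_FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag):
    """Return the labels of all bits set in a SAM flag (hypothetical helper)."""
    return [name for bit, name in SAM_FLAG_BITS if flag & bit]


# The three flags from the records in the question.
for flag in (1187, 163, 147):
    print(flag, decode_flag(flag))
```

Note that 1187 = 163 + 1024, so the first two records carry the same flag apart from the duplicate bit (0x400).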
I really can't understand your second question. Do you think reads should be spaced at intervals corresponding to the read length? That is not the case: the fourth column is the leftmost mapped position of each read on the reference, not a running offset, so overlapping reads can start only a few bases apart. Assuming "shotgun" libraries, the DNA is fragmented randomly, so a perfect library would have read starts distributed uniformly across all genomic positions (adapt this to the kind of sample you have, e.g., RNA, ChIP-seq, etc.).
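To make the numbering concrete, here is a rough Python sketch that computes where each of the three reads in the question ends on the reference, using only the POS and CIGAR columns. The helpers `reference_end` and `overlap` are hypothetical names written for this post, and the sketch assumes 1-based inclusive coordinates as in SAM.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# Operations that consume reference bases, per the SAM specification.
REF_CONSUMING = set("MDN=X")


def reference_end(pos, cigar):
    """Return the 1-based inclusive end coordinate of an alignment."""
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)
    return pos + ref_len - 1


def overlap(start_a, end_a, start_b, end_b):
    """Return the number of reference bases shared by two intervals."""
    return max(0, min(end_a, end_b) - max(start_a, start_b) + 1)


# POS and CIGAR taken from the three records in the question.
reads = [
    ("read 1", 11699950, "51M"),
    ("read 2", 11699950, "51M"),
    ("read 3", 11699952, "51M"),
]
for name, pos, cigar in reads:
    print(name, pos, reference_end(pos, cigar))

# Reads 1 and 3 start two bases apart and overlap over most of their length.
print(overlap(11699950, reference_end(11699950, "51M"),
              11699952, reference_end(11699952, "51M")))
```

So the first read covers 11699950 to 11700000 and the third covers 11699952 to 11700002: they are stacked on top of each other with a two-base shift, not laid end to end.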