Question

Position of reads in a bam file

0

Entering edit mode

6.7 years ago

Gene_MMP8 ▴ 240

Below is my bam file.
GAII05_0002:1:113:7822:3886#0 1187 Chr3 11699950 60 51M = 11700332 433 AAAAAAAATGTAAAACTGCTAAATCTCTCCTCTCTAAAGAACTCGTCCCCG CCCCCCBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCBAAB??@ACBBCCCD PQ:i:21 SM:i:37 UQ:i:0 MQ:i:37 XQ:i:0RG:Z:H100223_GAII05_0002
GAII05_0002:1:40:13457:15230#0 163 Chr3 11699950 60 51M = 11700332 433 AAAAAAAATGTAAAACTGCTAAATCTCTCCTCTCTAAAGAACTCGTCCCCG CCCCCCBBBCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC PQ:i:21 SM:i:37 UQ:i:0 MQ:i:37 XQ:i:0 RG:Z:H100223_GAII05_0002
GAII05_0002:1:109:7632:9781#0 147 Chr3 11699952 60 51M = 11699616 -387 AAAAAATGTAAAACTGCTAAATCTCTCCNCTCTAAAGAACTCGTCCCCGTC CCCCCCACCCCCCCCCCCCCC3;:;7??&AACCCCCCCCCCCCCCCCCCCC PQ:i:33 SM:i:37 UQ:i:0 MQ:i:37 XQ:i:0 RG:Z:H100223_GAII05_0002

I have the following questions:
1. The first two reads have the same values except for the quality scores in ASCII. Does that mean these two are PCR duplicates? Can I say that they are the same reads?
2. In the fourth column, I have the position of the start my read (11699950) , and the next read's start postion is 11699952. Why is that given my first read's length is around 50bp? Shouldn't it be 11699950+50? Can you please explain this numbering system?
3. How are the reads arranged? Can I say that Read 1(Position: 11699950)----------Read 2(Position: 11699952)-----------------Read 3(Position: 11699953) etc.

next-gen alignment • 2.7k views

ADD COMMENT • link updated 6.7 years ago by finswimmer 16k • written 6.7 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

The SAM flags (1187, 163, 147) says your first read is PCR or optical duplicate, but not the other two. If I remember correctly, for paired end reads all but the best quality are marked as duplicated if they have identical 5' positions.

I really can't understand your second question, do you think reads should be spaced at intervals corresponding to the read length? That is not the case: assuming "shotgun" libraries, the DNA is fragmented randomly, so a perfect library would reads starts distributed uniformly across all genomic positions (adapt this to the kind of sample you have, e.g., RNA, ChIPseq, etc).

ADD REPLY • link 6.7 years ago by h.mon 35k

score 1 · Answer 1 · 2017-08-28

1

Entering edit mode

6.7 years ago

finswimmer 16k

Hello,

The first two reads have the same values except for the quality scores in ASCII. Does that mean these two are PCR duplicates? Can I say that they are the same reads?

As they have the same 5' position and there mate/paired read as well (column 8) these are very likely PCR duplicates.

In the fourth column, I have the position of the start my read (11699950) , and the next read's start postion is 11699952. Why is that given my first read's length is around 50bp? Shouldn't it be 11699950+50? Can you please explain this numbering system?

If the library you were sequenced is not amplicon based, the position of the reads and so there overlapp should be random. So if you know the position of one read, you cannot say were the next read in order starts.

How are the reads arranged?

After doing the alignment the reads are unsorted in the bam file. It is recommended to sort this file by position or read name (depending on the analyse you would like to do with this data).

fin swimmer

ADD COMMENT • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

Thanks for answering. we are effectively trying to understand what the original genome looks like. So after aligning (like the bam file given), which is sorted by position, can I say that the read 1 is followed by read 2 followed by read 3 and so on? I am not trying to predict the position of the next read, just ensuring that the next read is what I expect in the original genome sequence. I am sorry if I can't explain the question properly. any suggestion is welcome.

ADD REPLY • link 6.7 years ago by Gene_MMP8 ▴ 240

0

Entering edit mode

Hello,

after sorting by position the reads are sorted, well, by position. But this doesn't help you to understand how the original genome looks like. For that the term you are looking for is VariantCalling. With Variant Calling you get the differences between your sample and the reference sequence you use for the alignment. Beside that you have to check whether the coverage is sufficient.

fin swimmer

ADD REPLY • link 6.7 years ago by finswimmer 16k

0

Entering edit mode

Thanks for making it clear. Will definitely look into variant calling.

ADD REPLY • link 6.7 years ago by Gene_MMP8 ▴ 240