Long story short, I've spent about 30 hours over the last two weeks trying to learn about sequencing, specifically Illumina sequencing, in order to better understand how adapter contamination can occur and exactly where in a read I should be looking for it.
To frame this question in the context of my knowledge: prior to getting an RA job with a statistics professor this past October, I had never taken biology or science courses, apart from AP Bio in high school about 10 years ago (I'm 25). My job has mostly consisted of maintaining our Linux server and the local Galaxy instance for our lab (my background is CS and math).
About two months ago, my boss started giving me old fastq files to start learning about adapter trimming and data cleaning based on quality scores.
I've watched a lot of lectures and Illumina videos, and read countless tutorials and questions asked here, but I can't find a single response that deals with my specific question.
I started with basic Sanger sequencing and then moved on to Illumina's sequencing by synthesis. I could probably give a lecture on it in my sleep: fragment your library, blunt the ends, phosphorylate them, add the A overhang, ligate adapters, and denature the fragments. The ends of the adapters bind to oligos on the flow cell, polymerase and free bases are added, and they make the complement to the strand. These are denatured, the original strand washes away, and the new one bends over and attaches to the flow cell with its other end (forming the bridge shape). These get denatured and they both spring upright, and this process continues. Finally, sequencing starts from the 3' ends of these sequences, which are floating free (so that synthesis can occur in the 5' to 3' direction). Special fluorescent bases are added, and after each base is added, the sequencer basically takes a snapshot of the flow cell to identify the bases.
Adapter contamination seems to occur when the machine sequences past the fragment and then into the adapter. This can occur because the insert size is too small (smaller than the read length). Thus, since we sequence in the 5' to 3' direction, we will see adapters near the end of the read (because the fastq file gives the reads from 5' to 3').
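To make sure I have this part straight, here is a minimal sketch of the read-through idea as I understand it. The function name `simulate_read` and the insert sequences are made up for illustration, and I'm assuming `AGATCGGAAGAGC` as the adapter start (the prefix commonly quoted for TruSeq-style adapters); swap in whatever your kit actually uses.

```python
# Assumption: the adapter begins with the commonly quoted TruSeq-style prefix.
ADAPTER = "AGATCGGAAGAGC"


def simulate_read(insert: str, adapter: str, read_length: int) -> str:
    """Return the bases a read of `read_length` cycles would report, 5' to 3'.

    Toy model: the sequencer reads the insert first and, if the insert is
    shorter than the read length, continues into the adapter that follows it.
    """
    template = insert + adapter
    return template[:read_length]


# Insert (10 bases) shorter than the read length (16): adapter shows up
# at the 3' end of the read.
short_read = simulate_read("ACGTACGTAC", ADAPTER, 16)
print(short_read)  # ACGTACGTACAGATCG

# Insert (20 bases) longer than the read length: no adapter in the read.
long_read = simulate_read("ACGTACGTACGTACGTACGT", ADAPTER, 16)
print(long_read)  # ACGTACGTACGTACGT
```

This ignores what the machine reports once it runs off the end of the adapter as well; it is only meant to show why the contamination sits at the 3' end.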
So, my question is this:
Using this as my guide (http://nextgen.mgh.harvard.edu/IlluminaChemistry.html), I see that for read 1, I will get the forward sequence (the one with P5 on the 5' end and P7 on the 3' end in the original picture), and, if an adapter were present, it appears that at the end of the read it could pick up some of the complement to the index and possibly the complement to the P7 adapter.
Likewise, on read 2, you get the reverse read, or the fragment with the P5 on the 3' end and P7 on the 5' end. The adapter contamination could occur at the end of this read as the complement to the P5 adapter.
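Here is a toy model of how I picture the two reads, in case it helps someone point out where I go wrong. The flanking sequences `TTTCCA` and `GGGAAA` are hypothetical stand-ins, not real P5/P7 or index sequences, and `paired_reads` is a name I made up. It assumes the simplest picture: each read starts at one edge of the insert and runs toward the far adapter.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def paired_reads(left_flank: str, insert: str, right_flank: str, read_length: int):
    """Toy paired-end model; all inputs are the top strand written 5' to 3'.

    Read 1 reports the top strand starting at the insert and continuing
    into the right-hand flank. Read 2 reports the bottom strand, i.e. the
    reverse complement of the insert followed by the reverse complement of
    the left-hand flank.
    """
    read1 = (insert + right_flank)[:read_length]
    read2 = (reverse_complement(insert) + reverse_complement(left_flank))[:read_length]
    return read1, read2


# Hypothetical flanks and an 8-base insert, sequenced for 12 cycles.
r1, r2 = paired_reads("TTTCCA", "ACGTAAGG", "GGGAAA", 12)
print(r1)  # ACGTAAGGGGGA
print(r2)  # CCTTACGTTGGA
```

In this model both reads carry flank-derived bases only at their 3' ends, and what read 2 carries is the reverse complement of the left-hand flank as drawn on the top strand.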
Is this correct? According to the pictures and the chemistry of DNA synthesis, I don't see how one could ever possibly sequence the adapter itself; you can only get the complement.
So for trimming, I should only look at the 3' ends of my reads for (portions of / possibly the entire) complement to an adapter sequence?
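For concreteness, this is roughly the check I have in mind, as a minimal sketch. It only does exact matching (real trimmers tolerate mismatches), `trim_3prime_adapter` and `min_overlap` are names I made up, and the adapter string is an assumed example rather than something taken from my data.

```python
def trim_3prime_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove an exact full or partial adapter match from the 3' end of a read.

    Scans left to right for the first position where the rest of the read is
    either a prefix of the adapter (partial adapter at the very end) or
    begins with the whole adapter (full adapter followed by extra bases).
    """
    for start in range(len(read) - min_overlap + 1):
        tail = read[start:]
        if adapter.startswith(tail) or tail.startswith(adapter):
            return read[:start]
    return read


adapter = "AGATCGGAAGAGC"  # assumed adapter prefix, for illustration only

# Partial adapter (6 bases) at the 3' end is removed.
print(trim_3prime_adapter("ACGTACGTACAGATCG", adapter))  # ACGTACGTAC

# Read with no adapter is returned unchanged.
print(trim_3prime_adapter("ACGTACGTACGTACGT", adapter))  # ACGTACGTACGTACGT
```

The `min_overlap` threshold is a trade-off: a small value catches short adapter remnants but will also clip genuine reads that happen to end in the first few adapter bases.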
On a different note, what is it that makes only the 3' ends attach to the flow cell surface? If the P5 adapter attached to the 3' end can, then why can't the P5 adapter attached to the 5' end?
I'm at a loss, all help is greatly appreciated.