I have a few related questions on how ChIP-Seq transcription factor binding data from ENCODE, obtained via the UCSC website, should be interpreted. I'm trying to understand what's in the BAM files, which contain alignments of bound regions to the genome. There's one BAM file per TF/tissue type/replicate combination. Below is sample output from samtools view, the BAM viewer that comes with samtools:
dsimcha@pippin:~/samtools-0.1.18$ ./samtools view wgEncodeSydhTfbsGm12878Ikzf1iknuclaStdAlnRep2.bam | head -n 5
chr1:10156:+:4201830 0 chr1 10156 255 32M * 0 0 CTAACCCTAACCCTAACCCTAACCTAACCCTA *
chr1:10244:-:13881795 16 chr1 10244 255 32M * 0 0 CCCTAAACCCTAAACCCTAACCCTAACCCTAA *
chr1:10245:+:12011702 0 chr1 10245 255 32M * 0 0 CCTAAACCCTAAACCCTAACCCTAACCCTAAC *
chr1:10248:+:2689289 0 chr1 10248 255 32M * 0 0 AAACCCTAAACCCTAACCCTAACCCTAACCCT *
chr1:10248:-:20928150 16 chr1 10248 255 32M * 0 0 AAACCCCAAACCCTAACCCTAACCCTAACCCT *
Can each line be interpreted as a potential binding site?
Where is there documentation about what each column means?
Apparently for each BAM file, there are on the order of 10 to 20 million alignment points, i.e. 10-20 million unique coordinates, which I assume correspond to binding sites. Since the human genome is about 3 billion bases, this means that these TFs are binding roughly once every 150 to 300 bases if I'm interpreting it right. I guess the vast majority of the binding sites could be non-functional, but I think it's more likely that I'm misinterpreting the data or that it's mostly false positives. Why are there so many binding sites?
Why do multiple entries exist with the same coordinates? How should I interpret this?
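To make the last two questions concrete, here is a minimal Python sketch of how I'm reading the sample. It assumes that column 3 is the chromosome and column 4 is the alignment start position, which matches the chr:pos:strand pattern in the read names but is my guess, not something I've confirmed in documentation. The names count_by_position and mean_spacing are just illustrative helpers, and the 3 billion base genome size is approximate.

```python
from collections import Counter

# The five records pasted above, fields separated by whitespace.
SAMPLE = """\
chr1:10156:+:4201830 0 chr1 10156 255 32M * 0 0 CTAACCCTAACCCTAACCCTAACCTAACCCTA *
chr1:10244:-:13881795 16 chr1 10244 255 32M * 0 0 CCCTAAACCCTAAACCCTAACCCTAACCCTAA *
chr1:10245:+:12011702 0 chr1 10245 255 32M * 0 0 CCTAAACCCTAAACCCTAACCCTAACCCTAAC *
chr1:10248:+:2689289 0 chr1 10248 255 32M * 0 0 AAACCCTAAACCCTAACCCTAACCCTAACCCT *
chr1:10248:-:20928150 16 chr1 10248 255 32M * 0 0 AAACCCCAAACCCTAACCCTAACCCTAACCCT *
"""


def count_by_position(sam_text):
    """Count records per (chromosome, position).

    Assumption: column 3 is the chromosome and column 4 is the start position.
    """
    counts = Counter()
    for line in sam_text.splitlines():
        # Skip blank lines and header lines (headers start with '@').
        if not line or line.startswith("@"):
            continue
        fields = line.split()
        counts[(fields[2], int(fields[3]))] += 1
    return counts


def mean_spacing(genome_size, n_alignments):
    """Average gap in bases if the alignments were spread evenly over the genome."""
    return genome_size / n_alignments


counts = count_by_position(SAMPLE)
# Coordinates that appear in more than one record.
shared = {coord: n for coord, n in counts.items() if n > 1}
print(shared)

# Spacing implied by 10 and 20 million alignments on a ~3 billion base genome.
for n in (10_000_000, 20_000_000):
    print(n, mean_spacing(3_000_000_000, n))
```

On the sample, the only shared coordinate is chr1:10248, which appears once with flag 0 and once with flag 16, and the two sequences differ by one base.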