Question

Interpreting ENCODE ChIP-Seq

11.8 years ago

David ▴ 10

I have a few related questions on how ChIP-Seq transcription factor binding data from ENCODE, obtained via the UCSC website, should be interpreted. I'm trying to understand what's in the BAM files, which contain alignments of bound regions to the genome. There's one BAM file per TF/tissue type/replicate combination. I've pasted a sample output I'm looking at from the BAM viewer that comes with samtools below:

dsimcha@pippin:~/samtools-0.1.18$ ./samtools view wgEncodeSydhTfbsGm12878Ikzf1iknuclaStdAlnRep2.bam  |head -n 5
chr1:10156:+:4201830    0       chr1    10156   255     32M     *       0       0       CTAACCCTAACCCTAACCCTAACCTAACCCTA        *
chr1:10244:-:13881795   16      chr1    10244   255     32M     *       0       0       CCCTAAACCCTAAACCCTAACCCTAACCCTAA        *
chr1:10245:+:12011702   0       chr1    10245   255     32M     *       0       0       CCTAAACCCTAAACCCTAACCCTAACCCTAAC        *
chr1:10248:+:2689289    0       chr1    10248   255     32M     *       0       0       AAACCCTAAACCCTAACCCTAACCCTAACCCT        *
chr1:10248:-:20928150   16      chr1    10248   255     32M     *       0       0       AAACCCCAAACCCTAACCCTAACCCTAACCCT        *

1. Can each line be interpreted as a potential binding site?
2. Where is there documentation about what each column means?
3. Apparently each BAM file contains on the order of 10 to 20 million alignments, i.e. 10-20 million coordinates, which I assume correspond to binding sites. Since the human genome is about 3 billion bases, this would mean that these TFs bind roughly once every 150 to 300 bases, if I'm interpreting it right. I guess the vast majority of the binding sites could be non-functional, but I think it's more likely that I'm misinterpreting the data or that it's mostly false positives. Why are there so many binding sites?
4. Why do multiple entries exist with the same coordinates? How should I interpret this?

chip-seq samtools • 3.4k views

updated 11.8 years ago by Steve Lianoglou 5.2k • written 11.8 years ago by David ▴ 10

score 3 · Answer 1 · 2012-07-04

"1. Can each line be interpreted as a potential binding site?"

No. Each line records where one read from the ChIP-seq experiment aligned to the genome.

"2. Where is there documentation about what each column means?"

In the SAM specification document (it's a PDF).
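As a rough illustration of those columns, here is a minimal Python sketch that splits one of the lines from the question into the 11 mandatory SAM fields. The names `SamRecord`, `parse_sam_line`, and `is_reverse_strand` are made up for this example and are not part of samtools. It assumes the fields are whitespace-separated as pasted above (real SAM output is tab-delimited) and ignores any optional tags.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns, in order (hypothetical helper type)."""
    qname: str   # read name
    flag: int    # bitwise flag; 0x10 (16) means the read maps to the reverse strand
    rname: str   # reference sequence name, e.g. chr1
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality; 255 means "not available"
    cigar: str   # alignment description, e.g. 32M = 32 aligned bases
    rnext: str   # reference name of the mate ('*' if unavailable)
    pnext: int   # position of the mate (0 if unavailable)
    tlen: int    # observed template length (0 if unavailable)
    seq: str     # read sequence
    qual: str    # base qualities ('*' if not stored)


def parse_sam_line(line: str) -> SamRecord:
    # Assumption: splitting on any whitespace is safe for the mandatory
    # fields, since none of them contains spaces.
    f = line.split()[:11]
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def is_reverse_strand(record: SamRecord) -> bool:
    return bool(record.flag & 0x10)


# Second line of the output pasted in the question.
example = ("chr1:10244:-:13881795   16      chr1    10244   255     32M"
           "     *       0       0       CCCTAAACCCTAAACCCTAACCCTAACCCTAA        *")
record = parse_sam_line(example)
print(record.rname, record.pos, "-" if is_reverse_strand(record) else "+")
```

So this line describes one 32-base read aligned to the reverse strand of chr1 starting at position 10244, nothing more.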

"3 ..."

No, these aren't binding sites; they are just reads from the ChIP-seq experiment. Several tools, such as MACS or SPP, take reads like the ones you are looking at and attempt to call regions of "significant binding" from them.

You may also find reads in a ChIP-seq experiment that don't come from the IP at all; this is just noise inherent to the protocol.
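To make the difference between reads and called regions concrete, here is a deliberately naive Python sketch. It is not how MACS or SPP work (real peak callers model the background statistically, typically against a control sample); it only counts read start positions in fixed-size windows and reports windows above a cutoff. The function names, the 200 bp window, and the cutoff of 3 reads are arbitrary assumptions for illustration.

```python
from collections import Counter


def count_reads_per_window(read_starts, window_size=200):
    """Count reads per (chromosome, window start); window_size is arbitrary."""
    counts = Counter()
    for chrom, pos in read_starts:
        window_start = (pos // window_size) * window_size
        counts[(chrom, window_start)] += 1
    return counts


def enriched_windows(counts, min_reads):
    """Return windows with at least min_reads reads (a crude stand-in for a peak)."""
    return sorted(w for w, n in counts.items() if n >= min_reads)


# Start positions of the five reads shown in the question.
reads = [("chr1", 10156), ("chr1", 10244), ("chr1", 10245),
         ("chr1", 10248), ("chr1", 10248)]
counts = count_reads_per_window(reads)
print(enriched_windows(counts, min_reads=3))
```

The point is only that millions of reads collapse into a far smaller number of candidate regions once you ask where reads pile up, rather than treating each read as a site.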

"4. Why do multiple entries exist with the same coordinates? How should I interpret this?"

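Same answer as for q1 and q3: each entry is a separate read, so more than one read can align to the same coordinates (for example, reads from opposite strands, or duplicate reads of the same fragment).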
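If you want to see this in your own data, a small Python sketch like the following groups reads by coordinate and lists the strand of each read at that position. The function name `strands_by_position` is hypothetical, and the code assumes whitespace-separated SAM lines such as those printed by `samtools view`.

```python
from collections import defaultdict


def strands_by_position(sam_lines):
    """Map (chromosome, position) to the strands of all reads aligned there."""
    grouped = defaultdict(list)
    for line in sam_lines:
        fields = line.split()
        flag, chrom, pos = int(fields[1]), fields[2], int(fields[3])
        # Flag bit 0x10 marks a read aligned to the reverse strand.
        grouped[(chrom, pos)].append("-" if flag & 0x10 else "+")
    return dict(grouped)


# The last two lines of the output pasted in the question.
lines = [
    "chr1:10248:+:2689289 0 chr1 10248 255 32M * 0 0 AAACCCTAAACCCTAACCCTAACCCTAACCCT *",
    "chr1:10248:-:20928150 16 chr1 10248 255 32M * 0 0 AAACCCCAAACCCTAACCCTAACCCTAACCCT *",
]
print(strands_by_position(lines))
```

Here the two entries at chr1:10248 are simply two different reads, one on each strand.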

All that having been said, I bet you can download the processed "peaks" from the ENCODE data, but the ENCODE download URLs over at UCSC are hanging on me for now, so unfortunately I can't point you directly to them.