Question

Trying to understand the details of an ENCODE project and how to work with it for DNA alignment

8.9 years ago

Bilal Akil ▴ 30

I'm still rather new to bioinformatics, so I've been trying to establish my own "process" of sorts for finding, understanding and working with DNA samples, particularly for obtaining raw reads for alignment and then maybe SNP searching afterwards.

I used ENCODE's browsing features to reach the following experiment: https://www.encodeproject.org/experiments/ENCSR000DPV/

I've tried to break down the information displayed there into the key, important parts for me. I've listed some of that beneath the "Understanding" header below. Could you please fill in any gaps or correct me where my understanding is lacking or simply wrong?

Also, beneath the "Questions" header, I've requested information on some specific things that I'm struggling with. If you could provide answers to any of those questions, I'd be very appreciative.

Thank you in advance.

Understanding

It's "ChIP-seq", not "RNA-seq", so for alignment I don't need to use TopHat - Bowtie 2 on its own should work fine.
It's mapped against the hg19 reference genome, so if I wish to obtain similar mapping results I should use that genome too (or rather, should I just always use it, since it's the current standard?).
1. As per this Biostars post, a link to download the hg19 reference in use by the 1000 genomes project (which is obviously the good stuff) can be found here: NCBI FTP.
2. It seems that the reads from the experiment aren't related to any particular chromosome (since I can't find anything mentioning chromosomes), so for alignment I should just download the .fasta for the entire hg19 genome (currently human_g1k_v37.fasta.gz) instead of for any particular chromosome.
3. The full hg19 reference appears to have bunches of lines simply reading "NNNNNNNNNNNN...". I assume this is to act as a separation (or "gap") between chromosomes. This will not be detrimental to alignment.
This experiment does NOT use paired-end reads, so I don't need to worry about --split-files or anything like that.
There are 2 "biological samples". This means that ChIP-seq was performed twice on similar cells to produce two sets of reads for reliability - they weren't just programmatically replicated. I can choose to use either one, or both of them in my work.
1. The raw sequence reads that I'd want to use for alignment are the .fastq files with an "Output type" of "reads" in the table of linked files. They're not already aligned, right?
The experiment has a "Control". This is due to the presence of another attribute: "Antibody". I guess this means an antibody was used on the subject and so the DNA with the antibody is being compared to the control - that without the antibody.
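Point 3 above (the runs of "NNNN..." in the reference) can be checked directly rather than assumed. The following is a minimal sketch, not part of the original post, that lists where runs of N fall within each FASTA record; the helper names `read_fasta` and `n_runs` are hypothetical, and it assumes each sequence fits in memory as a single string.

```python
import re
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from FASTA-formatted lines."""
    name = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header as the record name.
            name = (line[1:].split() or [""])[0]
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def n_runs(sequence: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Return 0-based, end-exclusive (start, end) spans of consecutive N/n."""
    return [
        m.span()
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_len
    ]


# Toy input standing in for a real reference file.
toy = [">chr_toy", "NNNNACGT", "ACNNNNNN", ">chr_other", "ACGT"]
for name, seq in read_fasta(toy):
    print(name, len(seq), n_runs(seq))
```

For the real reference, the same functions could be fed from `gzip.open("human_g1k_v37.fasta.gz", "rt")`. Seeing whether the runs sit inside records or only at their ends shows what the N stretches actually separate.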

Questions

One of the attributes for the experiment, "Target", has a value "CTCF". Does this mean the reads, and this experiment in general, are focussed on a particular gene that's labelled "CTCF"? If not a gene, what is "CTCF"?
The size of the hg19 reference I downloaded, once extracted, is 2.9Gb. However, the total size of the reads from the experiment is far less than that. Since reads also overlap, this means that only a small portion of the human genome is covered by these reads. How can I tell which portion of the genome is covered, possibly so I can download only that section (or those chromosomes) of the genome instead of the whole thing? Is this based on the "Target" attribute?
What, if any, are the relationships between the "Target" and the "Antibody"? Was the antibody somehow applied specifically to the target?
It is my understanding that with the ENCODE project, the provided .fastq reads are very "raw" - i.e. submitted almost directly from the sequencing machines once they've finished, without much modification (see https://www.encodeproject.org/help/file-formats/). Is it recommended that I trim or perform any other operations on these reads before I use them for alignment? What operations are usually necessary?
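The size comparison in the second question can be made more concrete by counting bases instead of comparing file sizes. The sketch below is an illustration only: `fastq_stats` and `nominal_depth` are hypothetical helper names, the FASTQ is assumed to use plain four-line records, the genome size is taken loosely from the 2.9Gb figure above, and the read count and read length in the example are made-up values, not taken from ENCSR000DPV.

```python
from typing import Iterable, Tuple


def fastq_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of reads, total bases), assuming 4-line FASTQ records."""
    n_reads = 0
    total_bases = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            n_reads += 1
            total_bases += len(line.strip())
    return n_reads, total_bases


def nominal_depth(total_bases: int, genome_size: int) -> float:
    """Average depth if the reads were spread evenly over the genome."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


# Toy FASTQ with two reads (8 and 6 bases).
toy = ["@r1", "ACGTACGT", "+", "IIIIIIII", "@r2", "ACGTAC", "+", "IIIIII"]
print(fastq_stats(toy))

# Hypothetical experiment: 20 million reads of 36 bases each.
GENOME_SIZE = 2_900_000_000  # approximate, from the 2.9Gb figure above
print(round(nominal_depth(20_000_000 * 36, GENOME_SIZE), 3))
```

A nominal depth well below 1x only says that the reads cannot tile the whole genome. It does not say where they fall, which can only be seen after alignment.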

ChIP-Seq alignment sequencing genome ENCODE • 2.8k views

Ram · Answer 1 · 2015-05-24

So this question seems to be more about the biological background of ChIP-Seq than analysis problems. Let me try to summarize first generally and then answer the questions specifically.

ChIP-Seq, or chromatin immunoprecipitation followed by deep sequencing, is a method used to find protein-DNA interaction patterns. Usually the first step is crosslinking proteins and DNA in vivo using e.g. formaldehyde, followed by lysis. The DNA is then sheared into small fragments that are suitable for your sequencing method. Using immunoprecipitation (protein enrichment by antibody binding), the complexes of a target protein are enriched from the previously completely random mix of protein-DNA complexes. The protein portion of the complex is then digested, the DNA is purified, linkers are ligated, and the DNA is sequenced.

Thus:

The target is the protein for which interaction patterns are supposed to be found. It is what the antibody will specifically bind. In this case, CTCF is the name of a zinc-finger DNA-binding protein that acts as a transcription factor. This means a CTCF-specific antibody was used after shearing the isolated DNA to enrich for DNA fragments crosslinked to the CTCF protein.
It seems this question stems from a general misunderstanding of ChIP-Seq. ChIP-Seq is not used to assemble genomes. As such, you will of course not have complete coverage of the genome with reads, but rather locally confined "hot spots" at places where CTCF was crosslinked to the DNA (and thus likely recognizes the DNA). Reads will most likely be obtained for all chromosomes, but concentrated at places of specific interaction. This pattern is determined by the target. More generally, the file size of the genome and the file size of the read container will not give you any clue as to how much coverage you have, for multiple reasons (different file formats, read quality, etc.). You can use tools like FastQC to get a better impression of the quality of the reads obtained, but you will only know what your coverage looks like after alignment.
As stated previously the antibody is specific for the target and enriches DNA-target complexes from the initial pool of all cellular DNA-protein complexes. The target-enriched pool of DNA-protein complexes is sequenced.
Usually, deep-sequencing data has to be pre-processed to remove linker sequences and low-quality reads or bases from the library. Box 1 in this paper on ChIP-seq talks a bit about pre-processing and the tools you can use for that purpose. If you are using already published data, it makes sense to look into the paper/metadata for the pre-processing steps that were taken and to reproduce those.
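To give an idea of what removing low-quality bases means in practice, here is a minimal sketch of one simple approach: cut bases off the 3' end of each read while their quality is below a threshold, then drop reads that end up too short. It is an illustration, not the algorithm of any particular trimming tool. It assumes four-line FASTQ records with Phred+33 quality encoding, and the function names and the default thresholds (`min_q=20`, `min_len=20`) are hypothetical choices. Linker removal is not covered.

```python
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality string)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence, quality) from 4-line FASTQ records."""
    stripped = (line.rstrip("\r\n") for line in lines)
    # zip over the same iterator four times groups the lines in fours;
    # an incomplete trailing record is silently dropped.
    for header, seq, _plus, qual in zip(stripped, stripped, stripped, stripped):
        yield header, seq, qual


def trim_3prime(seq: str, qual: str, min_q: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Remove trailing bases whose Phred quality is below min_q."""
    end = min(len(seq), len(qual))
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def preprocess(records: Iterable[Record], min_q: int = 20, min_len: int = 20) -> Iterator[Record]:
    """Trim each read and keep it only if at least min_len bases remain."""
    for header, seq, qual in records:
        seq, qual = trim_3prime(seq, qual, min_q)
        if len(seq) >= min_len:
            yield header, seq, qual


# Toy input: 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
toy = ["@r1", "ACGTACGT", "+", "IIIIII##", "@r2", "ACGT", "+", "####"]
for record in preprocess(read_fastq(toy), min_q=20, min_len=4):
    print(record)
```

In the toy input the first read loses its last two bases and the second read is discarded entirely. For real data a dedicated, well-tested trimming tool is the safer choice, and the settings should follow whatever the original publication or metadata reports.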