Hi everyone,
I found some DMPs from a few samples and now have a list of probes. However, these probes probably come from different parts of the genome, and I am only interested in the ones located in promoters. I was wondering if anyone had a recommendation on how to do this. I have the coordinates of these probes from a manifest provided by Illumina, but I am stuck on how to proceed from here.
Thank you!
Get a reference annotation (GTF) file for your organism and extract the transcription start site (TSS) of every gene. As a proxy for promoters, one might use a window such as the 200 bp upstream of the TSS. Be sure to respect the strand: if a gene is on the top (+) strand, its start coordinate is the actual start site; if it is on the bottom (-) strand, its end coordinate is the actual gene start. From there, simply intersect your promoters with your DMPs. Make sure that the coordinates of the DMPs and of the reference annotation are based on the same genome assembly, e.g. hg19 for both in human.
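To make the strand handling concrete, here is a minimal sketch in plain Python. The gene records, probe IDs, and function names are made up for illustration, and the 200 bp window is just the proxy mentioned above, not a fixed definition of a promoter. Coordinates are assumed to be 1-based and inclusive, as in a GTF.

```python
from typing import Dict, List, NamedTuple, Tuple


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int   # 1-based, inclusive (GTF convention)
    end: int
    strand: str  # "+" or "-"


def promoter_window(gene: Gene, upstream: int = 200) -> Tuple[str, int, int]:
    """Return (chrom, start, end) covering `upstream` bp before the TSS."""
    if gene.strand == "+":
        tss = gene.start                      # top strand: TSS is the start
        return gene.chrom, max(1, tss - upstream), tss
    if gene.strand == "-":
        tss = gene.end                        # bottom strand: TSS is the end
        return gene.chrom, tss, tss + upstream
    raise ValueError(f"unknown strand {gene.strand!r} for {gene.name}")


def probes_in_promoters(
    probes: Dict[str, Tuple[str, int]],
    genes: List[Gene],
    upstream: int = 200,
) -> Dict[str, List[str]]:
    """Map each probe ID to the genes whose promoter window contains it."""
    windows = [(g.name, *promoter_window(g, upstream)) for g in genes]
    hits: Dict[str, List[str]] = {}
    for probe_id, (chrom, pos) in probes.items():
        for name, w_chrom, w_start, w_end in windows:
            if chrom == w_chrom and w_start <= pos <= w_end:
                hits.setdefault(probe_id, []).append(name)
    return hits


# Hypothetical toy data: one gene per strand, four probes.
genes = [
    Gene("GENE_A", "chr1", 1000, 5000, "+"),
    Gene("GENE_B", "chr1", 8000, 12000, "-"),
]
probes = {
    "cg_up_A": ("chr1", 900),      # upstream of GENE_A (+ strand)
    "cg_body_A": ("chr1", 3000),   # inside GENE_A body
    "cg_up_B": ("chr1", 12100),    # upstream of GENE_B (- strand)
    "cg_wrong_B": ("chr1", 7900),  # left of GENE_B, but that is its 3' side
}
promoter_hits = probes_in_promoters(probes, genes)
print(promoter_hits)
```

The nested loop is fine for a short DMP list but scales poorly. For a whole array, an interval tool such as bedtools intersect is probably the better choice.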
Thank you for your advice! It is really helpful!
Would you also know how to categorize the probes by the region they are found in (body, 5'UTR, intron, etc.)? The manifest has a column that explains this, but it is a bit confusing. For example, one region is given as "5'UTR;TSS200;TSS200;Body".
I like to use annotatr to get some basic annotations (http://bioconductor.org/packages/release/bioc/vignettes/annotatr/inst/doc/annotatr-vignette.html#annotationhub-annotations). But yeah, annotations are not exclusive: something in a 5'UTR is also in the gene body, is part of an exon, and is close to the TSS. You'll have to decide what you need. I think the most basic annotation classification is (3'UTR, 5'UTR, exon, intron, intergenic, promoter).
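If you would rather stay with the manifest column, one option is to split a value like "5'UTR;TSS200;TSS200;Body" on the semicolons and collapse it to a single label using a priority order that you choose. The sketch below assumes the entries are semicolon-separated with roughly one entry per annotated transcript (which would explain the repeated TSS200), and that the labels are the ones found in 450k-style manifests. The priority order, the function names, and the "intergenic" fallback for an empty field are my own assumptions, not something defined by Illumina.

```python
from typing import List, Sequence

# Assumed priority: promoter-proximal labels win over gene-body labels.
PRIORITY = ["TSS200", "TSS1500", "5'UTR", "1stExon", "Body", "3'UTR"]


def unique_regions(group_field: str) -> List[str]:
    """Split a manifest value on ';' and drop duplicates, keeping order."""
    seen: List[str] = []
    for region in group_field.split(";"):
        region = region.strip()
        if region and region not in seen:
            seen.append(region)
    return seen


def primary_region(group_field: str, priority: Sequence[str] = PRIORITY) -> str:
    """Collapse a multi-label value to one label using `priority`."""
    regions = unique_regions(group_field)
    if not regions:
        return "intergenic"  # assumption: no gene annotation at all
    for label in priority:
        if label in regions:
            return label
    return regions[0]        # unknown label: keep the first one as given


def is_promoter(
    group_field: str,
    promoter_labels: Sequence[str] = ("TSS200", "TSS1500"),
) -> bool:
    """True if any transcript places the probe in a promoter-like region."""
    return any(r in promoter_labels for r in unique_regions(group_field))


example = "5'UTR;TSS200;TSS200;Body"
print(unique_regions(example))
print(primary_region(example))
print(is_promoter(example))
```

Which label should win is a judgment call. If you only care about promoters, is_promoter alone may be enough and you can skip the priority list entirely.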