12 weeks ago
QX
Hi all,
I am working on an ATAC-seq dataset and have annotated the peak regions relative to the TSS, using a window of [-3000, 3000]. I am using this code to annotate the peaks:
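library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
# macsPeaks_GR: GRanges object holding the MACS peak calls
peakAnno <- annotatePeak(macsPeaks_GR,
                         tssRegion = c(-3000, 3000),
                         TxDb = TxDb.Hsapiens.UCSC.hg38.knownGene,
                         annoDb = "org.Hs.eg.db")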
After this, I noticed that some peaks are annotated to individual transcripts of a gene rather than to the gene itself, so not to exon 1.
Does it make biological sense to interpret this kind of peak as a promoter peak?
Can I use these peaks for downstream analysis such as PCA or GO-term analysis? If not, do you know how I can keep only the canonical genes and not their transcripts?
It's not clear to me what you mean. How could you tell it's annotated to the transcript and not the genic region in the genome?
Generally, peaks mapped to genic exons are a low percentage, so should not affect your analysis too much. I would include them unless you have a specific reason to exclude them.
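One way to check how large that percentage is, and to get one entry per gene for GO-term analysis, is sketched below. It is a minimal illustration on a made-up data frame, assuming its columns resemble what `as.data.frame(peakAnno)` returns (`annotation`, `geneId`, `transcriptId`); the peak names, IDs, and annotation strings are all hypothetical. As far as I know, `annotatePeak` also takes a `level` argument (`"transcript"` by default, `"gene"` as the alternative), which may be closer to what the question asks for; it is not exercised here.

```r
# Toy stand-in for as.data.frame(peakAnno); every value here is made up.
peak_df <- data.frame(
  peak = paste0("peak", 1:6),
  annotation = c("Promoter (<=1kb)", "Promoter (1-2kb)",
                 "Exon (tx2a, exon 2 of 5)", "Intron (tx3a, intron 1 of 4)",
                 "Distal Intergenic", "Promoter (<=1kb)"),
  geneId = c("1", "1", "2", "3", "4", "2"),
  transcriptId = c("tx1a", "tx1b", "tx2a", "tx3a", "tx4a", "tx2b"),
  stringsAsFactors = FALSE
)

# Collapse detailed labels to broad feature classes by dropping the
# parenthesised detail, e.g. "Promoter (<=1kb)" becomes "Promoter".
feature <- sub(" \\(.*$", "", peak_df$annotation)

# Percentage of peaks falling in each feature class.
feature_pct <- round(100 * prop.table(table(feature)), 1)
print(feature_pct)

# Peaks assigned to different transcripts of one gene share a geneId,
# so a gene-level list for GO analysis is just the unique gene IDs.
gene_ids <- unique(peak_df$geneId)
print(gene_ids)
```

In this toy table, peaks 1 and 2 hit different transcripts of the same gene and collapse to a single gene ID, so transcript-level annotation does not inflate the gene list.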
As for the promoter regions, ±3 kb is a lax definition of a promoter. You can decrease these distances, but generally it doesn't change the picture too much unless, again, you are asking very specific questions. I tend to go with -2000, +500 as the promoter definition, based on distance from the TSS. I believe a tool like HOMER uses an even stricter window, something like -400, +100.
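To see how much the window choice matters for a given peak set, the promoter calls can be recomputed from the signed distance to the TSS. The sketch below is a minimal base-R illustration, assuming distances are signed with negative values upstream (as in the `distanceToTSS` column of ChIPseeker output); the distances are made-up toy values, and `in_promoter` and the window labels are hypothetical names.

```r
# Toy signed distances to the nearest TSS in bp (negative = upstream).
# These values are made up for illustration.
distance_to_tss <- c(-2800, -1500, -350, 0, 80, 450, 2500, 12000)

# TRUE where a peak lies inside [-upstream, +downstream] around the TSS.
in_promoter <- function(d, upstream, downstream) {
  d >= -upstream & d <= downstream
}

# The three windows mentioned above, as c(upstream, downstream) in bp.
windows <- list(
  lax      = c(3000, 3000),
  moderate = c(2000, 500),
  strict   = c(400, 100)
)

# Number of peaks called "promoter" under each window definition.
promoter_counts <- sapply(windows, function(w) {
  sum(in_promoter(distance_to_tss, w[1], w[2]))
})
print(promoter_counts)
```

Running the same comparison on the real `distanceToTSS` values shows directly whether tightening the window changes the promoter set enough to matter for the downstream analysis.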